Pywikipedia

Info

en: Pywikipedia (course on the English-language Wikiversity)
meta: Pywikipediabot (guide on Meta)
http://pywikipediabot.sourceforge.net/

Installation

We assume that we have a working Python interpreter. Installing the Pywikipedia package amounts to little more than downloading it and placing it in a suitable directory. We will follow m:Using the python wikipediabot.

Let us assume that we already have a directory ~/wiki for various wiki experiments. We change into it, create a directory there, e.g. RObot, and place the Pywikipedia distribution in it as a further subdirectory. We can download the distribution either as a nightly build or as the latest version via svn (for which the subversion package must be installed):

cd
cd wiki
mkdir RObot
cd RObot

and then either:

wget http://toolserver.org/~valhallasw/pywiki/package/pywikipedia/pywikipedia-nightly.tar.bz2
tar xjf pywikipedia-nightly.tar.bz2

or:

svn checkout http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/ pywikipedia

In both cases a subdirectory pywikipedia containing the whole distribution is created in the RObot directory.

Configuration

Before using the Pywikipedia library for the first time we have to configure it. The simplest approach is to generate a configuration file with the generate_user_files.py script:

cd pywikipedia
python generate_user_files.py

and simply answer the individual questions:

1: Create user_config.py file
2: Create user_fixes.py file
3: The two files
What do you do? 1
1: southernapproach
2: wikiversity
3: gentoo
4: supertux
5: betawiki
6: wikinews
7: krefeldwiki
8: wikibooks
9: wikia
10: wikibond
11: lyricwiki
12: botwiki
13: loveto
14: anarchopedia
15: test
16: i18n
17: freeciv
18: wikisource
19: battlestarwiki
20: incubator
21: memoryalpha
22: species
23: pakanto
24: wiktionary
25: mediawiki
26: mdc
27: wikipedia
28: wikitech
29: wikitravel_shared
30: mozilla
31: commons
32: omegawiki
33: openttd
34: uncyclopedia
35: wikiquote
36: wikitravel
37: wesolve
38: mac_wikicities
39: meta
Select family of sites we are working on (default: wikipedia): 2
The language code of the site we're working on (default: 'en'): cs
Username (cs wikiversity): RObot
'user-config.py' written.

The resulting user-config.py file has about 220 lines, but the most important ones are:

family = 'wikiversity'
mylang = 'cs'
usernames['wikiversity']['cs'] = u'RObot'

If our bot is going to work on several projects, the configuration file can contain several usernames entries, e.g.:

usernames['wikiversity']['cs'] = u'RObot'
usernames['wikiversity']['en'] = u'RObot'
usernames['wikinews']['cs'] = u'RObot'
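
The entries above behave like a dictionary of dictionaries keyed first by family and then by language code. The following is only a minimal stand-alone sketch of that structure, not the library's own code: the `defaultdict` stand-in and the helper `username_for` are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
from collections import defaultdict

# Stand-in for the structure that user-config.py fills in (assumption:
# the real file gets a ready-made `usernames` object from the library).
usernames = defaultdict(dict)
usernames['wikiversity']['cs'] = u'RObot'
usernames['wikiversity']['en'] = u'RObot'
usernames['wikinews']['cs'] = u'RObot'


def username_for(family, lang):
    """Hypothetical helper: return the configured name or None."""
    return usernames.get(family, {}).get(lang)


print(username_for('wikiversity', 'cs'))
print(username_for('wikipedia', 'cs'))
```

A lookup for a family or language without an entry returns None here, which mirrors the idea that the bot has an account only on the projects listed.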

Finally we go back up to the RObot directory, where we will create our first scripts.

cd ..

Downloading a page: RObot-page.get.py

Next we will experiment along the lines of en:Pywikipediabot. If we work our way through to lesson 5, we will have in the RObot directory a finished script such as this one, which downloads the user page Uživatel:RObot from the Czech Wikiversity:

#!/usr/bin/python
# -*- coding: utf-8 -*-

mydir = "./"
pwbdir = mydir + "pywikipedia/"
import sys
sys.path.append(pwbdir)
from wikipedia import *

language = "cs"
family = "wikiversity"
site = getSite(language,family)

pagename = "Uživatel:RObot"
page = Page(site,pagename)
pagetext = page.get()
print pagetext

We set the executable flag on our first bot (i.e. the file containing this script) and run it:

chmod u+x RObot-page.get.py
./RObot-page.get.py

The result is a surprise:

Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "./RObot.py", line 15, in <module>
    page = Page(site,pagename)
  File "./pywikipedia/wikipedia.py", line 347, in __init__
    t = html2unicode(title)
  File "./pywikipedia/wikipedia.py", line 4060, in html2unicode
    result += text
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1: ordinal not in range(128)

On line 15, where the error occurred, we have the statement:

page = Page(site,pagename)

The message says something about encoding, so we suspect the problem lies in the preceding line:

pagename = "Uživatel:RObot"

If we replace this line with the equivalent:

pagename = "Uživatel:RObot"

we get the desired page. An ugly suspicion creeps in that the authors of the Pywikipedia library somehow forgot about the possibility of UTF-8 characters. After a longer search through the bug reports we find that, fortunately, the fault is not in the library; the English-speaking authors of the course at en:Pywikipediabot simply never had to worry about diacritics. We also learn the lesson that in Python 2 every string literal containing non-ASCII (UTF-8) characters must be prefixed with u, so that it becomes a unicode string rather than a byte string. The correct form of the offending line is therefore:

pagename = u"Uživatel:RObot"
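
The error message from the traceback can be reproduced without the library. The sketch below uses Python 3 syntax (an assumption; the scripts on this page are Python 2) and makes explicit what Python 2 did implicitly when a byte string was combined with a unicode string: the UTF-8 bytes of the title were decoded with the ascii codec.

```python
# -*- coding: utf-8 -*-
# The page title as UTF-8 bytes, i.e. what a plain "..." literal
# held in Python 2 when the source file was saved as UTF-8.
title_bytes = u"Uživatel:RObot".encode("utf-8")

# Decoding with the ascii codec fails at the first byte of "ž".
try:
    title_bytes.decode("ascii")
except UnicodeDecodeError as err:
    error = err

print(error.start)                    # index of the offending byte
print(hex(title_bytes[error.start]))  # value of the offending byte

# Decoding with the right codec gives back the original text.
decoded = title_bytes.decode("utf-8")
print(decoded == u"Uživatel:RObot")
```

The letter ž is encoded in UTF-8 as the two bytes 0xc5 0xbe, and it follows the single ASCII letter U, which is why the traceback reports byte 0xc5 in position 1.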

Special pages

We have a first task in mind for our RObot: obtaining a list of all recently changed pages over a given period. It looks as though nothing could be easier than using this line in our script:

pagename = u"Speciální:Poslední změny"

The result is an error:

Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "./RObot-page.get.py", line 19, in <module>
    pagetext = page.get()
  File "./pywikipedia/wikipedia.py", line 654, in get
    raise NoPage('%s is in the Special namespace!' % self.aslink())

Attempts to give the page name in other forms do not help either:

pagename = u"Speciální:Recentchanges"
pagename = u"Speciální:Recentchanges" 
pagename = "Special:Recentchanges"

The same error message keeps telling us that the page name is in the Special namespace, which we already know, but why is that such a problem?

Let us look into the source. We find the file that generates the message:

grep 'Special namespace' pywikipedia/*
pywikipedia/wikipedia.py:            raise NoPage('%s is in the Special namespace!' % self.aslink())
Binary file pywikipedia/wikipedia.pyc matches

The relevant piece of code tells us:

        # NOTE: The following few NoPage exceptions could already be thrown at
        # the Page() constructor. They are raised here instead for convenience,
        # because all scripts are prepared for NoPage exceptions raised by
        # get(), but not for such raised by the constructor.
        # \ufffd represents a badly encoded character, the other characters are
        # disallowed by MediaWiki.
        for illegalChar in u'#<>[]|{}\n\ufffd':
            if illegalChar in self.sectionFreeTitle():
                if verbose:
                    output(u'Illegal character in %s!' % self.aslink())
                raise NoPage('Illegal character in %s!' % self.aslink())
        if self.namespace() == -1:
            raise NoPage('%s is in the Special namespace!' % self.aslink())
        if self.site().isInterwikiLink(self.title()):
            raise NoPage('%s is not a local page on %s!'
                         % (self.aslink(), self.site()))

We suspect that this has something to do with the Page class. Where is it defined?

grep 'class Page' *
archivebot.py:class PageArchiver(object):
BeautifulSoup.py:class PageElement:
followlive.py:class PageHandler:
pagefromfile.py:class PageFromFileRobot:
pagefromfile.py:class PageFromFileReader:
wikipedia.py:class PageNotSaved(Error):
wikipedia.py:class PageNotFound(Error):
wikipedia.py:class Page(object):

So everything is apparently in the largest file, wikipedia.py. We expect to be able to read something about the class there:

class Page(object):
    """Page: A MediaWiki page

    Constructor has two required parameters:
      1) The wiki Site on which the page resides [note that, if the
         title is in the form of an interwiki link, the Page object may
         have a different Site than this]
      2) The title of the page as a unicode string

    Optional parameters:
      insite - the wiki Site where this link was found (to help decode
               interwiki links)
      defaultNamespace - A namespace to use if the link does not contain one

    Methods available:

    title                 : The name of the page, including namespace and
                            section if any
    urlname               : Title, in a form suitable for a URL
    namespace             : The namespace in which the page is found
    titleWithoutNamespace : Title, with the namespace part removed
    section               : The section of the page (the part of the title
                            after '#', if any)
    sectionFreeTitle      : Title, without the section part
    aslink                : Title in the form [[Title]] or [[lang:Title]]
    site                  : The wiki this page is in
    encoding              : The encoding of the page
    isAutoTitle           : Title can be translated using the autoFormat method
    autoFormat            : Auto-format certain dates and other standard
                            format page titles
    isCategory            : True if the page is a category
    isDisambig (*)        : True if the page is a disambiguation page
    isImage               : True if the page is an image
    isRedirectPage (*)    : True if the page is a redirect, false otherwise
    getRedirectTarget (*) : The page the page redirects to
    isTalkPage            : True if the page is in any "talk" namespace
    toggleTalkPage        : Return the talk page (if this is one, return the
                            non-talk page)
    get (*)               : The text of the page
    latestRevision (*)    : The page's current revision id
    userName              : Last user to edit page
    isIpEdit              : True if last editor was unregistered
    editTime              : Timestamp of the last revision to the page
    previousRevision (*)  : The revision id of the previous version
    permalink (*)         : The url of the permalink of the current version
    getOldVersion(id) (*) : The text of a previous version of the page
    getRestrictions       : Returns a protection dictionary
    getVersionHistory     : Load the version history information from wiki
    getVersionHistoryTable: Create a wiki table from the history data
    fullVersionHistory    : Return all past versions including wikitext
    contributingUsers     : Return set of users who have edited page
    exists (*)            : True if the page actually exists, false otherwise
    isEmpty (*)           : True if the page has 4 characters or less content,
                            not counting interwiki and category links
    interwiki (*)         : The interwiki links from the page (list of Pages)
    categories (*)        : The categories the page is in (list of Pages)
    linkedPages (*)       : The normal pages linked from the page (list of
                            Pages)
    imagelinks (*)        : The pictures on the page (list of ImagePages)
    templates (*)         : All templates referenced on the page (list of
                            Pages)
    templatesWithParams(*): All templates on the page, with list of parameters
    getReferences         : List of pages linking to the page
    canBeEdited (*)       : True if page is unprotected or user has edit
                            privileges
    botMayEdit (*)        : True if bot is allowed to edit page
    put(newtext)          : Saves the page
    put_async(newtext)    : Queues the page to be saved asynchronously
    move                  : Move the page to another title
    delete                : Deletes the page (requires being logged in)
    protect               : Protect or unprotect a page (requires sysop status)
    removeImage           : Remove all instances of an image from this page
    replaceImage          : Replace all instances of an image with another
    loadDeletedRevisions  : Load all deleted versions of this page
    getDeletedRevision    : Return a particular deleted revision
    markDeletedRevision   : Mark a version to be undeleted, or not
    undelete              : Undelete past version(s) of the page

    (*) : This loads the page if it has not been loaded before; permalink might
          even reload it if it has been loaded before

    """

Besides a lot of interesting things, this may be the information we were looking for:

    """
    Optional parameters:
      defaultNamespace - A namespace to use if the link does not contain one
    """

This is odd, because I understand from it that if I put the namespace name in front of the page name, I should not need to set a default namespace. But who knows. Not knowing Python, let us experiment with the code a little:

#!/usr/bin/python
# -*- coding: utf-8 -*-

mydir = "./"
pwbdir = mydir + "pywikipedia/"
import sys
sys.path.append(pwbdir)
from wikipedia import *

language = "cs"
family = "wikiversity"
site = getSite(language,family)
#defaultNamespace = u"Speciální"
defaultNamespace = u"Speciální:"
pagename = u"Poslední změny"

page = Page(site,pagename,defaultNamespace)
pagetext = page.get()
print pagetext

The error we get again mentions encoding, although this time it is an AttributeError rather than a decoding error:

Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
  File "./RObot-page.get.py", line 22, in <module>
    page = Page(site,pagename,defaultNamespace)
  File "./pywikipedia/wikipedia.py", line 353, in __init__
    t = url2unicode(t, site = insite, site2 = site)
  File "./pywikipedia/wikipedia.py", line 3959, in url2unicode
    encList = [site.encoding()] + list(site.encodings())
AttributeError: 'unicode' object has no attribute 'encoding'
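
One possible reading of this traceback: the docstring lists insite before defaultNamespace, and the failing call is url2unicode(t, site = insite, ...), so the string passed as the third positional argument may have been taken as insite rather than as defaultNamespace. The sketch below reproduces that kind of mix-up with stand-ins; `FakeSite`, `make_page` and its parameter order and default values are assumptions for illustration, not the library's real signature. It uses Python 3 syntax, so the message names 'str' instead of 'unicode'.

```python
# -*- coding: utf-8 -*-
class FakeSite(object):
    """Stand-in for a Site object; only encoding() is modelled."""

    def encoding(self):
        return "utf-8"


def make_page(site, title, insite=None, defaultNamespace=0):
    """Hypothetical constructor with insite before defaultNamespace."""
    if insite is None:
        insite = site
    # Like the failing line in the traceback, this needs a Site object.
    return (title, insite.encoding(), defaultNamespace)


site = FakeSite()

# Third positional argument lands in `insite`, which is then a string.
try:
    make_page(site, u"Poslední změny", u"Speciální:")
except AttributeError as err:
    positional_error = str(err)
print(positional_error)

# Naming the parameter avoids the mix-up.
ok = make_page(site, u"Poslední změny", defaultNamespace=-1)
print(ok)
```

Even with the parameter named correctly, the source excerpt earlier on this page shows get() raising NoPage whenever namespace() == -1, so this would probably not make the special page readable either.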

(So far I have not found out what to do about this; perhaps someone more experienced will advise us. I am trying to google:

pywikipedia special namespace

but so far without much success. --Kychot 28. 12. 2008, 14:10 (UTC))
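
One possible way around the problem, outside Pywikipedia, is to ask the MediaWiki web API for the recent changes directly instead of reading the special page. The sketch below (Python 3) only builds the request URL and does not contact the server; the function name `recentchanges_url`, the chosen rcprop fields and the example timestamps are illustrative assumptions, while the parameters themselves (action=query, list=recentchanges, rcstart, rcend, rclimit, rcprop) belong to the MediaWiki API.

```python
# -*- coding: utf-8 -*-
from urllib.parse import urlencode

# API endpoint of the Czech Wikiversity.
API_URL = "https://cs.wikiversity.org/w/api.php"


def recentchanges_url(start, end, limit=50):
    """Build a recent-changes query URL.

    Changes are listed newest first by default, so `start` is the
    later timestamp and `end` the earlier one.
    """
    params = [
        ("action", "query"),
        ("list", "recentchanges"),
        ("rcstart", start),
        ("rcend", end),
        ("rclimit", str(limit)),
        ("rcprop", "title|timestamp|user"),
        ("format", "json"),
    ]
    return API_URL + "?" + urlencode(params)


url = recentchanges_url("2008-12-28T14:10:00Z", "2008-12-27T14:10:00Z")
print(url)
```

The resulting URL could then be fetched with any HTTP client and the JSON response parsed to obtain the titles of the changed pages.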