Download Wikipedia pages using the MediaWiki API (in Python)
I have a list of 644 Wikipedia pages that you must process using Python (CSV file attached).
For each page, you must provide me with:
1. an XML file listing all revision ids (a sketch of the API call follows this list)
2. for each revision, download that version of the Wikipedia page and store it as an HTML document
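As a rough illustration of item 1, here is a minimal sketch of pulling one page's revision ids as XML. It assumes the English Wikipedia endpoint and the standard MediaWiki query module; the helper name fetch_revision_xml is mine, not part of this spec:

    import requests
    import xml.etree.ElementTree as ET

    API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint (English Wikipedia)

    def fetch_revision_xml(title):
        """Yield raw XML chunks listing the revision ids of one page."""
        params = {
            "action": "query",
            "prop": "revisions",
            "titles": title,
            "rvprop": "ids|timestamp",
            "rvlimit": "max",   # up to 500 revisions per request for anonymous clients
            "format": "xml",
        }
        while True:
            resp = requests.get(API_URL, params=params)
            resp.raise_for_status()
            yield resp.text
            # The API adds a <continue rvcontinue="..."> element while more
            # revisions remain; follow it to page through the full history.
            cont = ET.fromstring(resp.text).find("continue")
            if cont is None:
                return
            params["rvcontinue"] = cont.get("rvcontinue")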
The final output is organized as follows:
1. a folder for each Wikipedia page (e.g. ./Pope_John_Paul_II/)
2. an XML file (Pope_John_Paul_II.xml) in that folder, containing the list of revision ids (use the following API: http://www.mediawiki.org/wiki/API:)
3. a revisions folder containing an HTML file named after each revision id (e.g. ./Pope_John_Paul_II/revisions/55270)
4. the Python scripts for this entire process, so that I can replicate your data collection (an end-to-end sketch follows this list).
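For item 4, a hedged end-to-end sketch under the same assumptions: it reuses fetch_revision_xml from the sketch above, guesses that the attached CSV (called pages.csv here) has one page title per row, and fetches each old revision as rendered HTML via index.php?oldid=<revid>:

    import csv
    import os
    import requests
    import xml.etree.ElementTree as ET

    INDEX_URL = "https://en.wikipedia.org/w/index.php"  # serves old revisions by oldid

    def save_page_history(title):
        page_dir = os.path.join(".", title)
        rev_dir = os.path.join(page_dir, "revisions")
        os.makedirs(rev_dir, exist_ok=True)

        # Item 2: the XML revision list sits next to the revisions folder.
        # Note: joining raw API responses is a simplification; a real script
        # would merge them into one well-formed XML document.
        chunks = list(fetch_revision_xml(title))
        with open(os.path.join(page_dir, title + ".xml"), "w", encoding="utf-8") as f:
            f.write("\n".join(chunks))

        # Item 3: each revision is saved as <revid>.html, named after its id.
        for chunk in chunks:
            for rev in ET.fromstring(chunk).iter("rev"):
                rev_id = rev.get("revid")
                html = requests.get(INDEX_URL, params={"oldid": rev_id}).text
                with open(os.path.join(rev_dir, rev_id + ".html"), "w",
                          encoding="utf-8") as out:
                    out.write(html)

    if __name__ == "__main__":
        with open("pages.csv", newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                save_page_history(row[0])  # assumes one page title per row

A real run over 644 full page histories should also throttle its requests and send a descriptive User-Agent header, per Wikimedia's API etiquette.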