oDesk Verified Payment Method
Extract Wikipedia data and pictures
Open
Date Posted: November 6, 2009
Planned Start Date: November 6, 2009
Type: Hourly
Main Category: Web Development
Sub Category: Web Programming
Skills: MediaWiki API, MySQL
Estimated Workload: Part-time - 10-30 hrs/week
Estimated Duration: Less than 1 week
Last Buyer Activity: November 17, 2009
Candidates: 12 - average $12.01/hr
Interviews: 6 - average $13.65/hr
 
Preferred Qualifications
English skill: above 5
Feedback score: above 4
Passed test: MySQL 5.0 Test
 
Buyer Facts
Member Since: September 29, 2009
Country: Germany (GMT+01)
City: Karlsruhe
Jobs Posted: 1
Jobs Filled: 0
Jobs Not Yet Filled: 1
Current Team size: 0
Hours billed, last 30 days: 0
Total oDesk Hours: 0.00
Feedback Score: -
 

We need to extract data and pictures from Wikipedia using the MediaWiki API so that we can use them on our site.


The articles to be extracted have to meet the following criteria:

1.- They must have an abstract (i.e. a short definition that appears at the beginning of the article, usually before the table of contents). In the MediaWiki API, the abstract is returned under rvsection=0, which also includes other data.
2.- The abstract must be between 100 and 140 characters long, counting spaces but excluding non-visible characters and the final period of the paragraph.
An example of how the length of an abstract is measured:
"Hygiene practices include bathing, brushing and flossing teeth and washing hands especially before eating"
That abstract has 105 characters, so it qualifies for extraction, although the raw extracted string will also contain extra characters of Wikipedia markup.
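
The two criteria above could be sketched in Python roughly as follows. This is only an illustration, not a required implementation: the endpoint pattern, the function names (section0_url, abstract_length, qualifies) and the choice to treat the 100 and 140 limits as inclusive are assumptions, and the abstract is assumed to have been cleaned of wiki markup already.

```python
import unicodedata
from urllib.parse import urlencode

# Assumed endpoint pattern: one API endpoint per language edition.
API_URL = "https://{lang}.wikipedia.org/w/api.php"


def section0_url(title, lang="en"):
    """Build a query URL that asks for the wikitext of section 0 of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": "0",
        "titles": title,
        "format": "json",
    }
    return API_URL.format(lang=lang) + "?" + urlencode(params)


def abstract_length(abstract):
    """Count characters including spaces, but without non-visible
    characters and without the final period."""
    # Unicode categories starting with "C" are control/format characters.
    visible = "".join(
        ch for ch in abstract
        if not unicodedata.category(ch).startswith("C")
    ).strip()
    if visible.endswith("."):
        visible = visible[:-1]
    return len(visible)


def qualifies(abstract, low=100, high=140):
    """True if the abstract length falls in the range (limits assumed inclusive)."""
    return low <= abstract_length(abstract) <= high
```

Applied to the hygiene example, abstract_length gives 105 whether or not the sentence ends with a period.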
 
From the articles that meet the criteria for extraction, we need the following data and files:

1.- The title.
2.- The language of the article.
3.- The page ID (pageid).
4.- The abstract of the article.
5.- If present, the first picture of the article: both the file and the name Wikipedia gave that picture. It is also found within rvsection=0. The picture can be referenced in different ways in the wikitext, and the script must handle all of them.
6.- If present, the picture tag (caption), called "image_caption".
7.- If existent, links to the same article in other languages, to be found within the tag .
8.- If existent, all the categories of the article, to be found within the tag .
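
As a rough sketch of item 5 and item 6, the first picture and its caption might be pulled out of the section 0 wikitext like this. The three patterns covered here ([[File:...]] links, [[Image:...]] links, and an infobox "image" parameter) are assumptions about which forms occur; they are certainly not all cases, and the function names are hypothetical.

```python
import re

# [[File:Name.jpg|...]] or [[Image:Name.jpg|...]]
LINK_RE = re.compile(r"\[\[\s*(?:File|Image)\s*:\s*([^|\]]+)", re.IGNORECASE)
# Infobox line such as "| image = Name.jpg" (value must not start with "[")
PARAM_RE = re.compile(
    r"^[ \t]*\|[ \t]*image[ \t]*=[ \t]*([^\s|}\[][^|\n}\[]*)",
    re.IGNORECASE | re.MULTILINE,
)
# Infobox line such as "| image_caption = Some text"
CAPTION_RE = re.compile(
    r"^[ \t]*\|[ \t]*image_caption[ \t]*=[ \t]*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)
PREFIX_RE = re.compile(r"^(?:File|Image)\s*:\s*", re.IGNORECASE)


def first_image(wikitext):
    """Return the file name of the first picture found, or None."""
    matches = [m for m in (LINK_RE.search(wikitext), PARAM_RE.search(wikitext)) if m]
    if not matches:
        return None
    # Whichever form appears earliest in the text counts as the first picture.
    first = min(matches, key=lambda m: m.start())
    return PREFIX_RE.sub("", first.group(1).strip())


def image_caption(wikitext):
    """Return the value of the image_caption parameter, or None."""
    match = CAPTION_RE.search(wikitext)
    return match.group(1).strip() if match else None
```

Any wikitext form not matched by these patterns would need its own rule, so real articles should be sampled before relying on this.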

The delivery method will be the following:
For the data, we need a MySQL database with the following tables and columns:
- table articles: id (pageid); lang_code; title; content; image; thumb; image_tag.
- table lang_links: id; article_id; lang_code; title.
- table categories: id; lang_code; name.
- table articles_categories: article_id; category_id.

The relationships between the tables are the following:
articles has many lang_links
articles has and belongs to many categories
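
To make the tables and relationships above concrete, here is a sketch of the schema. The column types, the NOT NULL choices and the keys are assumptions, since the posting only names the columns. Python's built-in sqlite3 is used only as a stand-in so the sketch runs without a server; the delivered database would be MySQL 5.x, where the DDL may need adjusting (for example AUTO_INCREMENT on the id columns).

```python
import sqlite3

# Column types and constraints are assumed; only the names come from the spec.
SCHEMA = """
CREATE TABLE articles (
    id        INTEGER PRIMARY KEY,      -- Wikipedia pageid
    lang_code VARCHAR(16)  NOT NULL,
    title     VARCHAR(255) NOT NULL,
    content   TEXT         NOT NULL,    -- assumed to hold the abstract
    image     VARCHAR(255),
    thumb     VARCHAR(255),
    image_tag VARCHAR(255)              -- assumed to hold image_caption
);
CREATE TABLE lang_links (
    id         INTEGER PRIMARY KEY,
    article_id INTEGER      NOT NULL,
    lang_code  VARCHAR(16)  NOT NULL,
    title      VARCHAR(255) NOT NULL,
    FOREIGN KEY (article_id) REFERENCES articles (id)
);
CREATE TABLE categories (
    id        INTEGER PRIMARY KEY,
    lang_code VARCHAR(16)  NOT NULL,
    name      VARCHAR(255) NOT NULL
);
CREATE TABLE articles_categories (
    article_id  INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    PRIMARY KEY (article_id, category_id),
    FOREIGN KEY (article_id)  REFERENCES articles (id),
    FOREIGN KEY (category_id) REFERENCES categories (id)
);
"""


def create_schema(conn):
    """Create the four tables on an open connection."""
    conn.executescript(SCHEMA)
```

Note that a pageid is unique only within one language edition of Wikipedia, so using it alone as articles.id could collide if several languages are loaded into the same table.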

Instructions for downloading, editing and uploading the pictures to our server:
- You will have to create a batch process to automate downloading the pictures.
- Picture format can be jpg, gif or png (convert if necessary).
- For each picture we need 2 copies:
 - one with the original proportions of the picture but scaled to a width of 640 px, named the same as the original one.
 - one of 66x66 px (first reduce the picture so that the shorter side is 66 px, then crop it around the center so that the other side is also 66 px). It will be named the same as the original but with "_th" added at the end.
- Each pair of pictures has to be stored in a folder named "article_id".
- The login credentials for our server, needed to upload the pictures, will be provided once the project is done.
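
The sizes and names of the two copies can be worked out before any image library is involved. The sketch below only computes the geometry and the output paths; the actual resizing and cropping would be done by whatever imaging tool is chosen. It assumes that "_th" goes before the file extension and that the folder is named after the value of the article id. All function names are hypothetical.

```python
import os


def large_size(width, height, target_width=640):
    """Size of the 640 px wide copy, keeping the original proportions."""
    return target_width, max(1, round(height * target_width / width))


def thumb_plan(width, height, side=66):
    """Return (resize_size, crop_box) for the centered square thumbnail.

    The picture is first scaled so its shorter side equals `side`,
    then cropped around the center to side x side.
    crop_box is (left, top, right, bottom) in the resized picture.
    """
    shorter = min(width, height)
    new_w = max(side, round(width * side / shorter))
    new_h = max(side, round(height * side / shorter))
    left = (new_w - side) // 2
    top = (new_h - side) // 2
    return (new_w, new_h), (left, top, left + side, top + side)


def output_paths(article_id, original_name):
    """Paths of the large copy and the thumbnail inside the article folder."""
    stem, ext = os.path.splitext(original_name)
    folder = str(article_id)
    return (
        os.path.join(folder, original_name),
        os.path.join(folder, stem + "_th" + ext),
    )
```

For a 1280x960 original, for example, the large copy is 640x480, and the thumbnail is made by scaling to 88x66 and cropping 11 px from each side.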

About the Candidates:
- We expect knowledge of MySQL 5.x and the MediaWiki API.
- Fluent written and spoken English is a must. If your native language is Spanish or German, you don't need English.
- You work fast, are very accurate and code cleanly.
- We would like to have a longer-term working relationship with the chosen candidate beyond this project. Further skills that we will need in the future are: CakePHP, XHTML, JavaScript (Prototype, jQuery), XML and CSS.

About us:
We are a young startup based in Berlin, Germany, planning to outsource our whole development team through oDesk. This is our first project. If successful, we will post more jobs in different areas of development.

We will contact promising candidates by email to arrange a telephone/Skype interview.

Candidate List
Title | Name | Initiated By | Date
10yrs as Data Entry Specialist, Internet Researcher, Graphic Designer | Jo-Ahn Mabelo | Provider | November 6, 2009
PHP MYSQL AJAX ECOMMERCE & SCRIPT PROGRAMMING | Simranpreet K. | Provider | November 6, 2009
PHP/LAMP/Joomla/SilverStripe Developer | Jagat P. | Provider | November 6, 2009
Top notch PHP/MySQL/Ajax Wordpress Joomla Drupal Magento Developer | Sarabjeet Dhillon | Provider | November 6, 2009
Php, Java, Ruby, Javscript, HTML, AJAX,XML,XSLT,SOAP, CRM developer | Mohan Rm | Provider | November 6, 2009
█ PROFESSIONAL DEVELOPER/ SCRAPER EXPERT-OFFICE MASTER█ | Alex Cayoja | Provider | November 6, 2009
◆►★PHP/MYsql/HTML/Ajax/Joomla/Javascript/Smarty/Codeigniter★◄◆ | Gurmeet Singh | Provider | November 7, 2009
Drupal, Joomla, Wordpress, PHP, AJAX, XHTML/CSS, 6+ yrs | Deepak A. | Provider | November 7, 2009
PHP, CakePHP and AJAX developer with HTML/CSS proficiency | Ashok G. | Provider | November 9, 2009
PHP/MySQL/CakePHP/jQuery Developer | Mindaugas N. | Provider | November 11, 2009
Php/MySQL, Drupal, Joomla, JavaScript / Ajax, C#-ASP/.Net | Bhaven K. | Provider | November 11, 2009
software development/content writing guru | Ankush A. | Provider | November 17, 2009
