We need to extract data and pictures from Wikipedia using the MediaWiki API to use them in our site.
The articles to be extracted have to meet the following criteria:
1.- They must have an abstract (i.e. a short definition that appears at the beginning of the article, usually before the table of contents). Using the MediaWiki API, the abstract is found under rvsection=0 (which also includes other data).
2.- The abstract must be between 100 and 140 characters long, including spaces and excluding non-visible characters and the final period of the paragraph.
An example of how to measure the length of an abstract:
"Hygiene practices include bathing, brushing and flossing teeth and washing hands especially before eating"
That abstract has 105 characters, so it can be extracted, although the extracted string will include more characters that represent Wikipedia code.
From the articles that meet the criteria for extraction, we need the following data and files:
1.- The title.
2.- The language of the article.
3.- The page id (pageid).
4.- The abstract of the article.
5.- If present, the first picture of the article: both the file and the name Wikipedia gave that picture. It is also found within rvsection=0. The picture can be described in different ways in the code. All cases must be taken into account while programming the script.
6.- If present, the picture tag, called "image_caption".
7.- If present, links to the same article in other languages, to be found within the tag .
8.- If present, all the categories of the article, to be found within the tag .
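The request and the length criterion described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a finished extractor: the endpoint pattern, the function names, and the inclusive 100 to 140 bounds are our assumptions, and the length check expects plain text from which the Wikipedia markup has already been removed (that stripping step is not shown).

```python
from urllib.parse import urlencode

# Assumed endpoint pattern: one API entry point per language edition.
API_URL_TEMPLATE = "https://{lang}.wikipedia.org/w/api.php"


def build_query_url(title: str, lang: str = "en") -> str:
    """Build one query URL for the lead section, language links and categories."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "revisions|langlinks|categories",
        "rvprop": "content",
        "rvsection": "0",   # lead section, where the abstract and first picture live
        "lllimit": "max",
        "cllimit": "max",
    }
    return API_URL_TEMPLATE.format(lang=lang) + "?" + urlencode(params)


def abstract_length(text: str) -> int:
    """Count characters including spaces, without non-visible characters
    and without the final period of the paragraph."""
    visible = "".join(ch for ch in text if ch.isprintable()).strip()
    if visible.endswith("."):
        visible = visible[:-1]
    return len(visible)


def meets_length_criterion(text: str, low: int = 100, high: int = 140) -> bool:
    """Assumption: both bounds are inclusive."""
    return low <= abstract_length(text) <= high
```

Long lists of language links or categories may need follow-up requests using the continuation values the API returns; that is left out here to keep the sketch short.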
The delivery method will be the following:
For the data, we need a MySQL database with the following tables and columns:
- table articles: id (pageid); lang_code; title; content; image; thumb; image_tag.
- table lang_links: id; article_id; lang_code; title.
- table categories: id; lang_code; name.
- table articles_categories: article_id; category_id.
The relationships between the tables are the following:
- articles has many lang_links
- articles has and belongs to many categories
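To make the table layout and relationships concrete, here is a rough sketch of the schema. It is an assumption-laden illustration: column types and lengths are guesses, "content" is assumed to hold the abstract, and Python's built-in sqlite3 module stands in for MySQL so the example runs anywhere. The DDL would need adapting for MySQL 5.x (for example AUTO_INCREMENT on the generated ids and an InnoDB engine for the foreign keys).

```python
import sqlite3

# Assumed column types; the brief only names the tables and columns.
SCHEMA = """
CREATE TABLE articles (
    id        INTEGER PRIMARY KEY,       -- Wikipedia pageid
    lang_code VARCHAR(16)  NOT NULL,
    title     VARCHAR(255) NOT NULL,
    content   TEXT         NOT NULL,     -- assumed to hold the abstract
    image     VARCHAR(255),
    thumb     VARCHAR(255),
    image_tag TEXT
);
CREATE TABLE lang_links (
    id         INTEGER PRIMARY KEY,
    article_id INTEGER      NOT NULL REFERENCES articles(id),
    lang_code  VARCHAR(16)  NOT NULL,
    title      VARCHAR(255) NOT NULL
);
CREATE TABLE categories (
    id        INTEGER PRIMARY KEY,
    lang_code VARCHAR(16)  NOT NULL,
    name      VARCHAR(255) NOT NULL
);
CREATE TABLE articles_categories (
    article_id  INTEGER NOT NULL REFERENCES articles(id),
    category_id INTEGER NOT NULL REFERENCES categories(id),
    PRIMARY KEY (article_id, category_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the four tables in an empty database."""
    conn.executescript(SCHEMA)


def categories_of(conn: sqlite3.Connection, article_id: int) -> list:
    """Follow the many-to-many join table from an article to its category names."""
    rows = conn.execute(
        "SELECT c.name FROM categories c "
        "JOIN articles_categories ac ON ac.category_id = c.id "
        "WHERE ac.article_id = ? ORDER BY c.name",
        (article_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

The join table articles_categories carries the "has and belongs to many" relationship, while lang_links points back to its article through article_id.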
Instructions for downloading, editing and uploading the pictures to our server:
- You will have to create a batch process to automate the downloading of the pictures.
- Picture format can be jpg, gif or png (convert if necessary).
- For each picture we need 2 copies:
  - one with the original proportions of the picture but 640 px wide, named the same as the original one.
  - one of 66x66 px (first reduce the picture so that the smaller side has 66 px and then crop it to leave the center so that the other side is also 66 px). It will be named the same as the original but with "_th" added at the end.
- Each pair of pictures has to be stored in a folder named "article_id".
- The log-in data of our server to upload the pictures will be provided once the project is done.
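The arithmetic behind the two copies can be sketched as below. This is a minimal illustration, not the required batch process: the function names are ours, the actual resizing would be done by whatever image library the script uses, and we assume "_th" goes before the file extension.

```python
import os


def scaled_size(width: int, height: int, target_width: int = 640) -> tuple:
    """Size of the large copy: fixed width, original proportions."""
    return target_width, max(1, round(height * target_width / width))


def thumb_plan(width: int, height: int, side: int = 66) -> tuple:
    """Return (resized_size, crop_box) for the square thumbnail.

    The picture is first reduced so that its smaller side equals `side`,
    then the centre is cropped. crop_box is (left, top, right, bottom).
    """
    scale = side / min(width, height)
    new_w = max(side, round(width * scale))
    new_h = max(side, round(height * scale))
    left = (new_w - side) // 2
    top = (new_h - side) // 2
    return (new_w, new_h), (left, top, left + side, top + side)


def thumb_name(filename: str) -> str:
    """Assumption: "_th" is inserted before the extension."""
    stem, ext = os.path.splitext(filename)
    return stem + "_th" + ext
```

For an 800x600 picture this gives a 640x480 large copy and an 88x66 intermediate image whose central 66x66 region becomes the thumbnail.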
About the Candidates:
- We expect knowledge of MySQL 5.x and the MediaWiki API.
- Fluent written and spoken English is a must. If your mother tongue is Spanish or German you don't need English.
- You work fast, are very accurate and code cleanly.
- We would like to have a longer-term work relationship beyond this project with the chosen candidate. Further qualifications that we will need in the future are: CakePHP, XHTML, JavaScript (Prototype, jQuery), XML and CSS.
About us: We are a young startup based in Berlin, Germany, planning to outsource our whole development team through oDesk. This is our first project. If successful, we will post more jobs in different areas of development.
We will contact interesting candidates by email to coordinate a telephone/Skype interview.