Looking for a Web-Scraping Expert

Cancelled

Job Description

Please add the following line to your application. Applications that do not include this line WILL BE DELETED to help combat spam applicants.

** I have at LEAST 2 years of experience in web scraping, pattern-recognition &/or data-entry **

• You must be able to work a minimum of 4 hours / day, 5 days / week.
• You must be able to create specific lists, and test them to make sure they contain proper strings / footprints.
• You must know what a footprint is, and how to gather footprints.
• You should be comfortable working with multiple languages.

As an example, and to prove that you are capable of this job, please provide a footprint to find Drupal blogs. "Powered by Drupal" will not be accepted. It must be a multi-string footprint, and it can include "inurl:" and other Google operators.

Here is an example of what I'm looking for. This footprint would be used to harvest Phoca guestbooks:

inurl:com_phocaguestbook "( *@* ):" "Content" "Image Verification" "* March 2013"

This is an example of what your Drupal footprint should look like. If you do not include a footprint example for Drupal blogs, you will not be considered for this position.

That said, your daily job will be to gather footprints for a specific CMS/platform & its language variants, then load those footprints into Gscraper and let the program scrape for that day's results to compile a list of unique domains / URLs that run this platform/CMS.

At the beginning of each day, you will need to dedupe the previous day's results to unique domains or unique URLs (depending on the CMS/platforms we scraped the day before). You will need to use regex &/or search/replace in Notepad++ to remove certain URL additives, so that our results point to the first page.
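The posting names Notepad++ for this step. As a rough illustration of the same dedupe logic, here is a minimal Python sketch. The function names are hypothetical, and it assumes each scraped URL includes its scheme (http:// or https://) and that "www." variants count as the same domain.

```python
from urllib.parse import urlsplit


def unique_urls(urls):
    """Drop blank lines and exact duplicate URLs, keeping first-seen order."""
    cleaned = (u.strip() for u in urls)
    return list(dict.fromkeys(u for u in cleaned if u))


def unique_domains(urls):
    """Keep only the first URL seen for each host (assumes URLs carry a scheme)."""
    first_by_host = {}
    for url in unique_urls(urls):
        host = urlsplit(url).netloc.lower()
        # Assumption: treat "www.example.com" and "example.com" as one domain.
        if host.startswith("www."):
            host = host[4:]
        first_by_host.setdefault(host, url)
    return list(first_by_host.values())
```

Which of the two functions applies depends on whether that day's platform needs unique domains or unique URLs.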

For example, with Phoca guestbooks, scraping will often return URLs that have "&limitstart=#RANDOMNUMBER#" appended.

We would need to remove this string using something like \&limitstart\=(.*)$ and replacing the match with nothing, which would then bring us to the front page of the guestbook instead of a random page where we cannot verify links.
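For anyone who prefers a script to Notepad++, the same search/replace can be sketched in Python. This is only an illustration: the function name and the example.com URL are made up, and the pattern is the one above without the unneeded backslashes and capture group.

```python
import re

# Matches "&limitstart=" and everything after it up to the end of the URL.
LIMITSTART = re.compile(r"&limitstart=.*$")


def strip_limitstart(url):
    """Remove a trailing &limitstart=... so the URL points at the first page."""
    return LIMITSTART.sub("", url.strip())


scraped = "http://example.com/index.php?option=com_phocaguestbook&id=1&limitstart=40"
print(strip_limitstart(scraped))
```

Note that, like the original pattern, this also drops any parameters that follow limitstart, and it does not match a URL where limitstart is the first parameter ("?limitstart=").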

You will also need to analyze the CMS/footprints that you create, and then try to find versions of the same footprint in other languages. For example, you would use Google Translate to convert "Guestbook" in English to "livre d'or" in French, then find appropriate footprints for that language version of the CMS.
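One way to keep language variants organized is to build them from a single base footprint. The sketch below is only an assumption about how that could be done: the helper name and the mapping are hypothetical, and the only translated term in it is the French one given above. A real footprint would still need to be checked by hand against actual sites in that language.

```python
# Hypothetical term table; only the French entry comes from the posting.
TERMS = {"en": "Guestbook", "fr": "livre d'or"}


def build_footprints(base, terms):
    """Append each language's quoted term to the shared base footprint."""
    return {lang: f'{base} "{term}"' for lang, term in terms.items()}


for lang, footprint in build_footprints("inurl:com_phocaguestbook", TERMS).items():
    print(lang, footprint)
```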

This will be a daily task. The scraping time of the software does not count towards your hours; only the time you spend manually building and refining lists and footprints will.

If you're lazy, please do not apply. I'm looking for a hard worker and fast learner who wants to gain more hours in the future and possibly move to a full-time position.

The starting pay for this job is $250 / month. Nothing will be paid upfront; you will be paid according to the hours you work through oDesk.

I look forward to working with you. I'm looking to fill this position in the next 24 hours.

---
Skills: English