Java Regular Expression / Web Scraping Expert

Closed - This job posting has been filled and work has been completed.

Job Description

Duration: up to 2 months

Task: Create web crawlers and parsers using our library

Details for Crawler:
---------------------------
[+] You will browse websites with Firefox + Firebug and analyze their structure
[+] You MUST be familiar with regular expressions and Java
[+] You will implement a few Java classes per crawler
[+] These classes will extract information from an HTML page and use it to create further requests
[+] You may need to create a new project folder from a template
[+] You will need to edit an XML control file
[+] We have a library for crawling websites that you must use
[+] We have some documentation. You will still have to figure out a lot of things on your own.
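
To give a feel for the kind of class a crawler task involves, here is a minimal sketch in plain Java that pulls links out of an HTML page with a regular expression and resolves them into absolute URLs for follow-up requests. It does not use our library; the names LinkExtractor and extractLinks and the example URL are hypothetical, chosen only for illustration, and the pattern is deliberately simple (it does not decode HTML entities or handle unquoted attributes).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any real crawling library.
final class LinkExtractor {

    // Matches href="..." or href='...' inside an <a> tag.
    // Group 1 is the quote character, group 2 is the raw link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns every link found in the page, resolved against the page's own URL.
    static List<URI> extractLinks(String html, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            // Skip in-page anchors and script pseudo-links: nothing to request.
            if (href.isEmpty() || href.startsWith("#")
                    || href.toLowerCase().startsWith("javascript:")) {
                continue;
            }
            try {
                links.add(base.resolve(href));
            } catch (IllegalArgumentException e) {
                // Malformed target: skip it rather than abort the whole page.
            }
        }
        return links;
    }
}
```

Calling LinkExtractor.extractLinks(html, URI.create("http://example.edu/catalog/index.html")) on a page fetched from that address returns the absolute URLs to request next; relative targets such as page2.html are resolved against the page's directory.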


Details for Parser:
--------------------------
[+] Input files are text, HTML, or PDF files
[+] The parser is written in Java, implements our interface, and uses regular expressions, HTML parsing, and/or PDF parsing
[+] Java class will extract the information from an HTML page or PDF document and pass it to our library for uploading to the repository
[+] We will score the output using an automatic tool
[+] Output is stored using a provided Java library
[+] There are training materials to teach you our library
[+] Correctness is valued over speed
[+] A validation tool is provided to help check output files (it will not catch all mistakes, so you must be careful)
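
As a rough illustration of a parser task, the sketch below implements a small interface and extracts fields from one line of text with a regular expression. Everything in it is an assumption made for the example: the RecordParser interface stands in for our real one (which is covered in the training materials), and the input format "CS 101 Intro to Programming 3 credits" is invented. Because correctness is valued over speed, the sketch rejects input it cannot match instead of guessing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the interface a parser would implement.
interface RecordParser {
    Map<String, String> parse(String text);
}

// Parses one invented course line, e.g. "CS 101 Intro to Programming 3 credits".
final class CourseLineParser implements RecordParser {

    // subject (2-4 capitals), 3-digit number, title, credit value, the word "credit(s)"
    private static final Pattern COURSE = Pattern.compile(
            "\\s*([A-Z]{2,4})\\s+(\\d{3})\\s+(.+?)\\s+(\\d+(?:\\.\\d)?)\\s+(?i:credits?)\\s*");

    @Override
    public Map<String, String> parse(String text) {
        Matcher m = COURSE.matcher(text);
        // matches() requires the whole line to fit the pattern.
        if (!m.matches()) {
            // Fail loudly so a bad record is never stored silently.
            throw new IllegalArgumentException("Unrecognized course line: " + text);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("subject", m.group(1));
        fields.put("number", m.group(2));
        fields.put("title", m.group(3));
        fields.put("credits", m.group(4));
        return fields;
    }
}
```

A real parser would hand the extracted fields to the provided library for storage rather than return a map; the map here just keeps the example self-contained.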

Each Deliverable:
--------------------------
[1] Pick up work on our TaskMan system
[2] Check out source code for that task
[3] Try running it locally
[4] Modify the Java code until it works correctly
[5] Commit + push changes
[6] Use the TaskMan system to start a run of your code
[7] Check the score when the run finishes
[8] Release the task on TaskMan to get credit

Expected workload:
-----------------------------
At least 10 tasks per week (about 1-4 hours per task)
(The first few will take longer while you are unfamiliar with our library.)

Communications:
--------------------------
Must be comfortable text chatting in English over Skype. Must check in online or by email 3 times per week.


Notes:
----------
** Ability to read and understand English web pages is a must. **

** Familiarity with US college/university record systems is a plus. **

** Being a college/university student or recent graduate is a plus. **