I need to data-mine some sites for their publicly available information. We will adhere to each site's TOS so that we don't consume too many resources/requests.
I'll provide specific URLs and a set of instructions for responsible data mining.
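To keep request volume within a site's rules, the crawler will need some form of politeness layer. Here is a minimal sketch, assuming Python and only the standard library: it throttles requests to a minimum delay and checks each domain's robots.txt before fetching. The class name, delay value, and user-agent string are illustrative choices, not part of the spec.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteFetcher:
    """Enforces a minimum delay between requests and consults robots.txt,
    one way to avoid consuming too many resources on a target site."""

    def __init__(self, min_delay=1.0, user_agent="*"):
        self.min_delay = min_delay
        self.user_agent = user_agent
        self._last_request = 0.0
        self._robots = {}  # domain -> RobotFileParser (or None if unreadable)

    def allowed(self, url):
        """Return True if robots.txt permits fetching this URL."""
        domain = urlparse(url).netloc
        if domain not in self._robots:
            rp = RobotFileParser(f"http://{domain}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None  # robots.txt unreachable -> assume allowed
            self._robots[domain] = rp
        rp = self._robots[domain]
        return rp is None or rp.can_fetch(self.user_agent, url)

    def wait(self):
        """Sleep just long enough to honor the minimum delay between requests."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()
```

Per-site rules from the instructions I provide could then be expressed as different `min_delay` values per root URL.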
You'll need to understand how to parse HTML and JS, fill out simple forms, etc.
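As a rough illustration of the HTML-parsing side, a sketch using only Python's standard-library `html.parser` (the sample markup and class name are made up for the example; a real build might use a richer parser):

```python
from html.parser import HTMLParser


class LinkAndFormParser(HTMLParser):
    """Collects anchor hrefs and form input names from a page, the two
    pieces the crawler needs for following links and filling simple forms."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values from <a> tags
        self.form_fields = []  # name attributes from <input> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "input" and "name" in attrs:
            self.form_fields.append(attrs["name"])


sample = '<a href="/about">About</a><form><input name="q"></form>'
parser = LinkAndFormParser()
parser.feed(sample)
print(parser.links)        # -> ['/about']
print(parser.form_fields)  # -> ['q']
```

The collected field names could then be paired with the values from my instructions and submitted as a normal POST request.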
This application needs to be flexible:
1. The ability for our team to EASILY input different root URLs to crawl (a GUI or web form for this)
2. A configurable number of LEVELS to crawl (limit the crawler to in-domain links so it doesn't get buried thousands of levels deep, on-domain or off)
3. Recording/capture of HTTP status codes (404, etc.)
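Requirements 2 and 3 could be sketched roughly like this in Python, using only the standard library: a breadth-first crawl that stops at a configurable depth, stays on the root URL's domain, and records the HTTP status of every page it visits. The `fetch` callable is injectable so the logic can be tested offline; the regex link extraction is a placeholder for a real HTML parser.

```python
import re
from urllib.error import HTTPError
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


def crawl(root_url, max_depth, fetch=None):
    """Breadth-first crawl limited to root_url's domain.
    Returns a dict mapping each visited URL to its HTTP status code."""
    if fetch is None:
        def fetch(url):
            try:
                with urlopen(url, timeout=10) as resp:
                    return resp.status, resp.read().decode("utf-8", "replace")
            except HTTPError as e:
                return e.code, ""  # 404s etc. are still recorded

    domain = urlparse(root_url).netloc
    statuses = {}            # url -> HTTP status code (requirement 3)
    frontier = [root_url]
    for depth in range(max_depth + 1):   # requirement 2: level limit
        next_frontier = []
        for url in frontier:
            # visit each URL once, and only within the root domain
            if url in statuses or urlparse(url).netloc != domain:
                continue
            status, html = fetch(url)
            statuses[url] = status
            for href in re.findall(r'href=["\'](.*?)["\']', html):
                next_frontier.append(urljoin(url, href))
        frontier = next_frontier
    return statuses
```

In the hosted web app, the form from requirement 1 would supply `root_url` and `max_depth`, and `statuses` would be written to the database for reporting.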
I would like this to be a HOSTED web application, written in Python, Ruby, Node, or even PHP. I initially thought a desktop application would be good, but in the long run I think a web application will be better maintained.
Also, the needs my team and I have are common, so this application could in the future be turned into a revenue source.