Less than 1 month
30+ hrs/week
- Use the PySpider crawler based on the Scrapy Web API
- Maintain an index of crawl status
- Integrate with ZooKeeper to monitor the state of the crawl nodes
- Integrate with a Kafka queue to dump crawled results
PySpider is an open-source project implemented in Python with a very active developer community. So far we have identified it as one of the best solutions available.
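To make the crawl status index and the Kafka hand-off more concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not a required design: the class name `CrawlStatusIndex`, the state names, the `crawl-results` topic, and the `publish` callable (a stand-in for a Kafka producer's send call) are all hypothetical. ZooKeeper node monitoring is not covered here.

```python
import json
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical state names; the posting does not define them.
PENDING, FETCHED, FAILED = "pending", "fetched", "failed"


@dataclass
class CrawlStatusIndex:
    """In-memory index of crawl status, keyed by URL (illustrative only)."""

    # publish(topic, payload_bytes) stands in for a Kafka producer's send call.
    publish: Callable[[str, bytes], None]
    topic: str = "crawl-results"  # assumed topic name
    _status: Dict[str, dict] = field(default_factory=dict, init=False)

    def mark_pending(self, url: str) -> None:
        """Record that a URL has been scheduled but not yet fetched."""
        self._status[url] = {"state": PENDING, "updated": time.time()}

    def record_result(self, url: str, result: dict) -> None:
        """Mark a URL as fetched and dump its result to the queue."""
        self._status[url] = {"state": FETCHED, "updated": time.time()}
        payload = json.dumps({"url": url, "result": result}).encode("utf-8")
        self.publish(self.topic, payload)

    def record_failure(self, url: str, error: str) -> None:
        """Mark a URL as failed; nothing is published for failures."""
        self._status[url] = {"state": FAILED, "error": error, "updated": time.time()}

    def state_of(self, url: str) -> Optional[str]:
        """Return the current state of a URL, or None if it is unknown."""
        entry = self._status.get(url)
        return entry["state"] if entry else None

    def counts(self) -> Dict[str, int]:
        """Summarize how many URLs are in each state."""
        return dict(Counter(entry["state"] for entry in self._status.values()))
```

In a real deployment the `publish` argument could be bound to a Kafka client, for example the `send` method of a `KafkaProducer` from the kafka-python package, and the index would likely live in a persistent store rather than in memory. Both are choices left to the developer.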
Developer Tasks include:
1) Create a Vagrant box that can be started on demand and ...