Scalable Crawler in Python Presumably Scrapy

Cancelled

Job Description

We have thousands upon thousands of URLs that need to be crawled, with representative images (e.g. product pictures, as opposed to logos etc.) retrieved from each. We have established that this task can be comfortably accomplished with Scrapy.
Once the pages have been crawled and downloaded, they should be parsed using lxml or BeautifulSoup (or any good alternative; the contractor should be in a position to advise on the best parsing library, one that is highly error-tolerant and not resource-hungry). Sample "representative images" should be retrieved together with any relevant information we might deem useful to our project. We would like to do the crawling on an Amazon EC2 instance. Only apply if you have attempted such a project before.
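
To illustrate what we mean by "representative images", here is a rough sketch of the kind of filtering we have in mind. It is only an assumption about one possible approach, not a specification: it uses Python's standard-library html.parser as a stand-in for lxml or BeautifulSoup, and the names SKIP_HINTS, MIN_SIDE, ImageCollector and representative_images are hypothetical, as are the keyword list and the 100-pixel threshold.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical heuristic: substrings that suggest a non-product image.
SKIP_HINTS = ("logo", "icon", "sprite", "banner", "pixel")
MIN_SIDE = 100  # assumed minimum declared width/height in pixels


class ImageCollector(HTMLParser):
    """Collect the attributes of every <img> tag on one HTML page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def _too_small(value):
    """True if a width/height attribute is a plain number below MIN_SIDE."""
    return value is not None and value.isdigit() and int(value) < MIN_SIDE


def representative_images(html, base_url):
    """Return absolute URLs of images that pass the heuristic filter."""
    parser = ImageCollector()
    parser.feed(html)
    results = []
    for img in parser.images:
        src = img.get("src")
        if not src:
            continue  # nothing to download
        # Look for logo-like hints in the attributes most likely to carry them.
        text = " ".join(
            (img.get(key) or "") for key in ("src", "alt", "class", "id")
        ).lower()
        if any(hint in text for hint in SKIP_HINTS):
            continue
        # Skip images that declare themselves as tiny (e.g. tracking pixels).
        if _too_small(img.get("width")) or _too_small(img.get("height")):
            continue
        results.append(urljoin(base_url, src))
    return results
```

In a Scrapy-based crawl, a function like this would presumably be called from a spider callback with the response body and URL. A real solution would likely also need to check actual image dimensions after download, since many pages omit width and height attributes.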