I am looking for someone to install Apache Nutch and develop custom plugins for it. Nutch is written in Java, and so are its plugins.
I would like the following completed:
1. Install Nutch and Solr with Hadoop... on AWS, with S3 as the storage backend.
2. Create a Nutch plugin that filters out pages that do not contain keywords (start with only one keyword).
3. Create several (probably about 5) plugins that parse data from the pages as they are indexed. Using something like a regular expression to identify the page type and extract the data is fine.
---Store the processed pages in a "Parsed" directory.
4. Store the extracted data in a flat file whose name changes once per day (include the date in the file name).
5. Store all remaining pages in an "Unprocessed" directory.
6. Solr should be able to search both the "Unprocessed" and the "Parsed" data.
7. Set up Nutch to crawl the specified pages continuously.
All crawling and scraping must fully comply with each website's terms of service and with all applicable laws. Nutch helps with this, for example by following robots.txt rules.
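
To make items 2 through 5 more concrete, here is a rough sketch of the kind of logic involved, written in plain Java with no Nutch dependency. It is only an illustration, not a required design: the class name PageRouter, the "Price:" pattern, and the "data-<date>.txt" file name are all hypothetical placeholders. In a real plugin this logic would presumably sit behind Nutch extension points such as IndexingFilter or HtmlParseFilter.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): plain-Java logic a plugin could delegate to.
final class PageRouter {
    // Assumed example pattern: a page counts as a "product" page if it contains
    // text such as "Price: $19.99". Real patterns would depend on the target sites.
    private static final Pattern PRICE =
            Pattern.compile("Price:\\s*\\$(\\d+(?:\\.\\d{2})?)");

    private PageRouter() {}

    /** Item 2: keep a page only if its text contains the keyword (case-insensitive). */
    static boolean containsKeyword(String pageText, String keyword) {
        return pageText.toLowerCase(Locale.ROOT)
                .contains(keyword.toLowerCase(Locale.ROOT));
    }

    /** Item 3: return the captured value if the page matches the pattern. */
    static Optional<String> extractPrice(String pageText) {
        Matcher m = PRICE.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Items 3 and 5: pages that yield data go to "Parsed", the rest to "Unprocessed". */
    static String targetDirectory(String pageText) {
        return extractPrice(pageText).isPresent() ? "Parsed" : "Unprocessed";
    }

    /** Item 4: flat-file name that changes once per day because it embeds the date. */
    static String dailyFileName(LocalDate date) {
        return "data-" + date.format(DateTimeFormatter.ISO_LOCAL_DATE) + ".txt";
    }
}
```

For example, PageRouter.dailyFileName(LocalDate.of(2024, 3, 5)) returns "data-2024-03-05.txt", and a page whose text contains "Price: $19.99" would be routed to "Parsed". Keeping this logic separate from the Nutch plugin classes would also make it easy to unit test without running a crawl.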