Use an open-source machine learning software package of your choice (Mallet, an SVM library, a word-stemming tool, etc.) to create a web page classification system. You must be familiar with the common machine learning algorithms and know how to do basic optimization and tuning of the results to produce high-confidence outputs.
We provide text training data pulled from web pages and blog posts, converted to text-only output and classified into one or more categories. We also provide test data without categories. You do not need to do any web scraping.
You then prepare that data using scripts or code that are part of the deliverable, transforming it into the appropriate format for the machine learning engine of choice.
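As a rough illustration of the preparation step, the sketch below turns plain text into sparse "label index:count" lines in the LIBSVM style. The target format is an assumption: Mallet and other engines expect their own layouts. The function names, the sample documents, and the single integer label per document are also hypothetical, since the real data may carry several categories per page.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())

def build_vocabulary(documents, min_count=1):
    """Map each token seen at least min_count times to a 1-based feature id."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: i for i, tok in enumerate(kept, start=1)}

def to_sparse_line(label, text, vocab):
    """Render one document as 'label id:count id:count ...' with ascending ids."""
    counts = Counter(vocab[t] for t in tokenize(text) if t in vocab)
    feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {feats}".rstrip()

# Hypothetical sample documents with made-up integer category ids.
docs = ["Cheap flights and hotel deals", "Hotel reviews and travel tips"]
vocab = build_vocabulary(docs)
lines = [to_sparse_line(label, doc, vocab) for label, doc in zip([1, 2], docs)]
```

The vocabulary should be built from the training data only and then reused unchanged for the test data, so that feature ids mean the same thing in both files.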
You pick one or more algorithms, explain your choices, try them on the data, and tune the parameters, explaining in the deliverable how you went about the tuning process. You then run the tuned system on the test data to produce a final output.
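One possible way to document the tuning process is a cross-validated parameter sweep. The sketch below is only an illustration under stated assumptions: it uses a small hand-written multinomial Naive Bayes rather than any particular package, tunes a single smoothing parameter (alpha, which must be greater than zero), and runs on made-up toy samples with one category each. With a real package, the same loop would wrap that package's own training and prediction calls.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, alpha):
    """Fit multinomial Naive Bayes; samples is a list of (tokens, label)."""
    label_counts = Counter(label for _, label in samples)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab, alpha

def predict_nb(model, tokens):
    """Return the label with the highest log-posterior for the tokens."""
    label_counts, token_counts, vocab, alpha = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((token_counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_val_accuracy(samples, alpha, k=3):
    """Mean accuracy over k folds; fold i holds out every k-th sample."""
    scores = []
    for fold in range(k):
        train = [s for i, s in enumerate(samples) if i % k != fold]
        held_out = [s for i, s in enumerate(samples) if i % k == fold]
        model = train_nb(train, alpha)
        hits = sum(predict_nb(model, toks) == label for toks, label in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k

def tune_alpha(samples, candidates, k=3):
    """Return (best_alpha, accuracy per candidate) by cross-validation."""
    results = {a: cross_val_accuracy(samples, a, k) for a in candidates}
    best = max(candidates, key=lambda a: results[a])
    return best, results

# Made-up toy samples, one category each, purely for illustration.
toy = [
    ("goal match team score".split(), "sports"),
    ("election vote senate law".split(), "politics"),
    ("team win league goal".split(), "sports"),
    ("vote law president election".split(), "politics"),
    ("score league match win".split(), "sports"),
    ("senate president vote law".split(), "politics"),
]
best_alpha, results = tune_alpha(toy, [0.01, 0.1, 1.0])
```

Recording the score for every candidate value, not just the winner, makes it easier to explain in the deliverable how sensitive the classifier is to each parameter.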
You must be able to explain the following in your deliverable:
How much is enough training data?
How can we tell whether the data is producing a high-quality (or poor-quality) classifier?
What are the impacts of the various tuning parameters? How did you go about adjusting them to improve the output?
Which algorithm is best for the job?
What is the best quality we can expect from the chosen algorithm?
How can we detect that the algorithm is having trouble with a specific instance and needs manual intervention?
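Two of the questions above, classifier quality and spotting instances that need manual intervention, can be approached with fairly simple measurements. The sketch below is a minimal illustration, assuming one gold category per held-out instance and a classifier that returns a probability per category. The function names and the threshold values (0.6 and 0.2) are hypothetical and would need to be chosen from validation data.

```python
from collections import Counter

def per_category_scores(gold, predicted):
    """Return {category: (precision, recall, f1)} from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[g] += 1  # gold category gains a false negative
    scores = {}
    for cat in sorted(set(gold) | set(predicted)):
        precision = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        recall = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cat] = (precision, recall, f1)
    return scores

def needs_review(probabilities, min_top=0.6, min_margin=0.2):
    """Flag an instance whose best category is weak or nearly tied."""
    ranked = sorted(probabilities.values(), reverse=True)
    top = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return top < min_top or (top - runner_up) < min_margin
```

For example, `needs_review({"sports": 0.55, "politics": 0.45})` flags the page for a human, while `needs_review({"sports": 0.9, "politics": 0.1})` does not. Computing the per-category scores at several training-set sizes is also one way to judge whether more training data would still help.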
The language may be Java or Python.