Machine Learning Developer for Text Classification
Closed - This job posting has been filled and work has been completed.
You: an engineer with recent experience in machine learning methods, with specific application to text-based classification and topic modeling. If you are the perfect fit, this may even be easy for you!
We have: 100K small web-page documents (blog posts converted to text), sorted roughly into 40 categories. The sorting is imperfect and unbalanced, and one page can be assigned anywhere from 0 to 5 categories, with most content tagged with exactly 1 category. (Not the friendliest training data, but very “real world”.)
We provide the text training data and associated categories. Our code can already produce HTML-stripped, stemmed, lowercase, tokenized text, so it is not necessary for the chosen Machine Learning library to handle stemming. You do not need to do any web scraping.
There are two parts:
Part 1: Create a Classifier.
The classifier will sort future content into the categories, and can also help spot misclassified content. This may involve cross-validation or some other method of running the classifier on the training content and spotting outliers. You must be skilled with the chosen algorithm and with tuning its input parameters (SVM vs. Naive Bayes, which kernel function to use, which parameters need optimization, etc.), as we will create the final classifier using our full dataset and continue to update it periodically as we gather more content.
The classifier must be built with one of the following libraries. If you suggest an alternate library that is better for some reason (speed, etc.), you can make your case for it, as long as it is something we can compile and run on CentOS with ~1GB of memory. Nothing too strange, and I don’t believe we can easily run MATLAB, sorry!
Java: Mallet, JSAT, Weka
Python: Gensim, Shogun, Liblinear
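For illustration only (scikit-learn is not on the list above, but its LinearSVC wraps the same Liblinear library), here is a minimal sketch of the multi-label setup described in Part 1, on toy stand-in data:

```python
# Sketch only: multi-label text classification via scikit-learn's LinearSVC,
# which wraps Liblinear. Documents and tags below are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = [
    "python code tutorial beginner",
    "recipe bake bread flour yeast",
    "python web scrape request",
    "bread sourdough starter recipe",
]
tags = [["tech"], ["food"], ["tech"], ["food"]]  # each doc: 0..n categories

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

binarizer = MultiLabelBinarizer()        # maps tag lists to a 0/1 matrix
Y = binarizer.fit_transform(tags)

# One binary SVM per category; C is the main regularization knob to tune.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(X, Y)

pred = clf.predict(vectorizer.transform(["bread sourdough starter"]))
print(binarizer.inverse_transform(pred))
```

The one-vs-rest wrapper is what lets a single document come back with 0, 1, or several category labels, matching the 0-to-5 tagging described above.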
Part 2: Create a topic identifier.
Many (perhaps a quarter) of the documents within a category will be posts discussing a small set of “hot topics”. There are likely to be 10-100 “hot topics” within a category, with the specific topics changing over time. Within a category, create a topic identifier. This will likely use LDA or HLDA, but if you know of a way to identify topic clusters of documents using other methods (TF-IDF clustering, etc.), great, as long as you can explain why it is better! We don’t know how many “hot topics” exist within a category in a given week, but we can estimate.
We will then run the topic identifier on recent content (the last month) to create a “topic list” of anywhere from 10-100 topics for each category. We then test each document to see if it falls within a topic, with an expected roughly 1-in-5 chance of it fitting. Incoming content is also compared to the topic list for a match. We will re-generate the topics on new content weekly, so the algorithm does not have to be continually updated.
In both cases, you pick one or more algorithms, explain your choices, try them on the data, tune the parameters (again explaining in the deliverable how you went about the tuning process), and run them on the test data to produce a final output. We run the same algorithm on the full dataset, go through the tuning steps you described, and if all goes well, end up with an accurate, fast classifier and topic identifier!
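The tuning step above can be sketched as a cross-validated grid search. This example assumes scikit-learn (not on the library list, used here purely for illustration) and toy data; the same idea applies to kernel choice or any other parameter:

```python
# Sketch only: tune the regularization parameter C with 5-fold
# cross-validation. Data below is a toy, duplicated placeholder set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

docs = ["python code tutorial", "bread flour yeast",
        "python web request", "sourdough starter recipe"] * 5
labels = ["tech", "food", "tech", "food"] * 5

X = TfidfVectorizer().fit_transform(docs)

search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, labels)
print(search.best_params_, search.best_score_)
```

Reporting the grid searched, the score used, and the winning values is one straightforward way to document the tuning process the deliverable asks for.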
You must be able to explain the following in your final deliverable (not right now):
* Why did you pick the algorithm you did?
* Is the chosen algorithm the most recent? Is it online/updatable? Does the dictionary (words to ints) cause any issues when saving the classifier for future use on unknown content?
* Where is this algorithm and method likely to have trouble? Is there any way to spot if it is having trouble on a category or type of document?
* What are the impacts of the various tuning parameters? How did you go about adjusting them to improve the output?
* What is the best quality we can expect from the chosen algorithm? Worst?
In your application for this job, please include:
1. How good are you at machine learning when it comes to text-based data?
2. Have you recently done anything similar to the two tasks?
3. What is your estimated level of effort?
4. Does any part of the two tasks require learning on your part? How close to 100% "known technology" are these two tasks?