Extract popular phrases from a text corpus
Closed - This job posting has been filled and work has been completed.
I need you to perform analysis on natural language text corpora.
You will be given a text corpus of ~600K sentences in English (with typos, slang, etc.). Your task is to create a script which:
1) extracts popular phrases (with frequencies) using various methods:
(you can use the ones from here http://www.quora.com/What-is-the-b
2) has a control panel with parameters for the extraction (length of the n-grams, stop-word lists, work with adjectives, etc.)
3) since the dataset contains a lot of spelling errors and slang, there should also be a control that allows working with substitution dictionaries (e.g. we provide a dictionary of popular typos, and the extraction is then re-run with that dictionary applied)
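To make the three requirements concrete, here is a minimal sketch in Python of what the core of such a script could look like. It covers only plain n-gram frequency counting, not the other extraction methods. The function name `extract_phrases`, its parameters, and the sample sentences are illustrative assumptions, not a required interface or part of the attached data.

```python
import re
from collections import Counter


def extract_phrases(sentences, n=2, stop_words=frozenset(),
                    substitutions=None, top_k=10):
    """Return the top_k most frequent n-grams as (phrase, count) pairs.

    Hypothetical helper: the parameters mirror the "control panel"
    options (n-gram length, stop-word list, substitution dictionary).
    """
    substitutions = substitutions or {}
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split into simple word tokens (apostrophes kept).
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        # Map known typos/slang to their canonical forms before counting.
        tokens = [substitutions.get(t, t) for t in tokens]
        # Drop stop words; n-grams may therefore span a removed word.
        tokens = [t for t in tokens if t not in stop_words]
        # Slide a window of length n over the remaining tokens.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_k)


# Made-up sample input, for illustration only.
sentences = ["I luv this fone", "love this phone so much", "this phone is gr8"]
typos = {"luv": "love", "fone": "phone", "gr8": "great"}
stop = {"i", "is", "so"}

top = extract_phrases(sentences, n=2, stop_words=stop,
                      substitutions=typos, top_k=2)
print(top)  # [('this phone', 3), ('love this', 2)]
```

Running the same call without `substitutions` counts "this phone" only twice, which is the effect the substitution-dictionary control in item 3 is meant to address. Removing stop words before building n-grams is one possible design choice; filtering whole n-grams afterwards is an equally valid alternative.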
A sample of the text data is attached.
If you have a ready solution (your own or someone else's that you know of), that is totally fine.