Research, find, crawl and collect all forms related to business licenses from government and/or private websites.
We are looking to collect an exhaustive list of all licenses/forms required by law for every type of business in the United States. Licenses and forms may be required at four levels: federal, state, city, and county. This means we need someone to research and find every license/form required by the federal government, by every state, by every city, and by every county, for all business types.
We are looking to index all licenses/forms and any metadata related to them, such as state, city, county, business type, required by which authority, description, name, and any other related information.
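As a rough illustration of what one indexed entry might look like, here is a minimal sketch of a record type. The class name `LicenseForm`, the field names, and the `source_url` and `extra` fields are assumptions made for this example; the final schema is open to discussion.

```python
from dataclasses import dataclass, asdict, field
from typing import Optional

# Jurisdiction levels named in this posting.
LEVELS = ("federal", "state", "city", "county")


@dataclass
class LicenseForm:
    """One indexed license/form (hypothetical schema, for illustration only)."""
    name: str
    description: str
    authority: str            # which authority requires the form
    level: str                # one of LEVELS
    business_type: str
    source_url: str           # assumed field: where the form was found
    state: Optional[str] = None
    city: Optional[str] = None
    county: Optional[str] = None
    extra: dict = field(default_factory=dict)  # any other related information

    def __post_init__(self):
        # Reject records whose level is not one of the four expected values.
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")


record = LicenseForm(
    name="General Business License",
    description="Example entry with made-up values.",
    authority="Example City Clerk",
    level="city",
    business_type="restaurant",
    source_url="https://example.gov/forms/business-license.pdf",
    state="CA",
    city="Example City",
)
row = asdict(record)  # plain dict, ready to store as JSON or a database row
```

Keeping every entry in one flat record like this should make it straightforward to filter by state, city, county, or business type later.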
Ideally we would like an application built that uses any collection of APIs (e.g. the Google search API) and frameworks (such as the Python Scrapy framework) and also utilizes Natural Language Processing, so that the crawlers locate and extract all required forms based on the language used on a page rather than on the site's structure.
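To make the "language, not site structure" idea concrete, here is a minimal sketch that scores links by the words in their anchor text instead of by their position in the page. It uses only the Python standard library and plain keyword matching as a stand-in for real NLP; the keyword list, the class `FormLinkFinder`, and the function `find_form_links` are assumptions made for this example, not a required design.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed keyword list; a real system would use a trained model or richer rules.
KEYWORDS = ("license", "licence", "permit", "application", "registration", "form")


class FormLinkFinder(HTMLParser):
    """Collects links whose anchor text contains license-related words."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self._href = None      # href of the <a> tag currently open, if any
        self._text = []        # text fragments seen inside that tag
        self.candidates = []   # (score, anchor text, absolute URL)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            # Score = number of keywords that occur in the anchor text.
            score = sum(kw in text.lower() for kw in KEYWORDS)
            if score > 0:
                url = urljoin(self.base_url, self._href)
                self.candidates.append((score, text, url))
            self._href = None


def find_form_links(html, base_url):
    """Return candidate form links, highest keyword score first."""
    finder = FormLinkFinder(base_url)
    finder.feed(html)
    return sorted(finder.candidates, reverse=True)


page = (
    '<div><a href="/docs/bl.pdf">Business License Application</a></div>'
    '<p><a href="/about">About us</a></p>'
)
links = find_form_links(page, "https://example.gov/")
```

Because the finder looks only at link text, the same code works whether the link sits in a table, a sidebar, or a paragraph. In a full solution this scoring step would be replaced or supplemented by a proper NLP model.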
The application will run every few days or once a week (the schedule should be configurable by the user running the application) and should be able to detect when a new form has been introduced or an existing form has changed.
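One simple way to detect new or changed forms between runs is to store a content hash per form URL and compare it on the next run. The sketch below assumes that approach; the function names `fingerprint` and `classify_forms` and the in-memory dictionaries are illustrative only, and other detection methods are welcome.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of a downloaded form's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def classify_forms(previous: dict, current: dict) -> dict:
    """Compare this run against the last one.

    previous: URL -> digest saved from the last run
    current:  URL -> raw bytes downloaded in this run
    """
    report = {"new": [], "changed": [], "unchanged": []}
    for url, content in current.items():
        digest = fingerprint(content)
        if url not in previous:
            report["new"].append(url)
        elif previous[url] != digest:
            report["changed"].append(url)
        else:
            report["unchanged"].append(url)
    return report


# Made-up data standing in for two consecutive runs.
last_run = {
    "https://example.gov/a.pdf": fingerprint(b"version 1"),
    "https://example.gov/b.pdf": fingerprint(b"same"),
}
this_run = {
    "https://example.gov/a.pdf": b"version 2",
    "https://example.gov/b.pdf": b"same",
    "https://example.gov/c.pdf": b"brand new form",
}
report = classify_forms(last_run, this_run)
```

A byte-level hash flags any change, including cosmetic ones such as a regenerated PDF timestamp, so a production version might hash extracted text instead.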
We will need a quote for both possibilities:
a) Having the application contained within a simple UI.
b) Scripts that may be run as a cron job on a Linux server.
API, Python, Research, crawler, nlp, natural language processing