OCR of a dataset in economic history


Job Description

I am a PhD student in Economics in the US and I have just started a research project in economic history. I need parts of a historical book transcribed into an Excel spreadsheet.

This dataset covers all the manufacturing industries in the United States during the period of the Great Depression (1929, 1931, 1933 and 1935). Ideally, I am looking for an Economics student who would be interested in using this dataset to write a paper or a thesis. In that case, this job would be a win-win for both of us.

The dataset covers about 300 industries, with one table per industry to run through OCR. After the OCR, the data needs some further cleaning before it can be used (removing footnotes, converting characters, etc.).

I estimate that this job would take me about 4 hours, but I would be glad to outsource it to someone else.

Miguel Morin

Detailed instructions: Download the zipped package of the "Biennial Census of Manufactures, 1935" from the following website:


Unzip the file. Open Chapter 3. For each industry, record the page in the book that contains Table 1 "General statistics," and perform an OCR of the table.

If an industry has several sub-industries under a heading (for example, butter, cheese, and condensed milk), perform the OCR on each sub-industry separately, not on the "total for the industry."

Verify that the numbers match the printed table. If the OCR output contains errors, fix them manually in Excel. Remove footnotes, and remove dots or hyphens from the names.
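
For anyone who prefers to script part of the cleaning rather than do it all by hand in Excel, here is a minimal Python sketch of what the cleaning step could look like. It is only an illustration under assumptions of mine: that the dots and hyphens appear as runs of leader characters after a name, that footnote marks show up as superscripts or as a trailing one- or two-digit number, and that the OCR sometimes reads digits as look-alike letters. The function names clean_name and parse_number are hypothetical, and the rules would need adjusting to the actual OCR output.

```python
import re

# Assumed OCR confusions between letters and digits; extend as needed.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def clean_name(raw: str) -> str:
    """Remove leader dots/hyphens and footnote marks from a name (assumed layout)."""
    # Runs of two or more dots, hyphens, or dashes are treated as leaders.
    text = re.sub(r"[.\-\u2013\u2014]{2,}", " ", raw)
    # Superscript digits are treated as footnote marks.
    text = re.sub(r"[\u00b9\u00b2\u00b3\u2070-\u2079]+", "", text)
    # A trailing one- or two-digit number is treated as a footnote mark.
    text = re.sub(r"\s+\d{1,2}\s*$", "", text)
    # Collapse whitespace and trim any leftover dots or hyphens at the ends.
    return re.sub(r"\s+", " ", text).strip(" .-")


def parse_number(raw: str):
    """Convert an OCR'd cell to an int, or return None so it can be checked by hand."""
    text = raw.translate(LOOKALIKES)
    text = re.sub(r"[,\s$]", "", text)  # drop thousands separators, spaces, dollar signs
    return int(text) if re.fullmatch(r"[0-9]+", text) else None
```

Cells for which parse_number returns None are the ones to compare against the book and fix manually.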

Skills: English