We have a program written in C# that we would like to convert to Ruby.
The app performs the following functions:
- Download attachments from an email address and store them in our database
- OCR any files where the text cannot be extracted automatically.
- Monitor a number of URLs and check for changes in website content (HTML, PDF, Word files, images). The app stores a hash of each page's content and compares later fetches against it to identify whether the page has changed.
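The change-detection step in the last item could be sketched in Ruby roughly as follows. This is only an illustration under assumptions: SHA-256 is picked here as the hash (the existing C# app may use a different one), `store` is a plain Hash standing in for the database table, and the names `content_digest`, `record_and_check`, and `fetch_body` are hypothetical.

```ruby
require "digest"
require "net/http"
require "uri"

# Returns a hex SHA-256 digest of the raw response bytes.
# Hashing the bytes works the same way for HTML, PDF, Word files, and images.
def content_digest(body)
  Digest::SHA256.hexdigest(body)
end

# Compares freshly fetched content against the digest stored for this URL,
# then saves the new digest.
# `store` is a hypothetical Hash standing in for the database table.
# Returns true only when a previous digest exists and differs from the new one.
def record_and_check(store, url, body)
  new_digest = content_digest(body)
  changed = store.key?(url) && store[url] != new_digest
  store[url] = new_digest
  changed
end

# Fetches the body of a URL with the standard library.
# Note: Net::HTTP.get does not follow redirects.
def fetch_body(url)
  Net::HTTP.get(URI(url))
end
```

A monitoring job would call `record_and_check(store, url, fetch_body(url))` for each watched URL on a schedule. Pages that embed timestamps or session tokens produce a new hash on every fetch, so HTML may need normalizing before it is hashed.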
Tesseract OCR looks like a good option for OCR of documents.
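One way to call Tesseract from Ruby is to shell out to its command-line tool. The sketch below assumes the `tesseract` binary is installed and on the PATH and that English (`eng`) language data is available. The helper names `tesseract_command` and `ocr_image` are hypothetical, and a wrapper gem could be used instead.

```ruby
require "open3"

# Builds the argument list for the Tesseract CLI.
# Passing "stdout" as the output base makes Tesseract print the recognized
# text instead of writing it to a file.
def tesseract_command(image_path, lang: "eng")
  ["tesseract", image_path, "stdout", "-l", lang]
end

# Runs Tesseract on an image file and returns the recognized text.
# Assumes the tesseract binary is installed; raises if the command fails.
def ocr_image(image_path, lang: "eng")
  stdout, stderr, status = Open3.capture3(*tesseract_command(image_path, lang: lang))
  raise "tesseract failed: #{stderr}" unless status.success?
  stdout
end
```

Passing the arguments as an array avoids shell interpolation, which matters because file names come from email attachments.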
There is also a useful gem for checking emails:
The pdfocr gem only OCRs PDF files. We would also like to OCR images. To do this, we should convert images to PDF before passing them to the pdfocr gem. We can use docsplit (http://documentcloud.github.io/doc