Text Extraction from Scanned PDF Documents
Closed - This job posting has been filled and work has been completed.
I have some scanned (OCR) and vector PDF documents on my server. I need an expert in "tesseract", "ocropus" and/or "cuneiform" to suggest a good solution for automating the task.
The task is to extract the text content from each PDF document and update it in a database. I already have a script written in PHP. It currently uses the ocrwebservice.com web service, but this service does not return all the text found in a page image, so I am now looking for a better solution.
1. You evaluate "tesseract", "cuneiform" and "ocropus" and let me know which one gives the best results.
2. You tell me how to install it in a Linux environment. I have an Ubuntu VPS. I will install it on my server myself and ask for your help only when required.
3. I give you my current script. You change the calls from ocrwebservice.com to whatever we decide to use.
4. I test the script and we complete the project.
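To make the scope concrete, here is a rough sketch of the kind of flow I have in mind. It is only an illustration under assumptions, not the final design: it assumes tesseract ends up being the chosen engine, that the `tesseract`, `pdftotext` and `pdftoppm` command-line tools are installed (on Ubuntu these come from the `tesseract-ocr` and `poppler-utils` packages), and that checking the text layer once per document is good enough. The function names, the 20-character threshold and the 300 dpi setting are all hypothetical. It is written in Python for brevity; the PHP script would issue the same commands.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumption: a document with fewer characters than this in its text layer
# is treated as a scan that needs OCR. The value is a guess to be tuned.
MIN_CHARS = 20


def needs_ocr(text_layer, min_chars=MIN_CHARS):
    """Return True when the embedded text layer is (almost) empty."""
    return len(text_layer.strip()) < min_chars


def pdftotext_command(pdf_path):
    # "-" sends the extracted text to stdout instead of a file.
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    # Renders every page to <out_prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, lang="eng"):
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def run(cmd):
    """Run a command and return its stdout; raises if the tool fails."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def extract_text(pdf_path):
    """Use the text layer for vector PDFs, fall back to OCR for scans."""
    text = run(pdftotext_command(pdf_path))
    if not needs_ocr(text):
        return text
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        run(pdftoppm_command(pdf_path, prefix))
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        return "\n".join(run(tesseract_command(page)) for page in pages)
```

The string returned by `extract_text("some.pdf")` is what the existing script would then write to the database in place of the ocrwebservice.com response. If cuneiform or ocropus turns out to give better results, only `tesseract_command` would need to change.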
For an expert in this area, this should be really simple, and maybe you could complete it in a day.
Skills: pdf, linux