Text Extraction from Scanned PDF document

Closed - This job posting has been filled and work has been completed.

Job Description

I have some scanned (OCR) and vector PDF documents on my server. I need an expert in "tesseract", "ocropus" and/or "cuneiform" to suggest a good solution for automating the task.

The task is to extract the text content from each PDF document and update it in a database. I already have a script written in PHP. It currently uses the ocrwebservice.com web service, but this service does not return all the text found in a page image, so I am now looking for a better solution.

Your responsibilities:
1. You evaluate "tesseract", "cuneiform" and "ocropus" and let me know which one gives the best results.
2. You tell me how to install it in a Linux environment. I have an Ubuntu VPS. I will install it on my server myself and ask for your help only when required.
3. I give you my current script. You change the calls from ocrwebservice.com to whatever we decide to use.
4. I test the script and we complete the project.
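
To make the expected flow concrete, here is a rough sketch of what the replacement for the ocrwebservice.com calls might look like. It is written in Python purely for illustration (my actual script is PHP), and it assumes tesseract is the engine we end up choosing, which is not decided yet. It also assumes the `pdftotext` and `pdftoppm` tools from poppler-utils and the `tesseract` command are installed on the server. The function names and the `MIN_CHARS` threshold are hypothetical and would need tuning against my real documents.

```python
import subprocess
from pathlib import Path

# Hypothetical threshold: below this many non-whitespace characters we
# treat the PDF as scanned (no usable embedded text) and fall back to OCR.
MIN_CHARS = 20


def needs_ocr(embedded_text, min_chars=MIN_CHARS):
    """Return True when the embedded text layer is too small to be useful."""
    return len("".join(embedded_text.split())) < min_chars


def pdftotext_cmd(pdf):
    # "-" as the output file makes pdftotext write to standard output.
    return ["pdftotext", "-layout", str(pdf), "-"]


def pdftoppm_cmd(pdf, prefix, dpi=300):
    # Renders each page to <prefix>-<page number>.png at the given resolution.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_cmd(image, lang="eng"):
    # "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", str(image), "stdout", "-l", lang]


def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def extract_text(pdf, workdir):
    """Try the embedded text first (vector PDFs), then OCR page images."""
    text = run(pdftotext_cmd(pdf))
    if not needs_ocr(text):
        return text

    prefix = Path(workdir) / "page"
    run(pdftoppm_cmd(pdf, prefix))

    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(Path(workdir).glob("page-*.png")):
        pages.append(run(tesseract_cmd(image)))
    return "\n".join(pages)
```

The idea is that vector PDFs never go through OCR at all, and only scanned pages are rendered to images and recognized. The returned text would then be written to the database by the existing script.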

For an expert in this area, this should be really simple, and maybe you could complete it in a day.

---
Skills: pdf, linux