
Accuracy is not as good as with the Ghostscript library

Feb 19, 2013 at 12:52 PM
I used the Hocr2pdf library and the extracted text came out as "Hgusing". With the Ghostscript and Tesseract libraries, the same text was extracted as "Housing", which is correct. The issue appears to be that the page image extracted from the PDF is of lower quality.
Mar 4, 2013 at 5:38 PM
Edited Mar 7, 2013 at 3:10 AM
By default, the GetPageImage(int pagenumber) function in the PDFReader class uses iTextSharp's internal method to get the page image. If that fails, MuPDF (from Artifex, the developers of Ghostscript) is used instead. iTextSharp cannot properly get the page image when the PDF already uses JBIG2 or JPEG2000 compression. A slight modification to the GetPageImage function will ensure that it always uses MuPDF, which will hopefully give you better results. If you make any changes that improve the OCR quality, please send them to me so that I can integrate them.
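
The change described above amounts to skipping the first renderer and going straight to the second. Here is a rough sketch of that control flow in Python, purely as an illustration: the project itself is .NET, and `get_page_image`, `primary`, `fallback`, `force_fallback`, and `RenderError` are hypothetical names, not the library's actual API.

```python
class RenderError(Exception):
    """Raised by a renderer that cannot produce a page image (hypothetical)."""


def get_page_image(page_number, primary, fallback, force_fallback=False):
    """Return (image, renderer_name) for the given page.

    `primary` and `fallback` are callables taking a page number. They stand in
    for the iTextSharp and MuPDF code paths; both are assumptions for this sketch.
    """
    if not force_fallback:
        try:
            return primary(page_number), "primary"
        except RenderError:
            # e.g. JBIG2 or JPEG2000 content the primary path cannot handle
            pass
    # Reached when the primary renderer failed or was skipped on purpose.
    return fallback(page_number), "fallback"
```

Passing `force_fallback=True` corresponds to the "always use MuPDF" modification; leaving it at the default keeps the existing try-then-fall-back behaviour.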

You can also change the OcrController class to return a grayscale image instead of a bitonal one, which may give better results.
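
To illustrate why grayscale might help, here is a small Python sketch of the difference between the two representations. It does not reflect how OcrController is implemented; the function names and the fixed threshold of 128 are assumptions made for the example.

```python
def to_grayscale(rgb_pixels):
    """Convert (r, g, b) tuples to 0-255 luma values using ITU-R BT.601 weights."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def to_bitonal(gray_pixels, threshold=128):
    """Force every grayscale value to pure black (0) or pure white (255)."""
    return [255 if g >= threshold else 0 for g in gray_pixels]


# Four sample pixels: black, dark gray, light gray, white.
pixels = [(0, 0, 0), (100, 100, 100), (140, 140, 140), (255, 255, 255)]
gray = to_grayscale(pixels)
bitonal = to_bitonal(gray)
```

In the grayscale list the two mid-tones stay distinct, whereas the bitonal list pushes them to opposite extremes. On a low-quality scan, that kind of hard cut can break or merge character strokes, which is one plausible reason an "o" could be read as a "g". Whether grayscale actually improves recognition will depend on the document.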