hOcr2Pdf.NET is a .NET library, written in C#, that creates or converts .hocr HTML produced by Tesseract into highly compressed, searchable PDFs using HtmlAgilityPack, JBIG2 and iTextSharp. The download includes a sample app that compresses and OCRs PDF files into JBIG2 PDFs.

Requires Ghostscript: http://www.ghostscript.com/download/gsdnld.html

Source code and binaries in SVN are up to date.

Features
- Simple, flexible design. Accomplish most tasks with only one class (PDFCreator).
- Completely automated. Just add your images or PDFs to the PDFCreator and it will do the rest.
- Provides callbacks for further image processing at each stage (preprocessing the image for OCR, preprocessing the image for display).
- Recompress or process images within existing PDFs without affecting OCR data.
- Easily embed fonts into the output.
- ImageProcessor class provides utilities for cropping, deskewing, image conversions, color conversions, etc.
- Set PDF properties such as Author, Title, etc.
- Tools for reading existing PDFs (PDFReader) to extract text, extract page images, split a PDF by bookmarks, apply a watermark, etc.
- Sample app (hocr2pdf-gui) to help guide you.

Special thanks to the developers of:
- http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe -- used to encode images to JBIG2
- http://htmlagilitypack.codeplex.com/ -- used to parse the hOCR files
- http://code.google.com/p/tesseract-ocr/ -- used for OCR and hOCR output
- https://launchpad.net/cuneiform-linux/ -- used for OCR and hOCR output
- http://itextpdf.com/ -- used to create PDFs
- http://www.ghostscript.com/download/gsdnld.html -- used for PDF page extraction
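
For readers unfamiliar with the hOCR input mentioned at the top of this page, here is a minimal sketch of the format. It is not hOcr2Pdf.NET code: the library is C# and parses hOCR with HtmlAgilityPack, while this illustration uses Python's standard library for brevity. The names `HocrWordParser`, `parse_hocr_words`, `to_pdf_points` and the `SAMPLE` markup are hypothetical. The sketch assumes Tesseract-style `ocrx_word` spans whose `title` attribute starts with `bbox x0 y0 x1 y1` in pixels, measured from the top-left corner, and that word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

# hOCR stores each box in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 95".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every <span class="ocrx_word">."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans hold no nested <span>, so the next </span> closes the word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_html):
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    return parser.words


def to_pdf_points(bbox, page_height_px, dpi):
    """Convert a top-left-origin pixel box to a bottom-left-origin box in points (1/72 inch)."""
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi
    return (x0 * scale, (page_height_px - y1) * scale,
            x1 * scale, (page_height_px - y0) * scale)


# Hypothetical sample in the shape Tesseract emits.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 640 480'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 91'>world</span>
</span></div>"""

if __name__ == "__main__":
    for text, bbox in parse_hocr_words(SAMPLE):
        print(text, bbox, to_pdf_points(bbox, page_height_px=480, dpi=72))
```

The coordinate flip in `to_pdf_points` reflects the general difference between image coordinates (origin at the top left) and PDF user space (origin at the bottom left). How hOcr2Pdf.NET itself positions the text layer is not described on this page.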