Project Description
hOcr2Pdf.NET is a .NET library, written in C#, that converts the hOCR HTML (.hocr) produced by Tesseract or Cuneiform into highly compressed, searchable PDFs using HtmlAgilityPack, JBIG2 and iTextSharp.

The download includes a sample app that compresses and OCRs PDF files into JBIG2-compressed PDFs.
Requires Ghostscript: http://www.ghostscript.com/download/gsdnld.html

The source code and binaries in SVN are up to date.

  • Simple/Flexible Design. Accomplish most tasks with only one class (PDFCreator).
  • Completely automated. Just add your images or PDFs to the PDFCreator and it will do the rest.
  • Provides callbacks for further image processing at each stage (preprocessing the image for OCR, preprocessing the image for display).
  • Recompress/Process Images within existing PDFs without affecting OCR data.
  • Easily embed fonts into output.
  • ImageProcessor class provides utilities for cropping, deskewing, image conversions, color conversions, etc.
  • Set PDF properties such as Author, Title, etc.
  • Tools for reading existing PDFs (PDFReader) to extract text and page images, split a PDF by bookmarks, apply a watermark, etc.
  • Sample app (hocr2pdf-gui) to help guide you.
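
For orientation, hOCR files record each recognized word's position in the element's title attribute as "bbox x0 y0 x1 y1", and a converter uses those boxes to place invisible text over the page image. The following is a minimal sketch of reading them, not the library's actual code: it uses a regular expression instead of HtmlAgilityPack to stay dependency-free, it assumes Tesseract-style markup in which the class attribute precedes title, and the names HocrWord and HocrSketch are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical type for illustration only; not part of hOcr2Pdf.NET.
public sealed class HocrWord
{
    public string Text = "";
    public int X0, Y0, X1, Y1;   // bounding box in image pixels
}

public static class HocrSketch
{
    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>.
    // Assumption: the class attribute appears before the title attribute.
    private static readonly Regex WordPattern = new Regex(
        @"<span[^>]*class=['""]ocrx_word['""][^>]*title=['""]bbox (\d+) (\d+) (\d+) (\d+)[^'""]*['""][^>]*>(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<HocrWord> ParseWords(string hocr)
    {
        var words = new List<HocrWord>();
        foreach (Match m in WordPattern.Matches(hocr))
        {
            // Drop inline markup such as <strong> or <em> inside the word.
            string inner = Regex.Replace(m.Groups[5].Value, "<[^>]+>", "");
            words.Add(new HocrWord
            {
                X0 = int.Parse(m.Groups[1].Value),
                Y0 = int.Parse(m.Groups[2].Value),
                X1 = int.Parse(m.Groups[3].Value),
                Y1 = int.Parse(m.Groups[4].Value),
                Text = WebUtility.HtmlDecode(inner).Trim()
            });
        }
        return words;
    }

    public static void Main()
    {
        string sample =
            "<span class='ocrx_word' id='word_1_1' " +
            "title='bbox 36 92 96 116; x_wconf 90'>Hello</span>";

        foreach (HocrWord w in ParseWords(sample))
            Console.WriteLine($"{w.Text}: ({w.X0},{w.Y0})-({w.X1},{w.Y1})");
    }
}
```

A real converter would use a proper HTML parser such as HtmlAgilityPack, since a regular expression will miss markup that orders its attributes differently.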

Special thanks to the developers of:
http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe -- used to encode images to JBIG2
http://htmlagilitypack.codeplex.com/ -- used to parse the hOCR files
http://code.google.com/p/tesseract-ocr/ -- used for OCR and hOCR output
https://launchpad.net/cuneiform-linux/ -- used for OCR and hOCR output
http://itextpdf.com/ -- used to create PDFs
http://www.ghostscript.com/download/gsdnld.html -- used for PDF page extraction

Last edited Apr 9, 2014 at 12:58 PM by pwizzle, version 49