
Project Description
hOcr2Pdf.NET is a .NET library, written in C#, that converts the .hocr HTML output of Tesseract or Cuneiform into highly compressed, searchable PDFs. It builds on HtmlAgilityPack, Jbig2 and iTextSharp.


Features
  • Simple design: create or edit PDF files with PDFDoc.Open() or PDFDoc.Create()
  • Easily add new scanned image pages
  • OCR new or existing PDFs
  • Use different images for OCR and display
  • Optionally define the fonts used for OCR output, so the text layer sits exactly under the scanned image
  • Compress PDFs with Jbig2
  • Common utility methods for searching, rotating and bookmarking, and for setting attributes such as title and author
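
For readers unfamiliar with the .hocr input mentioned in the description, the sketch below shows roughly what such a file contains and how word positions can be read from it. It is an illustration only, not this library's code: the HocrSketch class, its ParseWords method and the sample fragment are hypothetical, and it uses System.Xml.Linq on a small well-formed fragment rather than HtmlAgilityPack, which the library uses for real files. It assumes each word is a span with class "ocrx_word" whose title attribute carries a "bbox left top right bottom" property in pixels.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class HocrSketch
{
    // Hypothetical hOCR fragment, shaped like typical Tesseract output.
    public const string Sample = @"<div class='ocr_page' title='bbox 0 0 2480 3508'>
  <span class='ocr_line' title='bbox 36 90 220 118'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 90 220 118; x_wconf 91'>world</span>
  </span>
</div>";

    // Returns each recognized word with its bounding box {left, top, right, bottom}.
    public static List<(string Text, int[] Box)> ParseWords(string hocr)
    {
        var result = new List<(string Text, int[] Box)>();
        var words = XElement.Parse(hocr)
            .Descendants("span")
            .Where(s => (string)s.Attribute("class") == "ocrx_word");

        foreach (var word in words)
        {
            // The title holds ';'-separated properties; take the one starting with "bbox ".
            string bbox = ((string)word.Attribute("title") ?? "")
                .Split(';')
                .Select(p => p.Trim())
                .First(p => p.StartsWith("bbox ", StringComparison.Ordinal));

            int[] box = bbox.Substring(5)
                .Split(' ')
                .Select(n => int.Parse(n))
                .ToArray();

            result.Add((word.Value, box));
        }
        return result;
    }

    public static void Main()
    {
        foreach (var (text, box) in ParseWords(Sample))
        {
            Console.WriteLine($"{text}: left={box[0]} top={box[1]} right={box[2]} bottom={box[3]}");
        }
    }
}
```

These boxes are what make a searchable PDF possible: each word is placed as text at the position where it appears in the scanned image, so that selecting or searching text lines up with the picture of the page.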

Special thanks to the developers of:
http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe -- used to encode images to JBIG2
http://htmlagilitypack.codeplex.com/ -- used to parse the hOCR files
http://code.google.com/p/tesseract-ocr/ -- used for OCR and hOCR output
https://launchpad.net/cuneiform-linux/ -- used for OCR and hOCR output
http://itextpdf.com/ -- used to create and edit PDFs
http://www.ghostscript.com/download/gsdnld.html -- used for PDF page extraction

Last edited Feb 17 at 2:36 PM by pwizzle, version 60