Project Description
hOcr2Pdf.NET is a .NET library that converts the .hocr HTML produced by Tesseract or Cuneiform into searchable PDFs using HtmlAgilityPack and iTextSharp. It is written in C#.
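
As background on the input format: hOCR records each recognized word's position in a title attribute of the form "bbox x0 y0 x1 y1", in image pixels measured from the top-left corner, whereas PDF coordinates are in points (1/72 inch) measured from the bottom-left. The sketch below is not taken from hOcr2Pdf.NET. HocrBox, TryParse and ToPdfPoints are hypothetical names, it uses a regular expression instead of HtmlAgilityPack so that it stays self-contained, and it assumes the page image is placed at its scanned DPI with no extra scaling.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of hOcr2Pdf.NET.
public static class HocrBox
{
    // Matches "bbox x0 y0 x1 y1" inside an hOCR title attribute.
    private static readonly Regex BBox =
        new Regex(@"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)");

    // Extracts { x0, y0, x1, y1 } in image pixels (origin at top-left).
    // Returns false if the title contains no bbox property.
    public static bool TryParse(string title, out int[] box)
    {
        box = new int[4];
        Match m = BBox.Match(title ?? "");
        if (!m.Success) return false;
        for (int i = 0; i < 4; i++)
        {
            box[i] = int.Parse(m.Groups[i + 1].Value);
        }
        return true;
    }

    // Converts a pixel box to PDF points { left, bottom, right, top }
    // (origin at bottom-left), assuming the image was scanned at 'dpi'.
    public static double[] ToPdfPoints(int[] box, int imageHeightPx, double dpi)
    {
        double scale = 72.0 / dpi; // points per pixel
        return new double[]
        {
            box[0] * scale,
            (imageHeightPx - box[3]) * scale, // flip the y axis
            box[2] * scale,
            (imageHeightPx - box[1]) * scale
        };
    }
}
```

For example, a word with title "bbox 300 600 900 660; x_wconf 93" on a 3300-pixel-high page scanned at 300 DPI maps to roughly left 72, bottom 633.6, right 216, top 648 in PDF points. That rectangle is where invisible text would be drawn over the page image to make the PDF searchable.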

NOTE: I have not used the Source Code repository feature in a long time. Downloads include all source code as well as a Win32 build of Cuneiform for Linux (with hOCR output).

Features:
1. Easily create searchable PDFs from images, specifying the image format for the PDF. Use the PdfReader class to compress and OCR existing PDFs.
2. Uses JBig2 compression for black-and-white and greyscale images, or Jpeg2000 for color images.
3. Can use either Cuneiform or Tesseract to perform OCR.
4. Add metadata to created PDFs (Author, Keywords, etc.).
5. Image deskewing and denoising.


hOcr2pdf.NET used in a multi-threaded GUI application for compressing and OCRing scanned PDFs

hocr2pdf.gui.jpg

hOcr2pdf.NET used in a command line program for compressing and OCRing scanned PDFs, as well as creating new PDFs from images

hocr2pdf.jpg

Special thanks to the developers of:
http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe --to encode images to jbig2
http://code.google.com/p/aforge/ --used for deskewing and image denoising
http://htmlagilitypack.codeplex.com/ --to parse the hocr files
http://code.google.com/p/tesseract-ocr/ --used for OCR and hocr output
https://launchpad.net/cuneiform-linux/ --used for OCR and hocr output
http://itextpdf.com/ --used to create pdfs
http://code.google.com/p/openjpeg/ --used for jpeg2000 compression
http://code.google.com/p/mupdf-converter/ --used for PDF image extraction

Usage:
You can combine several image files into one PDF or convert just a single file.

using System.IO; // Directory, Path

// dir and outDir are assumed to be the input and output folders, defined elsewhere.
string[] files = Directory.GetFiles(dir, "*.jpg");

PDFSettings settings = new PDFSettings();
settings.Author = "Pwizzle";
settings.ImageType = PdfImageType.JBig2;
settings.OcrMode = OcrMode.Cuneiform;

PdfCreator writer = new PdfCreator(settings, Path.Combine(outDir, "out.pdf"));

foreach (string f in files)
{
    // 3rd argument is for image deskewing, 4th for denoising
    writer.AddPage(f, PdfMode.Ocr, false, false);
}
writer.SaveAndClose();

Last edited May 8, 2012 at 3:00 AM by pwizzle, version 33