hOcr2Pdf.NET is a library that programmers can use to create highly compressed, searchable PDFs in their applications.

Requirements:
.NET 4.0 or higher
Tesseract 3.0 with the ability to produce hOCR files, or Cuneiform for Linux
JBig2.exe (included) in the same path as the dll
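
Because the library expects JBig2.exe next to its dll, a quick startup check can catch a missing file early. The following is a minimal sketch, not part of the library: the `Jbig2Check` class and `ExpectedPath` helper are hypothetical names, and the executing assembly is used as a stand-in for the hOcr2Pdf.NET dll (in a real project you would probably pass `typeof(PDFDoc).Assembly`, assuming PDFDoc is defined in that dll).

```csharp
using System;
using System.IO;
using System.Reflection;

static class Jbig2Check
{
    // Builds the path where JBig2.exe is expected: the folder that
    // contains the given assembly. (Hypothetical helper, not library API.)
    public static string ExpectedPath(Assembly assembly)
    {
        string dir = Path.GetDirectoryName(assembly.Location);
        return Path.Combine(dir, "JBig2.exe");
    }

    static void Main()
    {
        // Assumption: the executing assembly stands in for the library dll.
        string path = ExpectedPath(Assembly.GetExecutingAssembly());

        Console.WriteLine(File.Exists(path)
            ? "Found " + path
            : "Missing " + path);
    }
}
```

The sketch only reports whether the file exists; it does not verify that the executable runs.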

Major Classes:
PDFDoc (PDFDoc.Open() OR PDFDoc.Create())

Example Usage

Compress a PDF to JBig2.

    PDFDoc doc = PDFDoc.Open(file);
    doc.CompressJBig2();

Get a page image (JBig2 and JPEG 2000 pages require Ghostscript to be installed)

    PDFDoc doc = PDFDoc.Open(file);
    doc.GetPageImage(1);

OCR a PDF

    PDFDoc doc = PDFDoc.Open(file);
    doc.Ocr(Utils.OcrMode.Tesseract, "eng", WriteTextMode.Word, null);
 

Create a new PDF

    PDFDoc doc = PDFDoc.Create(file);
    doc.AddPage(img, PageSize.Letter);
    doc.Rotate(...);
    doc.Save();
    doc.Ocr(...);
    doc.Compress(...);
    doc.Save();

Get the object graph of an hOCR document

    hDocument d = OcrController.CreateHOCR(OcrMode.Tesseract, "eng", img);
    foreach (var p in d.Pages)
        foreach (var para in p.Paragraphs)
            foreach (var l in para.Lines)
                foreach (var w in l.Words)
                    Console.WriteLine(w.Text);

Tips

Be sure to Save() the PDF when using an image format that requires Ghostscript to extract. For example,
if you compress a PDF to JBig2 and then try to OCR it before calling Save(), the results are unpredictable. Save() writes any changes to disk so that Ghostscript can access the changed pages for image extraction.
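
The ordering described in this tip can be sketched as follows. This is a fragment in the style of the examples above, not a complete program: it uses only calls that already appear on this page, the `file` path is hypothetical, and the `using` line for the library is omitted because this page does not state its namespace. The final Save() is an assumption based on the "Create a new PDF" example, which also saves after OCR.

```csharp
// Hypothetical input path, for illustration only.
string file = @"C:\scans\input.pdf";

PDFDoc doc = PDFDoc.Open(file);

doc.CompressJBig2();   // pages are now JBig2, which needs Ghostscript to extract
doc.Save();            // write the change to disk so Ghostscript sees the new pages

doc.Ocr(Utils.OcrMode.Tesseract, "eng", WriteTextMode.Word, null);
doc.Save();            // assumed: persist the OCR result as well
```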

Last edited Feb 17, 2015 at 3:25 PM by pwizzle, version 34

Comments

mail2vguna Aug 21, 2015 at 8:46 AM 
I am able to build the application without any error.

If I give a PDF file as input, I get an output PDF file, but the output file is not a text-searchable PDF. It is like a copy of the input file.

Also, while processing the input file, I got the following 3 files in the C:\Users\USER\AppData\Local\Temp\Clock.hocr folder.

1. a PDF file, 2. a BMP file, and 3. an OCR text file.

Am I doing anything wrong? I need the output PDF file to be a text-searchable PDF.

How can I get it? This is very urgent for me.

Kindly do the needful.

Looking forward to your reply.

sonealtx Oct 3, 2014 at 7:54 PM 
Hi, this is a very nice project and is very useful to me. I was wondering if you might want a hand to re-factor / re-work some of the code? Let me know, I'd be happy to try to help you.

pwizzle Nov 29, 2013 at 5:00 AM 
I was never happy with the J2K output and have removed it from the library completely. My main focus has been on JBig2. Since the library just uses command line utilities to compress the images, you can take a look at Jbig2.cs and see how I'm doing it. However, I've used the library in several web applications with success.

puredevelopers Jun 18, 2013 at 1:56 AM 
Would it be possible for you to put together a web application with just the pdf compression component?
I will look at the code you have above, but I downloaded the project, and I am getting errors when I try to debug it.
The J2k.cs file is bombing on the ProcessImage method. It appears that TempData is null.
I am not sure if I am missing, or need to download, additional DLLs.