Use hOcr2Pdf on German blackletter (Fraktur) text

Apr 30, 2012 at 12:43 AM

This tool looks great for what I want to do, which is to create a searchable image PDF from an image-only PDF of German blackletter/Fraktur (old-style font, see here: http://en.wikipedia.org/wiki/File:Gebrochene_Schriften_klein.png) text. I have installed Tesseract from the .exe installer available on Google Code (https://code.google.com/p/tesseract-ocr/) using the default settings, adding German-Fraktur as a language. This means that Tesseract is located at "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".

I downloaded the hocr2pdf.zip file and extracted the contents to my "C:/Program Files (x86)" directory. When I open "C:\Program Files (x86)\PdfCompressor\bin\Release\PdfCompressor.exe" and try to run OCR on the PDF using Tesseract as the engine, the program freezes and I get the error message "PdfCompressor has stopped working."

I assume that something is set up improperly, but I am not a programmer by any means (I am a historian who works with old German texts!), just someone with slightly higher-than-average computer skills. I apologize for my ignorance, but I would be profoundly grateful if someone could help me out with this. Thanks!

Apr 30, 2012 at 3:22 AM

It looks like Tesseract correctly added itself to the PATH.
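
For anyone who wants to double-check the PATH entry, here is a minimal Python sketch. It assumes Python 3 is installed, and the helper name find_executable is made up for illustration. It only reports whether an executable called "tesseract" can be found on the PATH; it says nothing about whether PdfCompressor can use it.

```python
import shutil


def find_executable(name="tesseract"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    # shutil.which searches the directories listed in the PATH variable
    # (and honours PATHEXT on Windows, so "tesseract" matches tesseract.exe).
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path:
        print("tesseract found at:", path)
    else:
        print("tesseract is not on PATH")
```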

I tried using PDFCompressor.exe without OCR, just deskew (for example) and that also did not work.

I have .NET Framework 4 Client Profile installed.

Coordinator
May 1, 2012 at 11:44 PM

If you could, please email me the PDF you're having problems with and I'll debug the program. The PDF may contain a type of image that is causing the problem. The next version that I'm working on uses the MuPDF library for parsing images and should do a much better job of supporting all of the image types that a PDF can contain.

Coordinator
May 1, 2012 at 11:46 PM

The program was written to help me out with my daily office work, and I would love to have it help you out as well.

Coordinator
May 3, 2012 at 8:58 PM
This discussion has been copied to a work item.
Apr 18, 2013 at 1:36 PM
Hi pwizzle,

I have the same goal and have run into the same problems as dhensley2.

I have .NET Framework 4 installed. I installed the Tesseract engine to the default directory. After that I extracted the latest release of hocrtopdf to a folder of my choice. I then ran "PdfCompressor.exe" from the "sample app" directory, added one PDF, chose the output directory, and changed the engine to "tesseract" and the language to "deu-frak" (German Gothic, which I had installed earlier with Tesseract).
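
To narrow down whether Tesseract itself or PdfCompressor is at fault, one option is to run Tesseract directly on a single page image with the same language. The Python sketch below is only an illustration under some assumptions: the function names and the file names page.png and page are hypothetical, the page has to be exported from the PDF as an image first (Tesseract reads images, not PDFs), and "tesseract" must be on the PATH. The trailing "hocr" argument is the Tesseract config that requests hOCR output.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="deu-frak"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang> hocr."""
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def run_tesseract(image_path, output_base, lang="deu-frak"):
    """Run Tesseract once and return (exit_code, messages)."""
    cmd = build_tesseract_command(image_path, output_base, lang)
    try:
        # Tesseract prints its status and error messages to stderr.
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # Raised when no "tesseract" executable can be found on PATH.
        return None, "tesseract executable not found on PATH"
    return result.returncode, result.stderr
```

Calling run_tesseract("page.png", "page") should return exit code 0 and leave an hOCR file next to the output base if Tesseract and the deu-frak language data are working. If that succeeds while PdfCompressor still hangs, the problem is more likely in how the PDF is being read than in the OCR engine.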

When I pressed the "Start" button, a small new file (15 KB) was created in the output directory, but nothing else happened, and no message appeared in the "progress" field. Sometimes that was all; other times I got the error message "PdfCompressor has stopped working."

Even though you seem to have fixed the problem already, I don't know what I did wrong. I too am a historian, not a programmer, and I would be very grateful if someone could help me out!

Thanks and cheers.
Coordinator
Jun 6, 2013 at 6:44 PM
I don't mind. Would it be possible to send me a sample PDF file of what you are using so that I can try to fix the issue?
Jun 20, 2013 at 12:52 PM
Edited Jun 20, 2013 at 12:57 PM
Thanks for answering. I'll send you the PDF files and a screenshot of PdfCompressor.exe running while nothing happens.