pdfcompresser.exe doesn't work

Jul 17, 2013 at 9:54 AM
Edited Jul 17, 2013 at 12:37 PM

I've now figured out what I did wrong. It seems essential that
  • the whole hocrtopdf.net folder lies in the same path as the tesseract engine (c:/program files/) and
  • the pdf files you'd like to ocr are placed in the hocrtopdf.net folder (and not anywhere else; the output folder, however, can be chosen freely).
Under these conditions pdfcompresser.exe did work for the ocr.

BUT: it did not work when I chose OCR + 'deskew' or 'denoise'.

The much bigger issue, however, was that in the output file the text underlying the pdf image was completely in the wrong place, i.e. the invisible text layer did not line up at all with the letters in the image. It seems that either the coordinates in the hocr are not correct or something went wrong when the hocr was matched with the image.
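
One thing worth checking with this kind of offset: hOCR bounding boxes are given in image pixels with the origin at the top-left corner, whereas PDF user space is measured in points (1/72 inch) with the origin at the bottom-left. If the DPI used for the conversion doesn't match the DPI the page was rasterised at, or the y-axis isn't flipped, the text layer ends up misplaced. Whether that is what actually happens inside hocrtopdf.net I can't say; the following is only a minimal Python sketch of the conversion, and the function name and its parameters are made up for illustration.

```python
import re


def hocr_bbox_to_pdf(title, dpi, page_height_px):
    """Convert an hOCR 'bbox x0 y0 x1 y1' entry to PDF points.

    hOCR: pixels, origin top-left, y grows downwards.
    PDF:  points (1/72 inch), origin bottom-left, y grows upwards.
    'dpi' and 'page_height_px' must describe the image that was OCRed
    (illustrative parameters, not taken from hocrtopdf.net).
    """
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError("no bbox found in hOCR title attribute")
    x0, y0, x1, y1 = (int(v) for v in match.groups())

    def to_points(px):
        return px * 72.0 / dpi

    # Flip the y-axis: the bottom edge of the box (y1) becomes the lower y in PDF space.
    return (
        to_points(x0),
        to_points(page_height_px - y1),
        to_points(x1),
        to_points(page_height_px - y0),
    )
```

For example, on a page scanned at 300 DPI that is 3300 pixels tall, a word with `bbox 300 600 900 660` should end up at roughly (72.0, 633.6, 216.0, 648.0) in PDF points. If the converter assumed 72 or 96 DPI instead, every box would be scaled wrongly, which would look much like the misplacement described above.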

Is this a known issue? Can anyone help me fix it?

Thanks for your help.


Hi pwizzle,

Me again. I think your project sounds awesome, since there is really a need for such an app. Unfortunately I'm no programmer, and although I think of myself as an experienced Windows user, I simply don't get how I have to run the pdfcompresser.exe.

Here is what I've done:
0.) I installed tesseract with the .exe-file, it sits in c:/program files/tesseract-ocr

1.) I downloaded the .7z-file from this homepage
2.) I extracted it to an arbitrary folder: it included three folders ('pdf compressor', 'sample app', 'source') and two archives ('cuneiform' and 'hocrtopdf.net binaries')
3.) Since the 'pdf compressor' folder was empty, I went to the 'sample app' folder and opened the 'pdfCompressor.exe'

4.) Everything seemed to be okay so far: I could see the gui, I added (multipage) pdf-files, chose the output-directory (which was different from the source-directory of the input files), set the engine to 'tesseract' and the 'language' to deu-frak (which is actually installed, I tested it with the tesseract command line); everything else I left as it came as default.
5.) Now I clicked on 'start' but nothing happened. The program just sits there and nothing seems to happen (someone else in the forum seemed to have the same problem). So after a few minutes I decided to exit the program.
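
To find out whether the hang comes from the GUI or from the OCR engine, it may help to run Tesseract by hand on a single page image with the same language setting. A minimal Python sketch, assuming `tesseract` is on the PATH and the page has already been exported as an image (Tesseract does not read PDFs directly); the helper names and the file names are placeholders:

```python
import subprocess


def tesseract_hocr_command(image_path, output_base, lang="deu-frak"):
    """Build the command line for a single hOCR run.

    'hocr' is the Tesseract config that switches the output to hOCR;
    the output file name is derived from output_base by Tesseract.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


def run_tesseract_hocr(image_path, output_base, lang="deu-frak"):
    """Run Tesseract once and return the completed process for inspection."""
    cmd = tesseract_hocr_command(image_path, output_base, lang)
    # check=False so that a failing run can still be inspected via stderr.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

If `run_tesseract_hocr("page.png", "page")` finishes with return code 0 and writes an hOCR file, the engine and the language data are fine and the problem is more likely in how the GUI locates its tools or input files.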

So I tried the following things:
  • I copied the whole 'hocrtopdf.net' folder (the one that was in the archive I downloaded) to my path directory (c:/program files) and launched 'pdfCompressor' from there.
  • I extracted the cuneiform archive to my path directory, where tesseract also lies.
But so far it just didn't run. What did I fail to do? In case the pdfs I used are the problem, would it be possible to provide a sample pdf that you tested successfully?

I'm sorry for bothering you, but I hope that after these detailed explanations you can see where my problem lies.

Furthermore, I hope that a precise description of what one has to do if one wants to use this nice tool may help inexperienced users who just would like to make a searchable pdf but don't get how to make the program run.

Thanks a lot for your effort and your time!
Apr 2, 2014 at 2:55 PM
Almost a year later and....

I never paid much attention to the PDFCompressor program as it was a scratchpad for testing the library. This has been a project that was helpful for me at work because commercial solutions were too expensive.

A lot of changes have happened since July 2013, but if you're still around, I've attached a simple pdfCompressor that will make searchable PDFs from existing ones.

You need to install Ghostscript and Tesseract for it to work. Also, it only works for non-Unicode English text.
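
A quick way to confirm that both prerequisites can be found is to look them up on the PATH. This is only a sketch under the assumption that the tools are located via the PATH; the program itself may look elsewhere. The helper names are made up, while `gswin64c`, `gswin32c` and `gs` are the usual Ghostscript console executable names on Windows and Unix.

```python
import shutil

# Executable names to try for each prerequisite.
REQUIRED_TOOLS = {
    "Tesseract": ["tesseract"],
    "Ghostscript": ["gswin64c", "gswin32c", "gs"],
}


def find_tool(candidates):
    """Return the full path of the first candidate found on the PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def missing_tools(required=REQUIRED_TOOLS):
    """Return the labels of all tools for which no executable was found."""
    return [label for label, names in required.items() if find_tool(names) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract and Ghostscript were both found.")
```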

You can download it here.