
Strange result with tesseract-generated Cyrillic text HTML

Jan 17, 2012 at 10:46 AM
Edited Jan 17, 2012 at 1:31 PM

hi! first, thanks for hocr2pdf, it's very useful.

but i have a problem with html files made by tesseract from an image with russian text

when i do:

tesseract test.jpg test -l rus hocr

hocr2pdf -t test.txt -i test.jpg -o test.pdf < test.html

i get a normal test.txt file with russian text, but a really strange pdf with something like "/ =57=0N, 70G5< :><C =C6", etc

html file looks perfectly normal, it contains russian words between tags

everything is ok when i use an image with english text

please tell me where the problem could be. maybe i need some extra libs?

thank you for any help.

Apr 28, 2012 at 4:58 AM

In the Parser.cs file, you may want to try changing the encoding that is used to read the hOCR file.

Change the doc.Load call in the ParseHOCR method to something like this:

doc.Load(hOcrFile, Encoding.UTF8);
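
To show why passing the encoding explicitly can matter, here is a small standalone sketch. It is not the project's code: the EncodingDemo class and its RoundTrip helper are made up for illustration, and it uses only System.IO and System.Text rather than the parser's own doc object. It writes a Cyrillic word as UTF-8 bytes and reads it back twice, once as UTF-8 and once as ISO-8859-1, assuming the hOCR file from tesseract is UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: stores `text` as UTF-8 bytes (no BOM) in a temp file,
    // then reads the file back using the encoding given in `readAs`.
    public static string RoundTrip(string text, Encoding readAs)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, Encoding.UTF8.GetBytes(text));
            return File.ReadAllText(path, readAs);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        // "текст" written with escapes so the source file's own encoding does not matter.
        string sample = "\u0442\u0435\u043A\u0441\u0442";

        // Read with the matching encoding: the text survives.
        Console.WriteLine(RoundTrip(sample, Encoding.UTF8) == sample); // True

        // Read with a single-byte encoding: every UTF-8 byte becomes its own character.
        Console.WriteLine(RoundTrip(sample, Encoding.GetEncoding("iso-8859-1")) == sample); // False
    }
}
```

If the Russian text is still garbled in the PDF after the change in Parser.cs, then the file is probably being read correctly and the problem is somewhere later, for example where the text is written into the PDF.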

May 3, 2012 at 9:02 PM
This discussion has been copied to a work item.