Closed

Mangled text format in resulting PDF?

description

The issue I'm still having is that when I copy and paste text out of the resulting PDF, the text comes out completely garbled. A basic Tesseract text conversion of the same .tif file yields this:
 
"We've sort of been brainwashed as a culture to believe that our children are the leastcompetent generation to roam the earth," Ms. Skenazy said.
 
But the equivalent text cut-and-pasted from the resulting PDF created using this library is:
 
"Weâ€s™ovreotfbeenbrainwasahseadcultutroebelietvheaoturchildraernetheleastcompetegnetnerattioornoamtheearthM,sâ.S€kenazsyaid
 
The interim *.hocr.html file looks OK when opened in a web browser, so the problem seems to be in the PDF creation step.
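
One side observation: the stray `â`, `€`, and `™` characters scattered through the pasted text are what a UTF-8 right single quotation mark turns into when its bytes are read as Windows-1252. The short Python sketch below only illustrates that encoding mismatch. It assumes nothing about how this library actually writes the PDF text layer, and the variable names are made up for illustration.

```python
# Illustration only: shows how a curly apostrophe becomes "â€™" when
# UTF-8 bytes are decoded with the wrong (Windows-1252) code page.
# This is a guess at one symptom, not a diagnosis of the library.

apostrophe = "\u2019"  # right single quotation mark, as in "We’ve"

# Encode correctly as UTF-8, then decode with the wrong code page.
utf8_bytes = apostrophe.encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)

# Reversing the mistake recovers the original character.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == apostrophe)
```

This would account for the odd symbols, but not for the interleaved letters or the missing spaces, which look like a separate problem with how the words are placed in the PDF.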
 
Any thoughts? Is this a known issue or am I doing something wrong?
Closed Apr 9, 2014 at 3:45 AM by pwizzle

comments

pwizzle wrote Apr 9, 2014 at 3:45 AM

The new version should work correctly.
