Mangled text format in resulting PDF?

May 23, 2011 at 2:14 AM

The issue I'm still having is that when I copy and paste the text out of the resulting PDF, the formatting is completely messed up. A basic Tesseract text conversion of the same .tif file yields this:

"We’ve sort of been brainwashed as a culture to believe that our children are the leastcompetent generation to roam the earth,” Ms. Skenazy said.

But the equivalent text cut-and-pasted from the resulting PDF created using this library is:

"Weâ€s™ovreotfbeenbrainwasahseadcultutroebelietvheaoturchildraernetheleastcompetegnetnerattioornoamtheearthM,sâ.S€kenazsyaid

Opening the interim *.hocr.html file in a web browser looks OK too, so the problem seems to be in the PDF creation step.

Any thoughts?  Is this a known issue or am I doing something wrong?

Coordinator
May 28, 2011 at 5:44 AM

For some reason, certain characters get converted into their code equivalents. There's a method called CleanText that you can call to convert certain characters back into what they should be. I've been on vacation the past few weeks and will look at these issues when I get home. I'll also look into the missing spaces between words then. I'm not sure if it's a bounding box issue or if I need to append a space after each word.
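
For what it's worth, the "â€™" in the pasted text is what you typically get when the UTF-8 bytes of a curly apostrophe are read back as Windows-1252. Here is a rough Python sketch of that effect and of the two fixes mentioned above. It is not the library's code: `mojibake`, `repair`, and `join_words` are made-up names for illustration, and this is not how CleanText is actually implemented. It also assumes the wrong decoding really is Windows-1252, which I haven't confirmed.

```python
def mojibake(text: str) -> str:
    """Simulate the suspected bug: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse the misreading; leave the text alone if it does not round-trip."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind (or only partly damaged): return unchanged.
        return text


def join_words(words):
    """Join recognized words with single spaces, skipping empty entries."""
    return " ".join(w.strip() for w in words if w.strip())


if __name__ == "__main__":
    broken = mojibake("We\u2019ve")
    print(broken)          # Weâ€™ve
    print(repair(broken))  # We’ve
    print(join_words(["We\u2019ve", "sort", "of", "been"]))
```

Note that a closing curly quote will not survive the same round trip, because one of its UTF-8 bytes has no Windows-1252 character. That may be why the end of the pasted sentence shows only "â" and "€".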

Coordinator
May 28, 2011 at 5:46 AM
This discussion has been copied to a work item. Click here to go to the work item and continue the discussion.