The issue I'm still having is that when I cut-and-paste the text out of the resulting PDF, the formatting is completely messed up. A basic Tesseract text conversion of the same .tif file yields this:
"We’ve sort of been brainwashed as a culture to believe that our children are the leastcompetent generation to roam the earth,” Ms. Skenazy said.
But the equivalent text cut-and-pasted from the resulting PDF created using this library is:
The interim *.hocr.html file looks OK when opened in a web browser, too, so the problem seems to be in the PDF creation step.
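In case it helps narrow this down, here is a rough sketch of how the two text versions could be compared, using only Python's standard `difflib`. The function names are my own, not part of this library or of Tesseract, and the idea is just to tell apart "same words, wrong spacing or line breaks" from "words in the wrong order":

```python
import difflib


def words(text):
    # Split on any whitespace so spacing and line breaks are ignored.
    return text.split()


def word_similarity(reference, candidate):
    # Ratio between 0 and 1 describing how closely the word sequences match.
    matcher = difflib.SequenceMatcher(None, words(reference), words(candidate))
    return matcher.ratio()


def same_words_different_order(reference, candidate):
    # True when both texts contain the same words but in a different sequence,
    # which would point at the ordering of the text layer in the PDF.
    ref, cand = words(reference), words(candidate)
    return sorted(ref) == sorted(cand) and ref != cand
```

Passing the plain Tesseract output as `reference` and the text pasted from the PDF as `candidate` should show whether the words themselves survive and only their order or spacing changes.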
Any thoughts? Is this a known issue or am I doing something wrong?