Broken resulting text in searchable PDF

Apr 18, 2014 at 8:44 PM
Tesseract-OCR returns this:

"We are short sellers. We are biased. So are long investors. So is the company. "

but it becomes like this in final pdf:

"Wea res horts ellersW. ea reb iasedS. oa rel ongi nvestorSs.o
is thec ompan."

Any thoughts?

Thank you
Coordinator
Apr 19, 2014 at 1:55 AM
Would you mind attaching the image or pdf that you are using so that i can check it out?
Apr 19, 2014 at 2:27 AM


Hi,

Thank you for getting back to me on this matter.

I have attached two zip files:

1) Broken_TextContent.zip: This file contains the original PDF file (PwC 000420.pdf) and its searchable version (PwC 000420_cmp.pdf).
You will clearly be able to see what I am talking about.

2) PoorQuality.zip: This file contains a very poor quality image PDF file (poorQuality.pdf) and its searchable version (poorQuality_cmp.pdf). Can you suggest any ways to improve the quality of this?

Appreciate your help on this.
Thank you,

HITESH.
Please consider the environment before printing this e-mail, thank you.




Coordinator
Apr 19, 2014 at 2:32 AM
It doesn't look like the zip files came through. Want to try again?
Apr 19, 2014 at 2:41 AM


Thank you,

HITESH.




Coordinator
Apr 19, 2014 at 2:43 AM
I would try setting the write mode to line.
Sometimes, if the font is a little smaller and the words are close together, there is some overlapping in the underlay text in the PDF.
Writing by line mode will write the text in a line instead of by each word (default).
            PdfCreator p = new PdfCreator("outputPDF");
            p.PDFSettings.WriteTextMode = WriteTextMode.Line;
Give it a shot and please let me know. I will test it if I get your files and see if I need to make any code changes.
Apr 19, 2014 at 2:53 AM
Hi,

Changing to "WriteTextMode.Line" made it work perfectly, thank you so much.

Sorry for my poor knowledge, but can you please explain when to use WriteTextMode.Word mode and when to use WriteTextMode.Line mode, and the pros and cons of each?

By the way, I also sent you a link to the files on OneDrive, in case you did not receive the files through my second email either.

If you got those two files, any thoughts about the poor quality image PDF?

Thank you,

HITESH.




Coordinator
Apr 19, 2014 at 3:18 AM
Edited Apr 19, 2014 at 4:31 AM
Typically, when the font size is below 10 or the words are close together, Line mode works better. If the font size is above 10, Word mode yields good results. The drawback of Line mode is that the words aren't always lined up properly with the image of the word. I'm working on a better solution in the code to automatically adjust the font size when writing underlay text to prevent it from overlapping.
In PDFCreator.cs there's a method called WriteUnderlayContent. When the write mode is Word, check the height of the bbox for the word. If the height is below 10, set the font to something smaller to prevent overlapping.
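       // Use the word's bounding-box height as the starting font size.
       double size = Math.Floor(b.Height);
       // Shrink small text by one point so adjacent words do not overlap.
       if (size < 10)
           size = size - 1;
       // Fall back to a size of 2 when the bounding box has no height.
       cb.SetFontAndSize(base_font, b.Height > 0 ? (int)size : 2);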
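To experiment with this rule outside the library, it can be pulled out into a small standalone helper. This is only a sketch: `UnderlayFont.SizeFor` is a hypothetical name, not part of hocr2pdf.net, and the lower bound of 2 for very small boxes is an assumption added here so the size never drops to zero or below.

```csharp
using System;

public static class UnderlayFont
{
    // Hypothetical helper, not part of hocr2pdf.net.
    // Maps a word's bounding-box height to an underlay font size.
    public static int SizeFor(double bboxHeight)
    {
        const int MinimumSize = 2;      // assumed lower bound
        if (bboxHeight <= 0)
            return MinimumSize;

        double size = Math.Floor(bboxHeight);
        if (size < 10)
            size -= 1;                  // shrink small text by one point

        return Math.Max((int)size, MinimumSize);
    }
}
```

With this sketch, `UnderlayFont.SizeFor(8.4)` gives 7 and `UnderlayFont.SizeFor(12.7)` gives 12.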
Coordinator
Apr 19, 2014 at 3:52 AM
I'm sorry, but I'm not sure what you can do about the poor quality images. If you could get them in greyscale at a higher DPI and run a median filter on them, that would help Tesseract a lot. They look like faxes though, and it doesn't look like you have many options. You could try Cuneiform instead of Tesseract and see if it is a little better.

You can get a win32 version of cuneiform from an older version here
    PdfCreator p = new PdfCreator(pdf.Replace(".pdf", "-NEW.pdf"));
    p.PDFSettings.WriteTextMode = WriteTextMode.Line;
    p.PDFSettings.PdfOcrMode = OcrMode.Cuneiform;
Hope this helps.
Apr 19, 2014 at 5:26 PM
Hi,

Thank you for the reply on this poor quality image.

I cannot seem to make ImageMagick work to apply a median filter.

Would you be able to give me the ImageMagick command line arguments for Windows to apply a median filter?

Thank you,

HITESH.




Coordinator
Apr 19, 2014 at 7:33 PM
I would try the AForge image processing library. It's easy to use and does a good job.
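For reference, a median filter replaces each pixel with the median of its neighbourhood, which removes isolated specks without blurring edges as much as averaging does. The following is a minimal sketch of a 3x3 median over a greyscale image held in a plain `byte[,]` array. It is not AForge's implementation; `MedianSketch` is a hypothetical name, and leaving the border pixels unchanged is an assumption made for brevity.

```csharp
using System;

public static class MedianSketch
{
    // 3x3 median filter over a greyscale image stored as [row, column].
    // Border pixels are copied unchanged (an assumption made for brevity).
    public static byte[,] Apply3x3(byte[,] src)
    {
        int rows = src.GetLength(0);
        int cols = src.GetLength(1);
        var dst = (byte[,])src.Clone();
        var window = new byte[9];

        for (int y = 1; y < rows - 1; y++)
        {
            for (int x = 1; x < cols - 1; x++)
            {
                // Collect the 3x3 neighbourhood around (y, x).
                int k = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        window[k++] = src[y + dy, x + dx];

                Array.Sort(window);
                dst[y, x] = window[4];  // middle of the nine sorted values
            }
        }
        return dst;
    }
}
```

A single bright speck surrounded by uniform pixels is replaced by the surrounding value, which is the kind of fax noise that tends to confuse OCR.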
Apr 19, 2014 at 9:01 PM
Thank you so much, I will give it a try.

Thank you
Hitesh Zinzuwadia

May 2, 2014 at 11:02 AM
Hi,

Can you please give me a sample project code to use this hocr2pdf library?
May 2, 2014 at 12:08 PM
It works as-is from the hocr2pdf.net downloads.


Thank you,
Hitesh Zinzuwadia

May 2, 2014 at 12:22 PM
The link is not working. If you have any sample project code, please share with me. It will be helpful to me.
May 2, 2014 at 7:24 PM
I tried to email you, but Hotmail rejected the attachment because of the file size. I am sure I downloaded it from the CodePlex page only.


Thank you,
Hitesh Zinzuwadia

Coordinator
May 2, 2014 at 10:07 PM
Edited May 2, 2014 at 10:09 PM
Download the source code here and look at the hocr2pdf-gui project.

It is a sample application that uses the library.

You can also look at the documentation page.
May 5, 2014 at 8:19 AM
I am having a problem loading the image from the PDF file:
"Image not supported in 26.pdf. Skipping".
How can I solve this problem?
Coordinator
May 6, 2014 at 2:13 AM
Please send me a copy of the PDF that you're having problems with and I will debug it and let you know.
May 6, 2014 at 12:00 PM
Hi,
Did you get the PDF file?
Coordinator
May 6, 2014 at 6:54 PM
No I'm sorry. I didn't get it.
May 6, 2014 at 8:08 PM
Edited May 6, 2014 at 8:32 PM
use this link to get the pdf file:
https://dl.dropboxusercontent.com/u/55789999/26.pdf

can you access the file?
I am waiting for your reply....
Coordinator
May 6, 2014 at 8:58 PM
I got it. Thanks.
Coordinator
May 6, 2014 at 9:51 PM
It appears to be a bug. I will work on fixing it tonight.
Coordinator
May 7, 2014 at 2:43 AM
It should be fixed. I've updated the source code and the latest release download to include the fix.
May 7, 2014 at 2:47 AM
Just for my knowledge, do you mind sharing what code change you implemented?


Thank you,
Hitesh Zinzuwadia

Coordinator
May 7, 2014 at 3:21 AM
The JBIG2.cs was not encoding the bitmap from the PDF.

This is the method I changed and what it was changed to:
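        public iTextSharp.text.Image ProcessImage(Bitmap b)
        {
            // Convert the bitmap taken from the PDF, passing its horizontal resolution.
            var img = ImageProcessor.GetAsBitmap(b, (int)b.HorizontalResolution);

            // Save it to a temporary JPEG file and hand that to the other ProcessImage overload.
            var s = TempData.Instance.CreateTempFile(".jpg");
            img.Save(s, ImageFormat.Jpeg);

            return ProcessImage(s);
        }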
May 7, 2014 at 6:12 AM
Edited May 7, 2014 at 6:16 AM
It is still not working after the modification in JBIG2.cs. Can you please send me the working project source code?
May 7, 2014 at 10:11 AM
Now it is working. But you can see that the characters are not placed properly in the PDF file. I tested the program in both WriteTextMode.Word and WriteTextMode.Line modes. In WriteTextMode.Line mode, the system cannot recognize many characters. In WriteTextMode.Word mode, it recognizes some more components, but there is still a problem inserting the text information into the PDF file.
Coordinator
May 7, 2014 at 1:07 PM
The source code is updated. You can download it here
May 7, 2014 at 1:18 PM
Hi,
Did you fix the bug regarding the text information in the PDF file?
Coordinator
May 7, 2014 at 1:38 PM
Here's the result I get. Cuneiform won't even OCR it because the image is so large. You could try to resize the image first and see if the results are any better. I've also found that Tesseract doesn't do that well when dealing with pages that aren't simple paragraphs.
May 7, 2014 at 1:43 PM
Hi,
If I don't want to compress the PDF file, what do I have to do?
Coordinator
May 7, 2014 at 1:52 PM
Can I see your results?
May 7, 2014 at 2:05 PM
My output PDF is: https://dl.dropboxusercontent.com/u/55789999/26-NEW.pdf
It is the same as what you get. I think it would be better if we don't compress the PDF file. You can see that, because of compression, the size of the PDF file comes down from 161KB to 56KB. So I want to run the system without compressing the PDF. Can you help me?
Coordinator
May 7, 2014 at 2:10 PM
The file is compressed after OCR is performed so it wouldn't make a difference.

I can work on it and see if I can make it better, but if words just aren't being OCR'd then that's a limitation of Tesseract and, unfortunately, there's not much I can do about that.
May 7, 2014 at 2:28 PM
So, how can I use the updated Tesseract OCR system? I am familiar with Tesseract-OCR, but how do I use the latest OCR in the hOcr2pdf system?
May 7, 2014 at 2:48 PM
Edited May 7, 2014 at 2:49 PM
I just want to save the PDF file in two different formats: 1) compressed and 2) uncompressed. How can I do this?
Coordinator
May 7, 2014 at 2:54 PM
Here's my sample code. You can change the PdfImageType to Jpeg for non-compressed or JBig2 for compressed. The hocr2pdf-gui application currently only compresses PDFs to JBIG2.
            string pdf = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "26.pdf");
            TempData.Instance.CleanAll();
            Clock.Pdf.PdfReader r = new Clock.Pdf.PdfReader(pdf);
            PdfCreator p = new PdfCreator(pdf.Replace(".pdf", "-NEW.pdf"));
            p.PDFSettings.WriteTextMode = WriteTextMode.Word;
            p.PDFSettings.ImageType = PdfImageType.JBig2;
            p.PDFSettings.PdfOcrMode = OcrMode.Tesseract;
            for (int i = 1; i <= r.PageCount; i++)
            {
                var img = r.GetPageImage(i, true);

                // var m = Clock.ImageProcessing.ImageProcessor.ResizeImage(Image.FromFile(img), 300, 25);

                p.AddPage(img, PdfMode.Ocr);
            }
            p.SaveAndClose();
Coordinator
May 7, 2014 at 2:56 PM
You can check for a Tesseract update here.
May 7, 2014 at 3:00 PM
I know about Tesseract-OCR. I just want to know how to use the latest version of Tesseract OCR in this hOcr2pdf system.
Coordinator
May 7, 2014 at 3:13 PM
hocr2pdf will automatically use whatever version of Tesseract you have installed so there's nothing extra that needs to be done.
May 8, 2014 at 12:49 PM
Hi,
I tested the image from the 26.pdf file in the latest Tesseract-OCR engine. The engine gives proper output in hOCR format, but the hOcr2pdf system can't create the PDF file properly. Can you please look into the bug?
Coordinator
May 8, 2014 at 3:27 PM
Yah I will look into it.
Coordinator
May 9, 2014 at 3:56 AM
Ok I did see a problem. I've updated the source code and will update the release tomorrow.
May 9, 2014 at 6:18 AM
Edited May 9, 2014 at 6:21 AM
Have you already updated the code? I tested with the latest updated code and it is still not working. I think you may be using some image processing method in this new update. Because of this modification, some portions of the image (some line-like structures in the upper-right and lower-right portions) are deleted. So please focus on this problem. On the other hand, I found some problems using Tesseract-OCR due to compression of the image. I have experience in image processing. If you need any help, I can help you.
Coordinator
May 9, 2014 at 2:28 PM
You're right. The Jbig2 encoder is removing parts of the image. Perhaps it's a bug on such large images.

I've come up with a solution to stamp the existing PDF with OCR data. This will add the OCR data while leaving the existing PDF structure untouched (images, bookmarks, etc.).

The code will be something like

            PdfReader reader = new PdfReader(pdf);
            var bytes = reader.StampOcrContent(hocr_file_path, new PDFSettings());

            File.WriteAllBytes(pdf.Replace(".pdf", "-NEW.pdf"), bytes);
How does that sound?
May 9, 2014 at 2:37 PM
Can you please write this code out properly in the Program.cs file of Scratchpad?
Or just send me the Program.cs file as you would write it.
Coordinator
May 9, 2014 at 5:17 PM
I've updated the source code so that you can see the changes.

It works well if you want to keep the original image in the pdf.