
Closed

Reserve processing

description

Dear pwizzle,

PDFReader has some methods that extract text from a PDF file, but I don't see any method that extracts text together with its coordinates from a text-based or searchable PDF. Could you add this feature to a new version of the hOcr2Pdf.NET library?

Rgs,

Quoc Tuan.


Closed Apr 9, 2014 at 3:44 AM by pwizzle

comments

pwizzle wrote Apr 6, 2014 at 3:23 AM

Hi Quoc,

I've updated SVN with the latest code changes.

I've added GetWordLocationsForPage(int Pagenumber) in the PDFReader class.

It returns an IList<hOcrClass> collection that contains the BBox info (in points, not pixels) and the text.

I hope this helps,

PW
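
Because the BBox values are reported in points rather than pixels, a caller that wants to map them onto a rendered page image has to convert them first. The sketch below relies only on the fact that a PDF point is 1/72 inch. The PointConversion class, the PointsToPixels helper, and the sample values are hypothetical and not part of hOcr2Pdf.NET, and the rendering resolution (300 DPI here) is an assumption that must match however the page image was produced.

```csharp
using System;

// Hypothetical helper, not part of hOcr2Pdf.NET.
public static class PointConversion
{
    // A PDF point is 1/72 of an inch.
    public const double PointsPerInch = 72.0;

    // Converts a length in points to pixels at the given resolution (DPI).
    public static int PointsToPixels(double points, double dpi)
    {
        return (int)Math.Round(points * dpi / PointsPerInch);
    }

    public static void Main()
    {
        // Made-up sample values: a word 144 pt from the left edge, 36 pt wide.
        double leftPt = 144.0;
        double widthPt = 36.0;
        double dpi = 300.0; // assumed rendering resolution

        Console.WriteLine(PointsToPixels(leftPt, dpi));  // 600
        Console.WriteLine(PointsToPixels(widthPt, dpi)); // 150
    }
}
```

The same conversion would apply to each edge of a bounding box. Whether the vertical axis also needs flipping depends on the coordinate origin the library reports, which this thread does not state.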

tuandq wrote Apr 6, 2014 at 4:03 PM

Thank you! I'll test it and report back to you soon!
Quoc Tuan.


tuandq wrote Apr 6, 2014 at 5:25 PM

Dear pwizzle,
The GetWordLocationsForPage method ran very well and returned everything I need. However, when I tested the GetPageText and ExtractText methods, their results were very strange and I couldn't make sense of them!

Please test them on my searchable PDF file!

Regards,
Quoc Tuan.

pwizzle wrote Apr 6, 2014 at 10:30 PM

I didn't realize those methods were broken. They've been fixed now and I've uploaded the changes to SVN. Thank you for letting me know.
    // Read a PDF from the desktop and write its extracted text to test.txt.
    string desktop = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
    string pdf = Path.Combine(desktop, "LA004.pdf");
    Clock.Pdf.PdfReader r = new Clock.Pdf.PdfReader(pdf);
    string t = r.ExtractText();
    File.WriteAllText(Path.Combine(desktop, "test.txt"), t);

pwizzle wrote Apr 7, 2014 at 12:10 AM

Added the ability to extract a region of interest from an image for OCR.
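    // Define the region of interest.
    BBox b = new BBox();
    b.Left = 2217;
    b.Top = 2400;
    b.Width = 312;
    b.Height = 129;
    // img holds the path of the source image file.
    var bmp = (Bitmap)Bitmap.FromFile(img);
    // Crop the bitmap to the region before running OCR on it.
    bmp = ImageProcessor.CropToRectangle(b, bmp);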
