Sunday, December 6, 2015

PDF to Images and Text

For a while now I have extracted text from financial statements and imported it into Excel for further accounting. The statements are PDF documents, and until recently it was relatively easy to use the Adobe Reader "Save as Text..." feature to save the text content. For my latest statement this didn't work. It appears the PDF documents no longer contain "text"; vector paths are used to render the text instead. I have no idea why they decided to change the format!

Anyway, to deal with this new format I had to change my approach for getting the data into Excel. With the help of the tutorial "Command-line OCR on a Mac", I was able to build a process that does the following:
  1. Use pdftk to burst the multi-page PDF into single-page files.
  2. Use inkscape to convert each page into a PNG file.
  3. Use Tesseract to OCR these image files into text.
  4. Extract and format the text files for import into Excel.
Note that at the time of writing there are some issues running pdftk on the latest version of OS X. Fortunately, a fix is available, as detailed in this post: PDFtk Server on OS X 10.11.
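As a rough illustration, steps 1 to 3 could be scripted along the following lines. This is only a sketch under several assumptions, not the exact commands used for this post: the file names, the `build_commands` and `run` helpers, the `pages` parameter, and the 300 dpi setting are all hypothetical, and the Inkscape flags shown are the 0.9x-era ones (Inkscape 1.x replaced `--export-png` with `--export-filename`). The function only builds the command lines, so they can be inspected before anything is executed.

```python
import subprocess
from pathlib import Path


def build_commands(pdf, workdir, pages, dpi=300):
    """Return the command lines for steps 1-3 without running them.

    `pages` is the number of pages in the statement (assumed known up front).
    """
    workdir = Path(workdir)

    # Step 1: pdftk "burst" writes one PDF per page, named by a printf-style pattern.
    commands = [
        ["pdftk", str(pdf), "burst", "output", str(workdir / "page_%03d.pdf")]
    ]

    for n in range(1, pages + 1):
        page = workdir / f"page_{n:03d}"

        # Step 2: rasterize the page to PNG (Inkscape 0.9x flags, assumed version).
        commands.append([
            "inkscape", "--without-gui",
            f"--export-png={page}.png",
            f"--export-dpi={dpi}",
            f"{page}.pdf",
        ])

        # Step 3: OCR the image; Tesseract appends ".txt" to the output base name.
        commands.append(["tesseract", f"{page}.png", str(page), "-l", "eng"])

    return commands


def run(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run(build_commands("statement.pdf", "out", pages=3))` would burst a three-page statement into an existing `out` directory and leave one `.txt` file per page there, ready for step 4. A higher dpi generally gives the OCR more to work with, at the cost of larger images.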