A while ago I wrote about how to extract text from PDF documents in Python using the PDFMiner library. However, in a recent project I had some trouble using PDFMiner to extract text, possibly because the documents I was working with were scanned PDFs. In this case the answer is to use OCR-based text extraction, and that’s exactly what the textract library is able to do by making use of the tesseract OCR algorithms.
Using textract is extremely straightforward:
import textract

pdffile = "myfile.pdf"
text = textract.process(pdffile, method='tesseract', language='eng')
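If you want the result as an ordinary string, a small wrapper around the call above can help. This is only a sketch: the function name ocr_pdf_text and its extract parameter are my own inventions for illustration, and it assumes that textract.process hands back UTF-8 encoded bytes.

```python
def ocr_pdf_text(pdf_path, extract=None, language="eng"):
    """Run OCR-based extraction on a PDF and return the text as a str.

    `extract` is a hypothetical hook that lets you swap in another
    callable (for example in tests); by default textract.process is used.
    """
    if extract is None:
        # Imported lazily so this module still loads if textract is missing.
        import textract
        extract = textract.process

    raw = extract(pdf_path, method="tesseract", language=language)

    # Assumption: the extractor returns UTF-8 bytes; pass str through untouched.
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    return raw


# Example usage (needs textract and a real file):
# text = ocr_pdf_text("myfile.pdf")
```

Importing textract inside the function means the import error only shows up when you actually try to extract something, which is handy while the installation is still half done.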
However… although using textract is easy, installing it is not.
Here are my notes for the steps I needed to go through to get textract on my laptop.
I am using a MacBook Pro running Mojave 10.14.5 and I’m using a clean Python virtual environment.
Step 1 Try to install textract using pip. Wait for the error message. If you don’t get one then good for you, otherwise move to Step 2.
Step 2 The error message is probably telling you that you don’t have swig installed. You can get it with Homebrew:
brew install swig
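Before retrying pip it can save a little time to confirm that swig is actually visible on your PATH. Here is a minimal sketch; the helper name missing_tools is made up for this post, and it only looks for executables, so it tells you nothing about versions.

```python
import shutil


def missing_tools(tools=("swig",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Example usage:
# missing = missing_tools()
# if missing:
#     print("Still missing:", ", ".join(missing))
```

You can pass in any other command-line tools your setup depends on, and an empty list back means everything was found.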
Step 3 Once again you try to pip install textract. The error now looks like this:
deps/sphinxbase/src/libsphinxad/ad_openal.c:43:10: fatal error: 'al.h' file not found
#include <al.h>
         ^~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit status 1
You need to change the header files in the pocketsphinx library. I found that the easiest way to do this is to install pocketsphinx from source. First clone the source code:
git clone --recursive https://github.com/bambocher/pocketsphinx-python/
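but before you install the library, edit deps/sphinxbase/src/libsphinxad/ad_openal.c (the file named in the error message) so that the al.h include on line 43 points to a location where the compiler can find the header. Then install the library by doing: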
cd pocketsphinx-python
python setup.py install
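The header edit in Step 3 has to happen before running setup.py, and it could be scripted rather than done by hand. The sketch below rests on an assumption of mine: that on macOS the OpenAL headers live under an OpenAL/ prefix, so al.h and alc.h become OpenAL/al.h and OpenAL/alc.h. Check where the headers sit on your own machine before trusting it. The function name patch_openal_includes is hypothetical.

```python
import re
from pathlib import Path


def patch_openal_includes(source_path, prefix="OpenAL"):
    """Rewrite `#include <al.h>` and `#include <alc.h>` to use `prefix/`.

    Assumption: the headers can be found as <OpenAL/al.h> and
    <OpenAL/alc.h>; pass a different prefix if yours live elsewhere.
    Returns True if the file was modified, False otherwise.
    """
    path = Path(source_path)
    text = path.read_text()

    # Only bare <al.h> / <alc.h> match, so running this twice is harmless.
    patched = re.sub(r"#include\s*<(alc?\.h)>", rf"#include <{prefix}/\1>", text)

    if patched != text:
        path.write_text(patched)
    return patched != text


# Example usage, run from the directory containing the clone:
# patch_openal_includes(
#     "pocketsphinx-python/deps/sphinxbase/src/libsphinxad/ad_openal.c"
# )
```

Because the pattern only matches the unprefixed includes, a second run leaves the file alone and returns False.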
Step 4 Now we’re ready to install textract. However, if we try to pip install it then it will try to fetch a different version of pocketsphinx and fail again.
To stop it doing that, grab the textract source tarball from here and untar it:
tar -xvzf textract-1.6.1.tar.gz
then go into the requirements directory, open the python file and change the pocketsphinx entry so that it matches the version you just installed from source,
then install textract:
cd textract-1.6.1
python setup.py install
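The requirements edit in Step 4 is a one-line change, but if you end up repeating it, something like the following sketch could do it for you before you run setup.py. The helper name repin_requirement is hypothetical, the path in the usage example is inferred from the tarball layout described above, and the version string is a placeholder that you should replace with whatever pip show pocketsphinx reports on your machine.

```python
import re
from pathlib import Path


def repin_requirement(requirements_path, package, version):
    """Pin `package` to `version` in a pip-style requirements file.

    Lines for other packages are left untouched.
    Returns True if the file was modified, False otherwise.
    """
    path = Path(requirements_path)
    updated = []
    changed = False

    for line in path.read_text().splitlines():
        # The package name is everything before the first specifier character.
        name = re.split(r"[=<>!~\s\[;]", line.strip(), maxsplit=1)[0]
        if name.lower() == package.lower():
            new_line = f"{package}=={version}"
            changed = changed or new_line != line
            line = new_line
        updated.append(line)

    if changed:
        path.write_text("\n".join(updated) + "\n")
    return changed


# Example usage (path and version are placeholders):
# repin_requirement("textract-1.6.1/requirements/python", "pocketsphinx", "X.Y.Z")
```

If the file already pins the version you pass in, nothing is written and the function returns False.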