A while ago I wrote about how to extract text from PDF documents in Python using the PDFMiner library. In a recent project, however, I had trouble getting PDFMiner to extract the text, possibly because the documents I was working with were scanned PDFs. In that case the answer is OCR-based text extraction, and that's exactly what the textract library does by making use of the tesseract OCR engine.
Using textract is extremely straightforward:
import textract

pdffile = "myfile.pdf"
text = textract.process(pdffile, method='tesseract', language='eng')
Et voilà.
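One detail worth knowing is that textract.process returns bytes rather than a str. Here is a minimal sketch of a wrapper that decodes the result, assuming the output is UTF-8. The names extract_text and process are made up for illustration and are not part of textract.

```python
def extract_text(pdf_path, process=None, encoding="utf-8"):
    """Run OCR on a PDF and return the result as a str (hypothetical helper)."""
    if process is None:
        # Imported lazily so this helper can be defined without textract installed.
        import textract
        process = textract.process
    # Same call as in the snippet above; textract hands back bytes.
    raw = process(pdf_path, method="tesseract", language="eng")
    # Decode to text, replacing any bytes that are not valid in the given encoding.
    return raw.decode(encoding, errors="replace")
```

Calling extract_text("myfile.pdf") should then give you text that can be searched or written to a file directly.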
However… although using textract is easy, installing it is not.
Installing Textract
Here are my notes on the steps I needed to go through to get textract installed on my laptop.
I am using a MacBook Pro running Mojave 10.14.5 and a clean Python virtual environment.
Step 1 Try to install textract using pip. Wait for the error message. If you don’t get one then good for you, otherwise move to Step 2.
Step 2 The error message is probably telling you that you don't have the swig library. Install it with Homebrew:
brew install swig
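If you would rather check for missing tools before running pip, a small pre-flight sketch like the following can help. The helper name missing_tools is hypothetical, and the list of tools to check (swig, plus the tesseract command that the OCR step relies on) is my assumption about what this setup needs.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Assumed list of command-line tools for this setup; adjust as needed.
print(missing_tools(["swig", "tesseract"]))
```

An empty list means every tool was found; anything printed is a candidate for brew install.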
Step 3 Once again you try to pip install textract. The error now looks like this:
deps/sphinxbase/src/libsphinxad/ad_openal.c:43:10: fatal error: 'al.h' file not found
#include <al.h>
         ^~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit status 1
You need to change the OpenAL header includes in the pocketsphinx library. I found that the easiest way to do this is to install pocketsphinx from source. First clone the source code:
git clone --recursive https://github.com/bambocher/pocketsphinx-python/
but before you install the library:
cd pocketsphinx-python/deps/sphinxbase/src/libsphinxad/
and in ad_openal.c change:
#include <al.h>
#include <alc.h>
to
#include <OpenAL/al.h>
#include <OpenAL/alc.h>
then return to the root of the pocketsphinx-python checkout and install the library:
cd ../../../..
python setup.py install
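If you have to repeat this on more than one machine, the header edit in Step 3 can be scripted instead of done by hand. This is only a sketch: patch_openal_includes is a hypothetical helper name, and it assumes the two include lines appear in the file exactly as shown above.

```python
from pathlib import Path

# The two rewrites from Step 3: bare OpenAL includes -> macOS framework paths.
REPLACEMENTS = {
    "#include <al.h>": "#include <OpenAL/al.h>",
    "#include <alc.h>": "#include <OpenAL/alc.h>",
}


def patch_openal_includes(source_file):
    """Rewrite the OpenAL includes in place; return True if the file changed."""
    path = Path(source_file)
    original = path.read_text()
    patched = original
    for old, new in REPLACEMENTS.items():
        patched = patched.replace(old, new)
    if patched != original:
        path.write_text(patched)
    return patched != original
```

Run from the directory where you cloned the repository, the call would be patch_openal_includes("pocketsphinx-python/deps/sphinxbase/src/libsphinxad/ad_openal.c"). Running it a second time leaves the file untouched and returns False.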
Step 4 Now we're ready to install textract. However, if we try to pip install it, pip will try to fetch a different version of pocketsphinx and fail again.
To stop it doing that, grab the textract source tarball (version 1.6.1 in my case) and untar it:
tar -xvzf textract-1.6.1.tar.gz
then go into the requirements directory:
cd textract-1.6.1/requirements/
open the file named python and change:
pocketsphinx==0.1.3
to
pocketsphinx==0.1.15
then go back up to the textract-1.6.1 directory and install textract:
cd ..
python setup.py install
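The version pin change in Step 4 can also be done with a few lines of Python rather than a text editor. This is a sketch under the assumption that the requirements file lists one package==version pin per line; repin is a hypothetical helper name.

```python
import re
from pathlib import Path


def repin(requirements_file, package, new_version):
    """Rewrite `package==<version>` lines in place; return how many pins were rewritten."""
    path = Path(requirements_file)
    # Match a whole line that pins exactly this package.
    pattern = re.compile(rf"^{re.escape(package)}==.*$", flags=re.MULTILINE)
    text, count = pattern.subn(f"{package}=={new_version}", path.read_text())
    if count:
        path.write_text(text)
    return count
```

For the change described above, the call would be repin("textract-1.6.1/requirements/python", "pocketsphinx", "0.1.15"). A return value of 0 means no matching pin was found and the file was left alone.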