A while ago I wrote about how to extract text from PDF documents in Python using the PDFMiner library. However, in a recent project I had some trouble using PDFMiner to extract text, possibly because the documents I was working with were scanned PDFs. Scanned pages are images rather than embedded text, so the answer is OCR-based text extraction, and that’s exactly what the textract library can do by calling the Tesseract OCR engine.

Using textract is extremely straightforward:

import textract

pdffile = "myfile.pdf"
text = textract.process(pdffile, method='tesseract', language='eng')

Et voilà.
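One detail worth knowing: as far as I can tell, textract.process returns bytes rather than a str, so you usually want to decode the result before working with it. Here is a minimal sketch, assuming UTF-8 output. The helper name extract_text and its process parameter are my own hypothetical additions, not part of textract; the parameter is only there so the function can be tried out without textract installed.

```python
def extract_text(pdf_path, process=None, encoding="utf-8"):
    """Run OCR-based extraction on pdf_path and return a str.

    `process` is a hypothetical hook for testing; by default the
    function uses textract.process.
    """
    if process is None:
        import textract  # imported lazily, since installing it is the hard part
        process = textract.process
    raw = process(pdf_path, method="tesseract", language="eng")
    # Assumption: the result is bytes. Decode defensively either way.
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw
```

With textract installed, extract_text("myfile.pdf") should give you a regular string that you can search or split.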

However… although using textract is easy, installing it is not.


Here are my notes on the steps I needed to go through to get textract installed on my laptop.

I am using a MacBook Pro running macOS Mojave 10.14.5 and a clean Python virtual environment.


Step 1: Try to install textract using pip. Wait for the error message. If you don’t get one then good for you; otherwise move to Step 2.


Step 2: The error message is probably telling you that swig is missing. Install it with Homebrew:

brew install swig
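If you would rather check for missing command-line tools before running pip, something like the following sketch works. The list of tools is my assumption (swig for the build, tesseract for the OCR itself), and missing_tools is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("swig", "tesseract")):
    """Return the names from `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All tools found")
```

Anything it reports as missing is a candidate for brew install before you try pip again.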


Step 3: Try to pip install textract again. The error now looks like this:

    deps/sphinxbase/src/libsphinxad/ad_openal.c:43:10: fatal error: 'al.h' file not found
    1 error generated.
    error: command '/usr/bin/clang' failed with exit status 1


The compiler can’t find the OpenAL headers, so you need to change the include paths in the pocketsphinx source. I found that the easiest way to do this is to install pocketsphinx from source. First clone the source code:

git clone --recursive

but before you install the library:

cd pocketsphinx-python/deps/sphinxbase/src/libsphinxad/

and in ad_openal.c change:

#include <al.h>
#include <alc.h>

to:

#include <OpenAL/al.h>
#include <OpenAL/alc.h>
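If you don’t fancy editing the file by hand, the same change can be scripted. This is only a sketch of the edit described above; patch_openal_includes and patch_file are hypothetical helper names, and the regular expression assumes the includes look like the two lines shown.

```python
import re
from pathlib import Path

# Matches '#include <al.h>' or '#include <alc.h>' with flexible whitespace.
_INCLUDE_RE = re.compile(r"^(\s*#\s*include\s*)<(alc?\.h)>", re.MULTILINE)


def patch_openal_includes(source):
    """Rewrite bare OpenAL includes to the OpenAL/ prefixed form."""
    return _INCLUDE_RE.sub(r"\1<OpenAL/\2>", source)


def patch_file(path):
    """Apply the include rewrite to a file in place."""
    p = Path(path)
    p.write_text(patch_openal_includes(p.read_text()))
```

Calling patch_file("ad_openal.c") from inside the libsphinxad directory would apply it. Lines that already use the OpenAL/ prefix don’t match the pattern, so running it twice is harmless.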

then go back to the top of the pocketsphinx-python directory and install the library by doing:

cd ../../../..
python setup.py install


Step 4: Now we’re ready to install textract. However, if we try to pip install it then it will try to fetch a different version of pocketsphinx and fail again.

To stop it doing that, grab the textract source tarball from here and untar it:

tar -xvzf textract-1.6.1.tar.gz

then go into the requirements directory:

cd textract-1.6.1/requirements/

open the python file and change:




then go back up to the textract-1.6.1 directory and install textract:

cd ..
python setup.py install
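To confirm that both packages ended up in the virtual environment, a quick check along these lines can help. It only tests whether the modules can be located, not whether OCR actually works, and is_importable is a hypothetical helper name.

```python
import importlib.util


def is_importable(module_name):
    """True if `module_name` can be located in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("pocketsphinx", "textract"):
        print(name, "OK" if is_importable(name) else "MISSING")
```

If both come back OK, the snippet at the top of this post should run.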

