Installing Textract

A while ago I wrote about how to extract text from PDF documents in Python using the PDFMiner library. However, in a recent project I had some trouble using PDFMiner to extract text, possibly because the documents I was working with were scanned PDFs. In that case the answer is OCR-based text extraction, and that’s exactly what the textract library does by calling the Tesseract OCR engine.

Using textract is extremely straightforward:

import textract

pdffile = "myfile.pdf"
text = textract.process(pdffile, method='tesseract', language='eng')

Et voilà.
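One thing worth knowing: as far as I can tell, textract.process hands back bytes rather than a str, so the result usually needs decoding before you can work with it. Here is a minimal sketch of a wrapper. The helper names to_text and extract_text are my own and not part of textract, and the UTF-8 default is an assumption about the output encoding.

```python
def to_text(raw, encoding="utf-8"):
    """Decode OCR output and collapse runs of whitespace.

    Assumes `raw` is a bytes object, which is what textract.process
    appears to return. Form feeds between pages count as whitespace.
    """
    return " ".join(raw.decode(encoding, errors="replace").split())


def extract_text(path, language="eng"):
    """Hypothetical convenience wrapper around textract.process."""
    # Imported lazily so to_text() can be used without textract installed.
    import textract

    raw = textract.process(path, method="tesseract", language=language)
    return to_text(raw)
```

With this in place, extract_text("myfile.pdf") should give you a plain string. Collapsing whitespace throws away line breaks, so drop that step if the layout matters to you.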

However… although using textract is easy, installing it is not.

 

Installing Textract

Here are my notes for the steps I needed to go through to get textract on my laptop.

I am using a MacBook Pro running Mojave 10.14.5 and I’m using a clean python virtual environment.

 

Step 1 Try to install textract using pip (pip install textract). Wait for the error message. If you don’t get one then good for you; otherwise move to Step 2.

 

Step 2 The error message is probably telling you that you don’t have swig. Install it with Homebrew:

brew install swig
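If you would rather check for missing command-line tools before running pip, something like the following sketch works. The function name missing_tools is my own, and including tesseract in the default list is an assumption on my part (the error in this step only mentions swig).

```python
import shutil


def missing_tools(tools=("swig", "tesseract")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found.")
```

This only tells you whether an executable is on your PATH; it says nothing about versions or header files.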

 

Step 3 Once again you try to pip install textract. The error now looks like this:

    deps/sphinxbase/src/libsphinxad/ad_openal.c:43:10: fatal error: 'al.h' file not found
    #include <al.h>
             ^~~~~~
    1 error generated.
    error: command '/usr/bin/clang' failed with exit status 1

 

The compiler can’t find the OpenAL headers, so you need to change the #include paths in the pocketsphinx source. I found that the easiest way to do this is to install pocketsphinx from source. First clone the source code:

git clone --recursive https://github.com/bambocher/pocketsphinx-python/

but before you install the library:

cd pocketsphinx-python/deps/sphinxbase/src/libsphinxad/

and in ad_openal.c change:

#include <al.h>
#include <alc.h>

to

#include <OpenAL/al.h>
#include <OpenAL/alc.h>

then install the library by doing:

cd ../../../..   # back to the pocketsphinx-python root
python setup.py install
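If you end up doing this more than once, the header edit described above can be scripted instead of done by hand. This is just a sketch: the function names are my own, and it assumes the two includes appear in ad_openal.c exactly as shown. Run patch_file before the install command.

```python
import re
from pathlib import Path

# Matches '#include <al.h>' or '#include <alc.h>' at the start of a line.
_OPENAL_INCLUDE = re.compile(r"^([ \t]*#include[ \t]*)<(al|alc)\.h>", re.MULTILINE)


def patch_openal_includes(source):
    """Rewrite bare OpenAL includes to the OpenAL/ framework path."""
    return _OPENAL_INCLUDE.sub(r"\1<OpenAL/\2.h>", source)


def patch_file(path):
    """Apply the rewrite to a file in place (path is supplied by the caller)."""
    path = Path(path)
    path.write_text(patch_openal_includes(path.read_text()))
```

For example, patch_file("pocketsphinx-python/deps/sphinxbase/src/libsphinxad/ad_openal.c"), run from the directory you cloned into. Running it twice is harmless because the already-patched lines no longer match.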

 

Step 4 Now we’re ready to install textract. However, if we try to pip install it then it will try to fetch a different version of pocketsphinx and fail again.

To stop it doing that, grab the textract source tarball (I used version 1.6.1) and untar it:

tar -xvzf textract-1.6.1.tar.gz

then go into the requirements directory:

cd textract-1.6.1/requirements/

open the file named python and change:

pocketsphinx==0.1.3

to

pocketsphinx==0.1.15

then install textract:

cd ..   # back to the textract-1.6.1 root
python setup.py install
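The version pin change can be scripted too, which is handy if you need to repeat the install in another virtual environment. Again this is only a sketch: repin is my own helper name, and it assumes the requirements file uses plain package==version lines. Run it before the install command.

```python
import re


def repin(requirements, package, version):
    """Replace the pinned version of `package` in requirements text."""
    pattern = re.compile(rf"^{re.escape(package)}==\S+", re.MULTILINE)
    return pattern.sub(f"{package}=={version}", requirements)


def repin_file(path, package, version):
    """Apply repin() to a requirements file in place."""
    with open(path) as handle:
        text = handle.read()
    with open(path, "w") as handle:
        handle.write(repin(text, package, version))
```

In my case that would be repin_file("textract-1.6.1/requirements/python", "pocketsphinx", "0.1.15"). Other lines in the file are left untouched.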
