Document Scraping with Python

Tired of reading all those documents everyone keeps sending you? Why not get your Jupyter Notebook to do it for you and condense the information?

I’m joking of course… but if, say, you did want to read PDF documents directly in Python, how would you do it? Recently I had a go at doing just that.

Read the Docs

I’m going to use the PDFMiner library to extract the text from pdf files. It’s pip installable:

pip install pdfminer

A description of the API can be found here, but basically it’s all encapsulated in this diagram. (If you don’t care about the API and just want to know how to use it you can skip this)


Say we have some PDF document (I’m using the first one linked on this page), how do we go about reading it?

Well, first off let’s see what the meta-data says about the document. We need to parse the document and create an object from it that is readable by the rest of the PDFMiner library. So we need to access the top two classes of the library from the diagram.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

We can then use these to extract the meta-data information. I’m going to create a general function to do this:

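def extract_pdf_info(filename):

    # open in binary mode; the with-block closes the file afterwards
    with open(filename, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        # doc.info is a list of meta-data dictionaries
        return doc.info
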

which I’ll call like this:

file_info = extract_pdf_info(filename)

print file_info

[{'Producer': 'Adobe PDF Library 11.0', 'Creator': 'Acrobat PDFMaker 11 for Word', 'Author': 'Department for Transport', 'Title': 'Social and behavioural questions associated with Automated Vehicles: A Literature Review', 'ModDate': 'D:20170124145712Z', 'Keywords': 'social and behavioural questions, automated vehicles, a literature review', 'CreationDate': 'D:20170124145427Z', 'Subject': 'Social and behavioural questions associated with Automated Vehicles: A Literature Review'}]
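The result is a list of dictionaries, so individual fields are easy to pick out. Here is a minimal sketch, assuming Python 3 and using three values copied from the output above. The parse_pdf_date helper is hypothetical (it is not part of PDFMiner) and assumes the date strings always follow the 'D:YYYYMMDDHHMMSS' layout seen here.

```python
from datetime import datetime

# sample values copied from the meta-data printed above
file_info = [{'Author': 'Department for Transport',
              'CreationDate': 'D:20170124145427Z',
              'ModDate': 'D:20170124145712Z'}]


def parse_pdf_date(value):
    # hypothetical helper, not part of PDFMiner; assumes the
    # 'D:YYYYMMDDHHMMSS' layout and ignores any timezone suffix
    digits = value[2:] if value.startswith('D:') else value
    return datetime.strptime(digits[:14], '%Y%m%d%H%M%S')


info = file_info[0]  # first (here, only) dictionary in the list
created = parse_pdf_date(info['CreationDate'])
modified = parse_pdf_date(info['ModDate'])
print(info['Author'], created.isoformat(), modified.isoformat())
```

Dates written in other layouts (for example with an explicit timezone offset) would need more careful handling than this.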

To go a bit deeper and extract the text from the pdf, we need to invoke the PDFPageInterpreter and PDFResourceManager classes, along with some other bits and pieces:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

I’m also going to use some more standard libraries:

import numpy as np
import string

We can then write a function to extract the text:

def convert_pdf2txt(filename):

    # step 1: initiate the Interpreter class as a text reader:
    rsrcmgr = PDFResourceManager()  # initiate the resource manager class
    retstr = StringIO()  # creates a string buffer
    codec = 'utf-8'  # data encoding
    laparams = LAParams()  # initiate the layout class

    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # step 2: extract the pages from the pdf file:
    fp = open(filename, 'rb')
    pages = PDFPage.get_pages(fp, check_extractable=True)

    # step 3: loop through all the pages running the interpreter:
    for page in pages:
        interpreter.process_page(page)

    # get all the output:
    text = retstr.getvalue()

    # tidy things up:
    fp.close()
    device.close()
    retstr.close()

    return text

Note the use of the StringIO() to create the output file-like class.
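If StringIO is new to you, this small sketch shows the idea on its own, with no PDF involved. It assumes Python 3, where StringIO lives in the io module rather than cStringIO, and the strings written to the buffer are made up for illustration.

```python
from io import StringIO  # Python 3 location; Python 2 has cStringIO

buffer = StringIO()         # an in-memory, file-like object
buffer.write("page one\n")  # anything that expects a file can write to it
buffer.write("page two\n")

text = buffer.getvalue()    # everything written so far, as one string
buffer.close()
print(text)
```

The converter writes each page into the buffer in much the same way, which is why a single getvalue() call at the end returns the text of the whole document.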

We can call the function like this:

file_contents = convert_pdf2txt(filename)


Assessing the Contents

This puts the whole document into one long string, called file_contents. I want to look at individual words in the document, so I’m going to split it up into words:

words = file_contents.split()
print "File contains: ",len(words)," words"

File contains: 41804 words

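At the moment our “words” are not all words. Called with no arguments, the split() command just breaks up a string into little strings wherever it finds a run of whitespace (spaces, tabs or newlines), so punctuation and numbers stay attached to whatever they were next to.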
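A quick way to see what split() leaves behind is to try it on a short string. This is a sketch with a made-up sample sentence, written for Python 3.

```python
# made-up sample text containing a tab, a newline, punctuation and numbers
sample = "Vehicles\tare coming.\nSee page 12 (section 3.1) now."

tokens = sample.split()  # no argument: split on any run of whitespace
print(tokens)
```

The tab and the newline are treated just like spaces, but tokens such as 'coming.' and '(section' keep their punctuation, which is why the cleaning steps are needed.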

Before we do any analysis of the word data we’ll need to clean it up a bit.

Cleaning the Data

First off, let’s get rid of any punctuation. There are two nifty pythonisms I’m using here: Firstly, string.punctuation provides the set of ASCII punctuation characters (so typographic characters such as curly quotes are not covered). You can find more information on it here. Secondly, the string translate method allows us to replace (or in this case remove) characters in a string that match a criterion.

# remove punctuation:
words[:] = [value.translate(None, string.punctuation) for value in words]
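The two-argument form of translate used above is specific to Python 2 byte strings. Under Python 3 the same idea needs a translation table built with str.maketrans, roughly like this sketch (the word list is made up for illustration).

```python
import string

# made-up tokens of the kind split() produces
words = ["coming.", "(section", "3.1)", "driver's", "now"]

# a table that maps every ASCII punctuation character to "delete"
remove_punct = str.maketrans("", "", string.punctuation)

words[:] = [value.translate(remove_punct) for value in words]
print(words)
```

Note that "3.1)" becomes "31", so numbers survive this step and are dealt with separately.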

Once we’ve got rid of punctuation we can then get rid of numbers. I’m just getting rid of them completely here, but if we wanted we could extract them into a separate data set.

I like this one, it’s a function in a function in a function in a function 🙂

# remove numbers:
words[:] = [value for value in words if not any(c.isdigit() for c in value)] 

In some of our “words” we’ll also probably find some special characters that don’t print properly. We need to make everything into ascii text.

# remove UTF-8 encoded special characters:
words[:] = [value.decode("ascii", errors="ignore").encode() for value in words]
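The line above relies on Python 2 byte strings. In Python 3 the words are already unicode strings, so the order flips: encode to ASCII while ignoring anything that doesn't fit, then decode back. A sketch with made-up words:

```python
# made-up words, two of which contain non-ASCII characters
words = ["caf\u00e9", "na\u00efve", "plain"]

# encode to ASCII, silently dropping characters that cannot be
# represented, then decode back to an ordinary string
words[:] = [value.encode("ascii", errors="ignore").decode("ascii")
            for value in words]
print(words)
```

This drops the offending characters rather than replacing them with a plain equivalent, so "café" becomes "caf" and not "cafe".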

We’ve now got rid of a lot of the guff in our dataset. On to some more specific considerations.

For (e.g.) a simple word frequency analysis we probably don’t want to include words like “the”, “and”, “in”, “on” etc. [Although if we’re doing something more involved, like an association analysis, we might want to leave them in.]

# remove small words (three characters or fewer); this is a crude
# stand-in for a proper list of common words:
words[:] = [value for value in words if len(value)>3]
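The length filter is blunt: it also throws away short but meaningful words. An alternative is to filter against an explicit list of common words. In this sketch (Python 3) the STOP_WORDS set is a hypothetical, far from complete example and the word list is made up.

```python
# hypothetical, deliberately tiny list of common words
STOP_WORDS = {"the", "and", "in", "on", "of", "to", "a", "is"}

# made-up sample words
words = ["The", "car", "is", "on", "the", "road", "and", "bus", "lane"]

# compare in lower case so that "The" and "the" are treated alike
words[:] = [value for value in words if value.lower() not in STOP_WORDS]
print(words)
```

Here "car" and "bus" are kept, whereas the length-based filter would have removed them.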

Our dataset of words should now be pretty clean. If we want to know the frequency of each word, we can use the numpy.unique function.

unique, counts = np.unique(words, return_counts=True)

This returns an array of the unique elements in an array and, if we set return_counts=True, an array of the frequency of occurrences of each element: counts.
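On a tiny, made-up list the two returned arrays look like this (a sketch, assuming Python 3 and a reasonably recent NumPy):

```python
import numpy as np

# made-up sample words
words = ["motor", "zone", "motor", "alone", "motor", "zone"]

unique, counts = np.unique(words, return_counts=True)

# the two arrays line up element by element
for word, count in zip(unique, counts):
    print(word, count)
```

The words come back in sorted order, and each entry in counts belongs to the word at the same position in unique.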

These arrays will be ranked in alphabetic order, so we’ll need to sort them differently if we want a dataset ranked in terms of the frequency. If we just blindly combined (unique,counts) into a single array and ran the numpy.sort function on it to rank by the counts column it wouldn’t work.

The reason it wouldn’t work (and when I say wouldn’t work, I don’t mean it would fail to run) is that the counts data would be converted from integer to string and the column would then be sorted lexicographically. For example, if we had the array ['1','2','12','23'], it would be sorted to give ['1','12','2','23']. We need our array to be sorted numerically, so we need to make sure that the counts array stays as an integer.
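The difference is easy to reproduce with the example array from the paragraph above. A short sketch, assuming Python 3:

```python
import numpy as np

as_strings = np.array(['1', '2', '12', '23'])
as_integers = as_strings.astype(int)  # same values, numeric dtype

sorted_strings = np.sort(as_strings).tolist()    # character by character
sorted_integers = np.sort(as_integers).tolist()  # by numeric value
print(sorted_strings)
print(sorted_integers)
```

The string version puts '12' before '2' because it compares the first character of each entry first, while the integer version gives the order we actually want.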

# we need to make sure that "counts" stays an integer
# in the full data array (not a string), otherwise the
# sorting will work lexicographically rather than numerically
data = np.empty((len(unique), 2), dtype=object)
data[:,0] = unique
data[:,1] = counts

print data.shape

(4099, 2)

We can then sort the data on the second column. Note that np.argsort sorts in ascending order, so the least frequent words are printed first.

sorted_data = data[np.argsort(data[:,1])]
for i in range(0,sorted_data.shape[0]):
    print sorted_data[i,:]

['zone' 1]
['alter' 1]
['motor' 1]
['motorbikes' 1]
['motorway' 1]
['alone' 1]
['almost' 1]
['movie' 1]
['multidisciplinary' 1]
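If all you want is a ranking with the most frequent words first, the standard library's collections.Counter is a different route to the same place. It is not what I used above, just an alternative, sketched here for Python 3 with a made-up word list.

```python
from collections import Counter

# made-up sample words
words = ["motor", "zone", "motor", "alone", "motor", "zone"]

# most_common() returns (word, count) pairs, highest count first
ranked = Counter(words).most_common()

for word, count in ranked:
    print(word, count)
```

Because the counts never leave their integer form, the lexicographic sorting problem doesn't come up at all.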

This seems to have worked, but if you look a bit more closely there are some elements that haven’t cleaned up so nicely. These are generally web-links (e.g. http://…) and document names (e.g. .pdf, .html etc).

We could improve our cleaning operation to remove these, or extract them into a separate dataset. [Note – we probably want to do this before removing the punctuation in order to retain the correct formatting.]

# extract weblinks:
weblinks = [value for value in words if value[0:4]=='http']

# remove weblinks from word data:
words[:] = [value for value in words if value[0:4]!='http']

and similarly for document names.
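Regular expressions make both cases a little more flexible. This is a sketch only, for Python 3: the two patterns, the list of file extensions and the sample tokens are all my own assumptions, and it expects to run before punctuation is stripped.

```python
import re

# assumed patterns: links start with http://, https:// or www.;
# document names end in one of a few common extensions
LINK = re.compile(r"^(https?://|www\.)", re.IGNORECASE)
DOC = re.compile(r"\.(pdf|html?|docx?)$", re.IGNORECASE)

# made-up tokens, as they might look before punctuation removal
words = ["see", "http://example.com/report", "annex.pdf",
         "motorway", "index.html"]

# extract weblinks and document names into their own lists
weblinks = [w for w in words if LINK.match(w)]
documents = [w for w in words if DOC.search(w) and not LINK.match(w)]

# keep everything else as ordinary word data
words[:] = [w for w in words if not LINK.match(w) and not DOC.search(w)]
print(weblinks, documents, words)
```

A document name followed directly by a full stop or comma would slip past the DOC pattern as written, so trailing punctuation may need trimming first.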

And so on and so forth. There’s a bunch of stuff we could do.
