The UK parliament has a bunch of different committees who are tasked with conducting inquiries and producing reports about a range of different topics. The individual inquiries normally have a fixed duration and they hear evidence from a range of individuals known as witnesses. The UK parliament website maintains a list of open inquiries that is updated on an ad hoc basis.
Rather than manually checking the webpage and making a spreadsheet of these inquiries by hand, this blog shows you how you can set up a midnight job (using
launchd) to run a Python script which scrapes information from the webpage and automatically updates a Google sheet instead.
Getting the Data
I’ve blogged about pulling data from web pages using Beautiful Soup and the Python requests library before, so I won’t repeat that here. If you do want to see what my code for scraping the open inquiries webpage looks like you can see it here.
Spreading the Data
Once you’ve got some data you need a spreadsheet to write it to. You can create a sheet directly from Python, but it’ also pretty quick to just do it yourself by hand. That’s what I did. I made a sheet with the name “Policy Spreadsheet” and I made five tabs: All, Commons, Lords, Joint, Public.
Next is the most complicated bit… getting the credentials.
I did see some blogs that said you could just login using your normal Google username and password with the function
gspread.login(username,password), however this seems to be deprecated – or at least it threw an error when I tried it. Which basically leaves OAuth2 using the Google API.
It’s free and anyone can sign up, it’s just perhaps not the most intuitive process.
If you already have a project started and you just want to add some more credentials, you can simply go to the Credentials tab and add a service account using Create Credentials.
To connect you need the following lines of code:
import gspread from oauth2client.service_account import ServiceAccountCredentials # use creds to create a client to interact with the Google Drive API scope = ['https://spreadsheets.google.com/feeds'] creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope) client = gspread.authorize(creds)
Once you’ve got the client_secret.json file you should look inside and copy the
client_email. Go to the google sheet you want to write to and make sure it’s shared with that email address.
I’m using the gspread library, which is a nice wrapper around the Google sheets API. It is pip installable:
pip install gspread
To access the sheet from Python using gspread it’s really straight forward:
sheet = client.open("Policy Spreadsheet")
You can also access all of the tabs and their worksheets using:
sheet1 = sheet.worksheet("All") sheet2 = sheet.worksheet("Commons") sheet3 = sheet.worksheet("Lords") sheet4 = sheet.worksheet("Joint") sheet5 = sheet.worksheet("Public")
I read another blog which said it was possible to navigate sheets using
sheet.sheet2 etc but that gave me an error, and looking at the gspread docs I’m pretty sure that the only one you can use is
sheet.sheet1, which is a short cut to the first sheet. The higher ones won’t work.
If you aren’t sure what the names of all your worksheets are you can use:
Then you want to write in some content. I’m going to add in all the data that I scraped from the parliament webpage. I’ll read the info out of each dict in the list and then write it into the sheet.
import datetime # find today's date to tag input data: now = datetime.datetime.now() indate = now.strftime("%Y-%m-%d") for inquiry in inquiries: # extract info from dict: itype = inquiry['type'] name = inquiry['inquiry name'] committee = inquiry['committee'] deadline = inquiry['deadline'] status = inquiry['status'] link = inquiry['link'] # compose the contents of the row you want to add: row = [name,itype,committee,"",deadline,status,link,indate,indate] # specify the index of the row (2 here because of header row): index = 2 # add into "All" sheet first: sheet1.add_rows(index) sheet1.insert_row(row,index) # then add into appropriate tab: if itype=='Commons': sheet2.add_rows(index) sheet2.insert_row(row,index) elif itype=='Lords': sheet3.add_rows(index) sheet3.insert_row(row,index) elif itype=='Joint': sheet4.add_rows(index) sheet4.insert_row(row,index) elif itype=='Public': sheet5.add_rows(index) sheet5.insert_row(row,index)
I found that using
sheet.add_rows() first just overwrote the previous contents of the row rather than adding in a new one.
index=2 because I have columns headings in the first row.
Updating the Data
I’m planning to run this code on a regular cadence to keep updating the sheet with information about new inquiries. One thing I want to keep track of is the deadline for evidence, which sometimes changes. I don’t want to have to re-write the whole sheet each time (you’ll quickly work out that it’s really slow to write things to Google sheets), so I just want to search and replace deadline info.
To do this I’m going to match the item using the name of the inquiry, which should be unique. I can find the index of the row using
sheet.find(), which returns a
gspread.Cell object with the properties
# find index of row: index = sheet.find(name).row # replace row contents: sheet.insert_row(row,index)
Ideally I’d also like to then sort the spreadsheet by the deadline column, i.e. put the most pressing items at the top, but there doesn’t seem to be a functionality to do this with gspread 😞
Automating the Updates
To start I’m going to make a little script for myself. The code itself is a Python script called
scrape_inquiries.py, but it imports a library of gspread functions I’ve bundled up together separately and it requires the
client_secret.json input file and makes the
inquiry_list.json output file. I want all that to happen in the same place so rather than running the code directly, I’ve made little script to change directory and then run the code. I’ve called this
import os,sys os.chdir("/path/to/script") os.system("python scrape_inquiries.py")
Then I’m going to make a launchd plist file, which runs the script every day at midnight:
I’ll copy that into my LaunchDaemons directory:
cp com.policy.inquirylist.plist /Library/LaunchDaemons
Load it up:
launchctl load -w /Library/LaunchDaemons/com.policy.inquirylist.plist
Check it’s there:
Then sit back and forget about it.
Then for the blog this.