The UK Parliament has a bunch of different committees that are tasked with conducting inquiries and producing reports on a range of topics. Individual inquiries normally have a fixed duration and hear evidence from a range of individuals known as witnesses. The UK Parliament website maintains a list of open inquiries that …
On The Buses II: Fuzzy String Matching
This is the second part of a series of posts about my pet data science project exploring the availability of transport across different areas of Manchester. For those playing catch-up, you might want to take a look at the first post in this series before continuing. In the first post I looked at how to …
On the Buses
I've started a little project to look at availability of public transport across Manchester. To keep things easy to follow I'm going to split up the different elements of the project between blog posts. First off, I want to know all the bus routes in Greater Manchester and I want to know which ones go …
Mining Twitter with Selenium
This is great for freaking people out. It looks like a ghost is typing in your web browser. Web crawling using HTML parsers to grab links and navigate to new pages with the requests library is all very well, but when you want to physically submit search terms, or login details, or click buttons (etc.) …
WebCrawling: YouTube Pagination in Python
A while ago I wrote a blog post about how to scrape videos from YouTube. One question I've been asked since is how to navigate between different pages of search results. So here's how. YouTube: the pre-amble looks exactly the same. Pagination: then we need to find the piece of HTML that corresponds to …
Newtonian Funding
Research funding in the UK is very transparent. In fact, you can access details of every RCUK (Research Councils UK) funded project through their online gateway. Like a lot of UK government data, the content is available under the Open Government Licence v3.0, except where otherwise stated. The Newton Fund is part of the UK's …
Web Scraping YouTube Videos in Python
Web crawling and web scraping are two sides of the same coin. Web scraping is simply extracting information from the internet in an automated fashion. Web crawling is about indexing information on webpages and - normally - using it to access other webpages where the thing you actually want to scrape is located. YouTube is …
Night at the Museum II: OCR in Python
Wherein I fail to make Python read medieval Chinese calligraphy correctly, but I learn a lot about Chinese calligraphy and optical character recognition. The Plan: It seems sensible that if we can read Chinese in our Jupyter notebook, we should try to translate the writing that's actually on the National Palace Museum exhibits and not …
Night at the Museum: Translation in Python
I love Taipei. I also love Open Data. So I was very happy to read that the National Palace Museum in Taipei had an open data project. According to the article, the museum has put images and metadata for 70,000 items online. So what do you get if you download the information on a particular …
Document Scraping with Python
Tired of reading all those documents everyone keeps sending you? Why not get your Jupyter Notebook to do it for you and condense the information? I'm joking of course... but if, say, you did want to read PDF documents directly in Python, how would you do it? Recently I had a go at doing just …