This is the second part of a series of posts about my pet data science project exploring the availability of transport across different areas of Manchester. For those playing catch-up, you might want to take a look at the first post in this series before continuing.

In the first post I looked at how to find out where all the bus routes in Manchester go. In this post I’m going to look at how often they go there.
This all feeds into my objective of determining the availability of buses across Manchester. Ultimately I want to define availability as the average number of buses per hour in a day, i.e. a bus stop with one bus every 20 minutes would have the same availability as a bus stop with three buses once an hour.
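To make that definition concrete, here's a minimal sketch of the averaging, using made-up departure times over a three-hour window (these aren't real TfGM times):

# hypothetical departure times at two stops over a three-hour window
stop_a = ['07:00', '07:20', '07:40', '08:00', '08:20', '08:40',
          '09:00', '09:20', '09:40']   # one bus every 20 minutes
stop_b = ['07:00', '07:00', '07:00', '08:00', '08:00', '08:00',
          '09:00', '09:00', '09:00']   # three buses together, once an hour

hours = 3.0
print(len(stop_a) / hours)  # 3.0 buses per hour
print(len(stop_b) / hours)  # 3.0 buses per hour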
Code-wise, there are two key parts to this post:
- How to navigate multi-level HTML using Selenium;
- How to deal with inconsistent labelling using fuzzy string matching.
Web-crawling multi-level HTML with Selenium
All of the timetable information for bus routes in Greater Manchester is provided by Transport for Greater Manchester (TfGM) on their website. You can download a PDF timetable for each route from the home page, and I did think about trying to scrape those, but… they don’t list all the stops, and the stops have different names from the ones labelled with longitudes and latitudes.
The alternative is to use the travel planning pages of the TfGM website, which render an HTML timetable for each route that includes all the stops. However, this approach is not without its problems: (1) we need to submit a web form to enter the route number we want; (2) there are multiple routes with the same number; (3) the whole thing uses a web page which doesn’t render the whole document object model (DOM) at once.
It’s the last one which really had me stumped for a while.
OK. I chose to use the Selenium library to navigate these pages [my intro to using Selenium can be found here] and the first steps are pretty straightforward. To start with we can define functions to start and stop a Selenium web-browser driver:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

def init_driver():
    driver = webdriver.Chrome()
    driver.wait = WebDriverWait(driver, 5)
    return driver

def close_driver(driver):
    driver.close()
    return
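(Just as a usage sketch, not part of the pipeline below: wrapping the scraping steps in try/finally means the browser window always gets closed, even if something fails part-way through.)

driver = init_driver()
try:
    # scraping steps go here
    pass
finally:
    close_driver(driver)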
Once our driver is initiated, we can then navigate to the TfGM timetables webpage, which looks like this:

It’s easy to identify the search box elements in the HTML using “inspect element” in the browser. We only need to fill in the top box, which has id='busServiceSearch'. We enter the bus route number we’re looking for and click the search button (class='btn').
def enter_bus_number(driver, number):
    driver.get("https://my.tfgm.com/#/timetables/")
    search_field = driver.find_element_by_id("busServiceSearch")
    search_field.send_keys(number)
    driver.implicitly_wait(1)
    driver.find_element_by_class_name("btn").click()
    return
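(A quick aside: this post uses the older find_element_by_* helpers, which were removed in Selenium 4. If you’re following along on a newer release, a rough, untested sketch of the same function using the By locators would look like this — enter_bus_number_v4 is just my name for it:)

from selenium.webdriver.common.by import By

def enter_bus_number_v4(driver, number):
    # same steps as above, written in the Selenium 4 locator style
    driver.get("https://my.tfgm.com/#/timetables/")
    search_field = driver.find_element(By.ID, "busServiceSearch")
    search_field.send_keys(number)
    driver.implicitly_wait(1)
    driver.find_element(By.CLASS_NAME, "btn").click()
    return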
So far, so good. Once we’ve clicked the search button, the web page we see in the browser will display a list of possible bus routes (search results) that have the number we entered. For example, if we had entered “1” we would see:
Even though it looks completely different, this is not a new web page: it has exactly the same URL as the previous one with the search boxes, and if we downloaded the HTML (either using the browser or using driver.page_source in Selenium) we wouldn’t see any elements corresponding to the search results.
However, if we use “inspect element” in the browser to see the HTML for each search result, it will look like this:
<li class="ng-scope" ng-repeat="bus in bus.routes"
    ng-keydown="kbClickHandler($event);" ng-click="selectTimetable(bus.uid)"
    aria-label="1. BLACKBURN - BOLTON. Transdev Lancashire United." tabindex="0">
  <span class="timetable-code ng-binding"> 1 </span>
  BLACKBURN - BOLTON
  <span class="timetable-operator ng-binding" ng-show="bus.operatorLink === undefined">
    Transdev Lancashire United
  </span>
  <span class="timetable-operator ng-hide" ng-show="bus.operatorLink !== undefined">
    <a href="" target="_blank" class="ng-binding"> Transdev Lancashire United </a>
  </span>
</li>
But if we ran:
driver.find_element_by_class_name("ng-scope")
in Python, it wouldn’t find anything.
Searching through layers with Selenium
The reason is that this is a complex webpage that runs scripts and has multi-level HTML source. I’m probably going to get the terminology wrong here, but as I understand things, Selenium only natively sees the top layer of HTML.
It is possible to see the whole thing, but you have to get Selenium to execute a JavaScript snippet, using something like this:
fullhtml = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
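As a quick sanity check (just a sketch, assuming the search results have already rendered when you run it), you can confirm that this hidden layer really does contain the results by searching the returned string:

# the ng-repeat attribute from the search results should now be in the full HTML:
print('ng-repeat="bus in bus.routes"' in fullhtml)  # True once the results have rendered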
To search for HTML element attributes across all levels of HTML, you need to use the Selenium find_element command, but with a slightly different syntax:
driver.find_elements_by_xpath('//*[@ng-repeat="bus in bus.routes"]')
(I’ve just used the ng-repeat attribute here because the class ng-scope is not unique to the search result items.)
Putting it into a function looks like this:
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from fuzzywuzzy import process  # fuzzy string matching - explained below

def select_route(driver, inroute):
    # set a wait time for the driver [10 sec here]:
    wait = WebDriverWait(driver, 10)
    try:
        # keep checking to see if the page has loaded yet:
        wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@ng-repeat="bus in bus.routes"]')))
        # when it's loaded, extract the search result elements:
        routes = driver.find_elements_by_xpath('//*[@ng-repeat="bus in bus.routes"]')
        # extract the description of the route from each search result:
        endpoints = []
        for route in routes:
            endpoints.append(route.text.split('\n')[1])
        # do fuzzy string matching:
        endpoint = process.extractOne(inroute, endpoints)[0]
        # extract matching element and click it:
        for route in routes:
            if route.text.split('\n')[1] == endpoint:
                route.click()
                time.sleep(5)  # wait for the timetable page to render
                break
    except TimeoutException:
        print('timeout')
    return
Fuzzy String Matching
You can see from the HTML extract above that there are three text components associated with each search result, e.g.
1 BLACKBURN - BOLTON Transdev Lancashire United
I’ve picked out the one corresponding to the route endpoints using route.text.split('\n')[1]. I then need to match it to my own description of the route. That’s where the second part of this post comes in.
The name of each bus route contains a number and an endpoint. For example, the first bus route listed on TfGM is ‘1-blackburn’, the second is ‘1-bolton’ and the third is ‘1-piccadilly’. ‘1-blackburn’ and ‘1-bolton’ are the same bus going in opposite directions; ‘1-piccadilly’ is a circular route.
To link the routes I took from the TfGM bus routes web-page to the info on the TfGM timetables web-page, I need to match those route names to these endpoints. They’re not exactly the same, so I need a fuzzy string matching algorithm. I was going to write my own but, thanks to a pointer from this blog post, I discovered that (of course) there’s already a Python library to do it: the unfortunately named FuzzyWuzzy, which is pip installable:
pip install fuzzywuzzy
pip install python-Levenshtein
Using it is incredibly easy. To test an input string (teststring) against a list of options (listofstrings) and return the most likely match:

from fuzzywuzzy import process

endpoint = process.extractOne(teststring, listofstrings)[0]
…and that’s it. What the function actually does is to measure the Levenshtein distance between the teststring and each option in the listofstrings.
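As a small worked example (the strings below are illustrative, not real TfGM data), it can also be useful to look at the scores extractOne is working with:

from fuzzywuzzy import process

endpoints = ['BLACKBURN - BOLTON', 'PICCADILLY CIRCULAR', 'MANCHESTER - STOCKPORT']

# extractOne returns a (match, score) tuple, hence the [0] above to pick out the string
print(process.extractOne('blackburn', endpoints))
# e.g. ('BLACKBURN - BOLTON', 90) - the exact score depends on the scorer

# process.extract returns every candidate with its score, best first
print(process.extract('blackburn', endpoints))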
That final click should have brought us to the point where we can see the timetable information. To extract it we can just employ the same approach again:
def get_timetable_info(driver):
    wait = WebDriverWait(driver, 10)
    try:
        # keep checking to see if the timetable has loaded yet:
        wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@ng-repeat="stop in timetable.current.stops"]')))
        # extract stop timetable elements from page:
        stops = driver.find_elements_by_xpath('//*[@ng-repeat="stop in timetable.current.stops"]')
        allstops = []
        for stop in stops:
            # get name of stop:
            stopname = stop.find_element_by_class_name("timetable-stop").get_attribute("title")
            # get actual times bus stops at stop:
            times = stop.find_elements_by_class_name("timetable-time")
            stoptimes = [time.text for time in times]
            # keep only the non-empty times (the length of this list gives
            # the number of times the bus stops here in one day):
            ntime = [time for time in stoptimes if time]
            # put info into a dict:
            stopinfo = {}
            stopinfo['stop name'] = stopname
            stopinfo['daily freq'] = ntime
            stopinfo['stop times'] = stoptimes
            # add the dict into a list:
            allstops.append(stopinfo)
    except TimeoutException:
        print('timeout')
        allstops = []
    return allstops
Putting it all together
With these functions defined we can simply call them one at a time to extract the timetable data.
if __name__ == "__main__":

    driver = init_driver()

    route = '1-blackburn'
    bus = route.split('-')[0]     # '1'
    origin = route.split('-')[1]  # 'blackburn'

    enter_bus_number(driver, bus)
    select_route(driver, origin)
    allstops = get_timetable_info(driver)

    close_driver(driver)
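As a rough sketch of what I’ll do with the output (this isn’t part of the scraping itself), the allstops list boils down to a buses-per-day count for each stop, which is what the availability measure will be built from:

# a simple buses-per-day summary for each stop on the route:
for stopinfo in allstops:
    print('%s: %d buses per day' % (stopinfo['stop name'], len(stopinfo['daily freq'])))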
And that’s it for now.