Research funding in the UK is very transparent. In fact you can access details of every RCUK (Research Councils UK) funded project through their online gateway. Like a lot of UK government data the content is available under the Open Government Licence v3.0, except where otherwise stated.
The Newton Fund is part of the UK’s official development assistance (ODA) and “uses science and technology partnerships to promote economic development and social welfare” in countries around the world.
Most – although not all – of the Newton Fund projects are run by research teams in universities and other academic institutes around the UK. I’m involved in a couple but I realised recently that I don’t know much about other Newton projects aside from my own. So I went and had a look on the RCUK webpage to find out what else is going on – and where it’s going on. There are a lot of different projects and so I decided to make a Google Maps heat map of where Newton Fund projects are led from around the UK to help me visualise the data.
Getting the Data
I used the RCUK gateway to search for all projects funded through the UK Newton Fund.
I downloaded the results of my search as a CSV file. To read it in Python I used the csv library, which is part of the Python standard library, so there is nothing extra to install.
First off let’s load some libraries:
import csv
import numpy as np
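Then we can use the csv library to load the spreadsheet we downloaded from the RCUK gateway. The 'rU' mode opens the file for reading with universal newlines, which you sometimes need when you’re reading a CSV file on a different operating system from the one that was used to write it, because the two systems mark line endings differently. (This is Python 2; in Python 3 universal newlines are the default and the 'U' flag has since been removed.)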
csvfile = open('classificationprojectssearch-1508233894806.csv', 'rU')
The first row is the column headings, so let’s skip that:
next(csvfile, None)
'"\x95\xc8\xc0""FundingOrgName""",ProjectReference,LeadROName,Department, ProjectCategory,PISurname,PIFirstName,PIOtherNames,PI ORCID iD,StudentSurname,StudentFirstName,StudentOtherNames,Student ORCID iD,Title,StartDate,EndDate,AwardPounds,ExpenditurePounds,Region,Status,Classifications 1,Classifications 2,Classifications 3,Classifications 4,Classifications 5,Other Classifications,GTRProjectUrl,ProjectId,FundingOrgId,LeadROId, PIId\n'
Then we can set up a reader to read the rest of the data:
reader = csv.reader(csvfile)
The lead institute for each project is in the third column along. I’m going to read all of them out into a numpy array:
city = []
for row in reader:
    city.append(str(row[2]))
city = np.array(city)
Ok, data obtained.
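For anyone following along in Python 3, here is a minimal sketch of the same step. It is not the code used for this post: it assumes a clean header row (the real download has some stray bytes in front of FundingOrgName), and the function name lead_institutes and the two-row sample are made up for illustration. It picks the column out by its LeadROName heading rather than by position.

```python
import csv
import io


def lead_institutes(csv_text):
    """Return the LeadROName value from every data row of a CSV string."""
    # DictReader consumes the header row itself, so no manual next() is needed.
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [row["LeadROName"] for row in reader]


# Hypothetical two-project sample using three of the gateway's column names.
sample = (
    "FundingOrgName,ProjectReference,LeadROName\r\n"
    "MRC,REF-1,University of Leeds\r\n"
    'NERC,REF-2,"Queen Mary, University of London"\r\n'
)
institutes = lead_institutes(sample)
print(institutes)
```

With a real file you would pass an open file object to csv.DictReader instead, opened with newline="" so that the csv module handles the line endings itself.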
Mapping the Data
I’m going to use the name of each lead institute to identify where in the country it is. To do that I’ll use the geolocator tool from the geopy library.
To start with I’ll import the library and the Nominatim class:
from geopy import geocoders
from geopy.geocoders import Nominatim
Then let’s instantiate the Nominatim class:
geolocator = Nominatim()
My plan is to extract the longitude and latitude of each institute, so I’ll start by declaring the empty lists I’ll be filling with those values. I also want to keep a list of institutes that Nominatim() can’t find, so I’ll initialise a list for that too.
heat_lats = []
heat_lons = []
notaplace = []
Now I’m just going to loop through the institutes and ask geolocator to find them. I’ve added in a couple of if statements to help Nominatim() along. For example, one of the places that it couldn’t find was “Queen Mary, University of London”, so I’ve put a tag in to catch unknown institutes that have London in the name. Two other offenders were the “National Inst of Agricultural Botany”, which Google tells me is in Cambridge, and SOAS.
There are a few other institutes that are not so easy to pin down with Nominatim(). If I wanted to write a more complex script they could probably be accommodated, but in the interests of simplicity I’m just leaving them out here. Note: the Google Maps GoogleV3() geolocator doesn’t have the same problem (see below).
for place in city:
    location = geolocator.geocode(place)
    # assume we have a usable location unless every fallback fails
    aplace = True
    if location is None:
        if "London" in place:
            location = geolocator.geocode('London UK')
        elif "National Inst of Agricultural Botany" in place:
            location = geolocator.geocode('Cambridge UK')
        elif "School of Oriental & African Studies" in place:
            location = geolocator.geocode('London UK')
        else:
            print(place + " is not a place")
            notaplace.append(place)
            aplace = False
    if aplace:
        # extract the latitude from the location object:
        heat_lats.append(location.latitude)
        # extract the longitude from the location object:
        heat_lons.append(location.longitude)
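Lots of projects share the same lead institute, so one way to ease the pressure on the geocoder would be to look each distinct name up only once. The sketch below is an assumption about how that could be done, not what I ran for this post: geocode_unique is a hypothetical helper, fake_geocode stands in for geolocator.geocode so that it runs offline, and the coordinates in it are only illustrative.

```python
def geocode_unique(places, geocode):
    """Geocode each distinct place name once and reuse the answer."""
    cache = {}
    results = []
    for place in places:
        # Failed lookups (None) are cached too, so they are not retried.
        if place not in cache:
            cache[place] = geocode(place)
        results.append(cache[place])
    return results


# Stand-in for geolocator.geocode; it records every call it receives.
calls = []


def fake_geocode(place):
    calls.append(place)
    known = {"University of Leeds": (53.8, -1.55)}  # illustrative values only
    return known.get(place)


coords = geocode_unique(
    ["University of Leeds", "SAHFOS", "University of Leeds"], fake_geocode
)
print(coords)
print(len(calls))
```

Three projects go in but only two lookups are made, because the repeated institute is served from the cache.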
I ran into two issues with geopy. The first was that the request quota is really low: I couldn’t find an official value on the web, but using Nominatim() it seems to be 1000/day. I fixed this by switching to the Google Maps GoogleV3() locator:
geolocator = geocoders.GoogleV3(api_key=apikey)
To use this method you need to get a Google API key from here. You must have Google Maps Geocoding API enabled in the API library for it to work. Using Google maps will increase your quota to 2500/day.
The second issue I had was that Geocoder kept timing out. To prevent this:
geolocator = geocoders.GoogleV3(api_key=apikey, timeout=10)
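A longer timeout helps, but a lookup can still fail now and then. A simple retry wrapper is one way to cope with that. This is only a sketch under my own assumptions: geocode_with_retry and its parameters are hypothetical names, and the flaky function simulates a geocoder that times out twice before answering. With geopy you would probably pass retry_on=(GeocoderTimedOut,), importing that exception from geopy.exc.

```python
import time


def geocode_with_retry(geocode, place, attempts=3, wait_seconds=1.0,
                       retry_on=(Exception,)):
    """Call geocode(place), retrying when one of retry_on is raised."""
    for attempt in range(attempts):
        try:
            return geocode(place)
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            time.sleep(wait_seconds)


# Simulated geocoder: fails on the first two calls, succeeds on the third.
attempts_seen = []


def flaky(place):
    attempts_seen.append(place)
    if len(attempts_seen) < 3:
        raise TimeoutError("simulated timeout")
    return (51.5, -0.1)  # illustrative coordinates only


result = geocode_with_retry(flaky, "London UK", attempts=3,
                            wait_seconds=0, retry_on=(TimeoutError,))
print(result)
```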
Making a Heat Map
I’m going to make a heat map of all the institute locations, but first I need to specify a central (lat,lon) for the map itself:
uklat = 55.3781
uklon = -3.4360
I’m going to use the gmplot library to make the heat maps. So let’s import that:
import gmplot
…and I’ll initiate my map using the central (lat,lon) pair I specified and a zoom value (I picked this using trial and error).
gmap = gmplot.GoogleMapPlotter(uklat, uklon, 6)
Then making the heat map is really simple: we use the built-in heatmap() function from gmplot and pass it our lists of latitudes and longitudes:
gmap.heatmap(heat_lats, heat_lons)
To draw the map we need to invoke the draw() function from gmplot. This will write out an html file of our heat map. You can then load it using whichever browser you prefer.
gmap.draw("newtonmap.html")
The Newton Fund project map looks like this:
Left: Nominatim(); Right: GoogleV3()
The two maps are subtly different because of the different geolocators. There are a few research institutes that aren’t easily findable by Nominatim() since they have multiple locations and/or are outside the UK:
Wits Health Consortium is not a place
MRC/UVRI Uganda Research Unit on AIDS is not a place
NERC Centre for Ecology and Hydrology is not a place
NERC British Geological Survey is not a place
National Inst for Communicable Diseases is not a place
NERC British Antarctic Survey is not a place
Overseas Development Inst ODI (Internat) is not a place
H R Wallingford Ltd is not a place
Scottish Universities Environmental Research Centre (SUERC) is not a place
Sefako Makgatho Health Sciences Universi is not a place
SAHFOS is not a place
The Aurum Institute is not a place
These 12 institutes are not included in the Nominatim() heat map.
The GoogleV3() geolocator finds everything. These ones aren’t on the map because they’re in other countries:
Wits Health Consortium (South Africa)
MRC/UVRI Uganda Research Unit on AIDS (Uganda – obviously)
National Inst for Communicable Diseases (South Africa)
Sefako Makgatho Health Sciences Universi (South Africa)
The Aurum Institute (South Africa)
These ones are identified as being in the UK, and GoogleV3() puts them here:
NERC Centre for Ecology and Hydrology (Lancaster)
Overseas Development Inst ODI (London)
SAHFOS (Plymouth)
NERC British Geological Survey (Nottingham)
NERC British Antarctic Survey (Cambridge)
H R Wallingford Ltd (Wallingford)
Scottish Universities Environmental Research Centre (Glasgow)
At first glance the heat map looks very heavily concentrated towards the South East. From the numbers (assuming the locations as determined by GoogleV3()), 84 of the 469 projects (~18%) are in London, which equates to ~15% of the overall funding.
This might look like bias in the funding mechanism, but we need to look at the prior information. For example, if we take the overall number of postgraduate students as a proxy for the size of the research community in a particular place then London is home to ~20% of the UK research community.
I’ve based this statement on the information about UK student numbers on Wikipedia, which lists 168 universities in the UK along with their respective undergraduate and postgraduate student numbers. My simple estimate is not perfect, because I’ve included postgraduates across all fields – not only those appropriate for the Newton Fund – but then again I’ve done that across the whole of the UK. This means that I’m assuming that the ratio of Newton Fund eligible research groups to non-Newton Fund eligible research groups is constant across the country.
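The comparison above boils down to a couple of lines of arithmetic. This sketch uses only the numbers already quoted (84 of 469 projects, and the rough 20% postgraduate share); the variable names and the idea of quoting the result as a single ratio are mine.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


london_projects = 84
all_projects = 469
london_share = percent(london_projects, all_projects)

# Approximate London share of UK postgraduates, used as the baseline.
baseline_share = 20.0

# A ratio below 1 means fewer projects than the baseline would suggest.
ratio = london_share / baseline_share
print("London share of projects: %.1f%%" % london_share)
print("Relative to baseline: %.2f" % ratio)
```

A ratio just under 1 says that London leads slightly fewer Newton projects than its share of postgraduates would suggest, which is the opposite of what the heat map implies at first glance.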
Gender Balance
The data from the RCUK gateway doesn’t include the gender of the PI, but it does include their name – and importantly it includes their first name.
I used the gender.py script to query the genderize.io API and return the gender of each PI’s first name. genderize.io is great, but note that it has a rate limit of 1000 names/day.
I ran into problems with the limit of 1000 names/day, mainly because the interface was really glitchy and kept dropping its response, so I had to re-run requests, which was frustrating. In fact I couldn’t loop over every entry in the data table in a single for loop without an error, but splitting the data over 4 loops of 100 people each seemed to work.
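import time
from gender import getGenders

# name holds the PI first-name strings read from the CSV
genders = []
for person in name:
    # only send the first word, in case middle names are listed too
    gender = getGenders(person.split()[0])
    genders.append(gender[0][0])
    # pause between requests
    time.sleep(1)
# convert to an array so it can be indexed with np.where later
genders = np.array(genders)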
I’ve passed the string person.split()[0] because some PIs have their middle names listed as well. This causes a problem for genderize.io. For example, “Christopher” would return “male”, but “Christopher Mark” would return “None”.
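The two tricks described here, keeping only the first word of the name and splitting the list into smaller runs, can be sketched as a pair of small helpers. These are hypothetical functions of my own, not part of gender.py, and no API call is made. Unlike a bare person.split()[0], first_name also copes with an empty entry.

```python
def first_name(full_first_names):
    """Keep only the first word, e.g. 'Christopher Mark' -> 'Christopher'."""
    parts = full_first_names.split()
    # An empty or whitespace-only entry gives an empty string, not an error.
    return parts[0] if parts else ""


def batches(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


names = ["Christopher Mark", "Ada", "", "Mary Jane Ann"]
cleaned = [first_name(n) for n in names]
print(cleaned)
print(list(batches(cleaned, 3)))
```

Each batch could then be sent to the lookup in its own loop, so that a dropped response only means re-running one batch.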
This simple analysis indicates that ~27% of Newton Fund projects have female PIs. Only 30 out of the 469 projects came back with a ‘None’, i.e. unknown, gender.
Let’s break the results down by funding council:
- AHRC (Arts and Humanities Research Council)
- BBSRC (Biotechnology and Biological Sciences Research Council)
- EPSRC (Engineering and Physical Sciences Research Council)
- ESRC (Economic and Social Research Council)
- MRC (Medical Research Council)
- NERC (Natural Environment Research Council)
- STFC (Science and Technology Facilities Council)
Doing a calculation like this for each one:
print("AHRC")
# pick out the genders for AHRC projects only
ahrc_gender = genders[np.where(council == 'AHRC')]
# count how many of those came back as female
ahrc_girls = ahrc_gender[np.where(ahrc_gender == 'female')].shape[0]
print("%d %d (%.1f%%)" % (ahrc_gender.shape[0], ahrc_girls,
                          100. * float(ahrc_girls) / float(ahrc_gender.shape[0])))
etc.
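Rather than repeating that block seven times, the same tally could be done for every council in one pass. This is a sketch in plain Python rather than the numpy indexing used above; female_share_by_council is a hypothetical helper and the five-project sample is invented purely to show the shape of the output.

```python
def female_share_by_council(councils, genders):
    """Return {council: (total, female, percent_female)} for each council."""
    totals = {}
    females = {}
    for council, gender in zip(councils, genders):
        totals[council] = totals.get(council, 0) + 1
        # Anything other than 'female' (including None) is not counted here.
        if gender == "female":
            females[council] = females.get(council, 0) + 1
    summary = {}
    for council in sorted(totals):
        female = females.get(council, 0)
        summary[council] = (totals[council], female,
                            100.0 * female / totals[council])
    return summary


# Invented sample: one council label and one gender result per project.
sample_councils = ["AHRC", "AHRC", "STFC", "STFC", "STFC"]
sample_genders = ["female", "male", "male", None, "female"]
for council, (total, female, pct) in female_share_by_council(
        sample_councils, sample_genders).items():
    print("%s %d %d (%.1f%%)" % (council, total, female, pct))
```

Note that projects whose gender came back as None still count towards each council’s total, which matches the calculation above.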
For each research council I’ve listed the name, the total number of projects for that research council, the total number with female PIs (as determined by genderize.io) and the percentage of the total projects that number represents.
AHRC 21 12 (57.1%)
BBSRC 52 13 (25.0%)
EPSRC 45 9 (20.0%)
ESRC 77 35 (45.5%)
MRC 142 35 (24.6%)
NERC 117 20 (17.1%)
STFC 15 2 (13.3%)
Again, there are priors which will influence these: specifically, the gender ratio in the research areas covered by each council. I don’t have those numbers (yet), but a quick skim through the literature suggests that some of the numbers that initially appear worst may actually be representative. For example, the Women’s Engineering Society published a report which included HESA data showing that 20% of academic engineers were women. That number is pretty similar to the EPSRC gender balance for Newton projects above.