Mind the Gender PayGap

I have been wondering how the gender pay gap in the university sector compares with the average across other sectors, so I thought I’d look at the data.

Last update: 23 March 2018.

paygap_plot
Visualising the mean gender pay gap in the university sector (dark blue) compared with the non-university sector (light blue) for different regions of the UK.

The data I’m using come from the current UK returns on gender pay differences. Every company in the UK with more than 250 employees is required to submit their information to this survey and it can be accessed online here:

paygap_gov
https://gender-pay-gap.service.gov.uk/Viewing/search-results

By hitting the Download data tab it’s possible to download the entire dataset as a CSV file.

I wanted a quick way to visualise the difference in the mean pay gap between the university sector and the regional average across all non-university sectors for different areas of the country.

As I post this, the deadline for returns hasn’t passed yet [deadline: April 4th 2018], but there’s already quite a bit of data available and the media have been trying to make sense of it.

A little bit of work gave me the plot above. Here’s how I made it.

Getting Started

To start with we’ll need some standard libraries:

import csv
import numpy as np
import requests
import json

…and I’m planning to generate my final plot using the Plotly library, so I’ll import that as well as the tools for making map overlays.

Plotly works best with data stored in pandas format, so I’m also going to import the pandas library.

Both of these are available using pip.

import plotly.plotly as py  # plotly 4+ moved this module to the separate chart_studio package
import plotly.graph_objs as go
import pandas as pd

I’ve saved the CSV data as a file inventively named data.csv:

csvfile = open('data.csv', newline='')  # the 'rU' mode is gone in recent Python 3; newline='' leaves line endings to the csv module

and we can view the header data whilst simultaneously skipping over it:

next(csvfile, None)

paygap_header

Then I’ll set up a proper CSV data reader using the csv library that I imported earlier:

reader = csv.reader(csvfile)

Extracting the Data

Once the reader is in place we can start extracting data from the file. There’s a bunch of stuff in there but I’m just going to start with:

  • the name of each company
  • the address of each company
  • the mean pay difference for each company

It’s worth noting that a positive pay difference indicates men being paid more than women and vice versa for a negative pay difference.

address=[];diffmean=[];name=[]
for row in reader:
    name.append(str(row[0]))
    address.append(str(row[1]))
    diffmean.append(float(row[4]))

names=np.array(name)
addresses=np.array(address)
diffmean=np.array(diffmean)
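The column positions used above (0, 1 and 4) depend on the layout of the downloaded file. As a minimal sketch, the same extraction could be done by column name instead. The header names EmployerName, Address and DiffMeanHourlyPercent are my assumption about the download and worth checking against your own copy; read_pay_gap is a hypothetical helper and the sample row is made up.

```python
import csv
import io

def read_pay_gap(csv_text):
    """Return (names, addresses, mean pay differences) from CSV text."""
    names, addresses, diffmean = [], [], []
    # DictReader consumes the header row and keys every row by column name
    for row in csv.DictReader(io.StringIO(csv_text, newline='')):
        names.append(row['EmployerName'])
        addresses.append(row['Address'])
        diffmean.append(float(row['DiffMeanHourlyPercent']))
    return names, addresses, diffmean

# made-up sample in the assumed layout; the quoted address spans several lines
sample = ('EmployerName,Address,CompanyNumber,SicCodes,DiffMeanHourlyPercent\n'
          'Example University,"Oxford Road,\nManchester,\nM13 9PL",,85421,12.5\n')

names_demo, addresses_demo, diffmean_demo = read_pay_gap(sample)
print(names_demo, diffmean_demo)
```

Reading by name means the code keeps working if columns are added or reordered in a later release of the file.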

Regions of Interest

Using the address of each company I’m going to extract the last line which contains the postcode. Helpfully the data are standardised to ensure that this is always the case.

I’m also going to remove any newlines and white space from each postcode so that, for example, ‘\nM13 9PL’ becomes ‘M139PL’, and I’m going to make an array of postcodes that corresponds to my arrays of names, addresses and pay differences.

postcodes=[]
for address in addresses:
    tmp = address.split(',')[-1]  # extract last line of address
    tmp = tmp.strip('\n')     # get rid of leading newline
    tmp = tmp.replace(" ","") # get rid of white spaces
    postcodes.append(tmp)

postcodes=np.array(postcodes)
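The same clean-up can be wrapped in a small function, which makes it easy to try on a single address. This is only a sketch: extract_postcode is a hypothetical name, and it assumes (as above) that the postcode is always the last comma-separated part of the address.

```python
def extract_postcode(address):
    """Return the last comma-separated part of an address with all whitespace removed."""
    last_part = address.split(',')[-1]
    # str.split() with no argument splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs and newlines alike
    return ''.join(last_part.split())

print(extract_postcode('Oxford Road,\nManchester,\nM13 9PL'))  # prints M139PL
```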

I’ve extracted the postcodes so that I can find out where in the UK each company is located, using one of my favourite APIs: postcodes.io

def get_lonlat(postcode):

    # set requests url:
    base = "http://api.postcodes.io/postcodes/"
    query = postcode

    # get response as a dictionary:
    r = requests.get(base+query)
    page = json.loads(r.text)

    # check error status of response:
    if page['status']==200:
        # 200 means all is well, but it doesn't mean
        # that you'll actually get a postcode...
        if page['result'] is None:
            lon='-1'
            lat='-1'

        else:
            lon = page['result']['longitude']
            lat = page['result']['latitude']

    else:
        lon='-1'
        lat='-1'

    return lon,lat

I then loop through all the postcodes and find the longitude and latitude associated with each one. Sometimes the lookup returns no result, in which case get_lonlat falls back to a placeholder value of -1 for both coordinates. I noticed that several company addresses use deprecated postcodes.

lons=[];lats=[]
for postcode in postcodes:
    lon,lat = get_lonlat(postcode)
    lons.append(float(lon))
    lats.append(float(lat))

lons=np.array(lons)
lats=np.array(lats)

I’m also going to extract the region associated with each postcode, again using postcodes.io :

def get_region(postcode):

    # set requests url:
    base = "http://api.postcodes.io/postcodes/"
    query = postcode

    # get response as a dictionary:
    r = requests.get(base+query)
    page = json.loads(r.text)

    # check error status of response:
    if page['status']==200:
        # 200 means all is well, but it doesn't mean
        # that you'll actually get a postcode...
        if page['result'] is None:
            reg='None'

        else:
            reg = page['result']['region']
            # check for Scotland and Wales:
            if reg is None:
                if page['result']['country']=='Scotland': reg='Scotland'
                if page['result']['country']=='Wales': reg='Wales'

    else:
        reg='None'

    return reg

…and loop through all the addresses to make an array of regions.

regions=[]
for postcode in postcodes:
    regions.append(get_region(postcode))

regions=np.array(regions)
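Both helper functions query the same endpoint, so each postcode is fetched twice. A possible alternative, sketched below, is to fetch once and parse the coordinates and the region from the same response. The names parse_postcode_page and lookup_postcode are hypothetical. The sketch uses the standard library's urllib rather than requests purely to avoid a third-party dependency. Unlike get_region, it falls back to the country name for any country, not just Scotland and Wales, which is my assumption rather than the behaviour above.

```python
import json
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

def parse_postcode_page(page):
    """Turn a decoded postcodes.io response into (lon, lat, region)."""
    result = page.get('result') if page.get('status') == 200 else None
    if result is None:
        return -1.0, -1.0, 'None'

    lon, lat = result.get('longitude'), result.get('latitude')
    if lon is None or lat is None:
        lon, lat = -1.0, -1.0   # same placeholder the functions above use

    # region is empty outside England; this sketch falls back to the
    # country name for every country (an assumption, see the text)
    region = result.get('region') or result.get('country') or 'None'
    return float(lon), float(lat), region

_lookup_cache = {}

def lookup_postcode(postcode):
    """Fetch a postcode once and remember the parsed answer."""
    if postcode not in _lookup_cache:
        url = "https://api.postcodes.io/postcodes/" + quote(postcode)
        try:
            with urlopen(url) as response:
                page = json.load(response)
        except HTTPError as err:
            # urlopen raises HTTPError for error responses such as 404
            page = {'status': err.code}
        _lookup_cache[postcode] = parse_postcode_page(page)
    return _lookup_cache[postcode]
```

With something like this, a single loop over postcodes could fill lons, lats and regions together. The cache also avoids repeat requests when several companies share a postcode.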

From this list of regions associated with individual companies I can make a master list of the different regions:

region_list = np.unique(regions)
print(region_list)

paygap_regions

You can see that there is one region called “None”. This is where postcodes have simply not returned a valid region – probably because they’re deprecated.

For each of the unique regions I calculate a mean longitude and latitude simply from the locations of all the companies in that region:

region_lons=[];region_lats=[]
for region in region_list:
    reg_lon = np.mean(lons[np.where(regions==region)])
    reg_lat = np.mean(lats[np.where(regions==region)])

    region_lons.append(reg_lon)
    region_lats.append(reg_lat)

region_lons = np.array(region_lons)
region_lats = np.array(region_lats)
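The same per-region averaging can also be written with boolean masks, which some readers find easier to follow than np.where. This is just a sketch with made-up coordinates; region_centroids is a hypothetical helper name and the demo arrays are not taken from the dataset.

```python
import numpy as np

def region_centroids(regions, lons, lats):
    """Map each unique region name to its mean (lon, lat)."""
    centroids = {}
    for region in np.unique(regions):
        mask = regions == region   # True for every company in this region
        centroids[str(region)] = (float(lons[mask].mean()), float(lats[mask].mean()))
    return centroids

# made-up values purely for illustration
demo_regions = np.array(['London', 'London', 'None'])
demo_lons = np.array([-0.1, -0.3, -1.0])
demo_lats = np.array([51.5, 51.7, -1.0])

demo_centroids = region_centroids(demo_regions, demo_lons, demo_lats)
print(demo_centroids)
```

Note that companies whose lookup failed carry the -1 placeholder coordinates, so the “None” region ends up with a meaningless centroid at (-1, -1).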

Getting Sector Specific

My intention is to compare the gender pay gap at local universities with the average gender pay gap for each region. To do that I need to work out which companies are universities. There’s probably a more sophisticated way of doing this, but for now I’m just going to extract all companies that have the word “university” in their name.

def is_university(name):
    str1="University"
    str2="university"
    str3='UNIVERSITY'

    i = name.find(str1)
    j = name.find(str2)
    k = name.find(str3)

    if (i>=0) or (j>=0) or (k>=0):
        uni = 1
    else:
        uni = 0

    return uni

I then loop through all the companies in the dataset and assign them a true flag (1) if they are a university and a false flag (0) if they aren’t:

universities=[]
for name in names:
    universities.append(is_university(name))

universities = np.array(universities)
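A slightly more compact version of the same check lower-cases the name first, so any capitalisation of “university” matches. The sketch below also accepts an optional list of extra keywords. That extension is hypothetical, and any keywords passed in would need checking by hand, since a broad word such as “college” would also match employers that aren’t universities.

```python
def looks_like_university(name, extra_keywords=()):
    """Return 1 if the name contains 'university' (any case) or an extra keyword, else 0."""
    lowered = name.lower()
    keywords = ('university',) + tuple(k.lower() for k in extra_keywords)
    return int(any(keyword in lowered for keyword in keywords))

print(looks_like_university('THE UNIVERSITY OF MANCHESTER'))
print(looks_like_university('Imperial College London'))
print(looks_like_university('Imperial College London',
                            extra_keywords=['imperial college']))
```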

Panda-ing the Data

Now I’ve got all the information that I need I’m going to compile it into a dataset that I can use for plotting.

I find that the dict structure is super easy to use, so my first step puts all of the data into a list of dicts:

data=[]
for region in region_list:

    # dict for non-university data:
    data_line1 = {
        'region' : region,
        'uni' : 0,
        'meandiff' : 0,
        # index with [0] to store a scalar rather than a one-element array:
        'lon' : region_lons[np.where(region_list==region)][0],
        'lat' : region_lats[np.where(region_list==region)][0]
    }

    # dict for university data:
    data_line2 = {
        'region' : region,
        'uni' : 1,
        'meandiff' : 0,
        'lon' : region_lons[np.where(region_list==region)][0],
        'lat' : region_lats[np.where(region_list==region)][0]
    }

    tmp_dm = diffmean[np.where(regions==region)]
    tmp_uni= universities[np.where(regions==region)]

    if len(tmp_dm[np.where(tmp_uni==1)])>0:
        uni_val = np.mean(tmp_dm[np.where(tmp_uni==1)])
    else:
        uni_val = 0.

    if len(tmp_dm[np.where(tmp_uni==0)])>0:
        non_uni_val = np.mean(tmp_dm[np.where(tmp_uni==0)])
    else:
        non_uni_val = 0.

    # write out non-uni data:
    data_line1['meandiff'] = non_uni_val
    data.append(data_line1)

    # write out uni data:
    data_line2['meandiff'] = uni_val
    data.append(data_line2)

But the Plotly example I’m hacking to make my visualisation works with a pandas dataframe. I decided that it was less work to turn my list of dicts into a dataframe than to hack the Plotly code, so here’s the conversion:

# create a pandas dataframe:
regs=[];unis=[];difs=[];lngs=[];ltts=[]
for line in data:
    regs.append(line['region'])
    unis.append(line['uni'])
    difs.append(line['meandiff'])
    lngs.append(line['lon'])
    ltts.append(line['lat'])

d = {'Region':regs,'Uni':unis,'MeanDiff':difs,'Lon':lngs,'Lat':ltts}
df = pd.DataFrame(data=d)
df.head()
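pandas can also build a dataframe directly from a list of dicts, which might save the intermediate lists. The sketch below uses two made-up records with the same keys as above and renames the columns to match the names used for plotting; demo_records and df_demo are illustrative names only.

```python
import pandas as pd

# made-up records with the same keys as the dicts built above
demo_records = [
    {'region': 'Example Region', 'uni': 0, 'meandiff': 14.0, 'lon': -1.5, 'lat': 53.0},
    {'region': 'Example Region', 'uni': 1, 'meandiff': 16.0, 'lon': -1.5, 'lat': 53.0},
]

# each dict becomes a row; the keys become the column names
df_demo = pd.DataFrame(demo_records).rename(columns={
    'region': 'Region',
    'uni': 'Uni',
    'meandiff': 'MeanDiff',
    'lon': 'Lon',
    'lat': 'Lat',
})
print(df_demo)
```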

 paygap_df

Plotting the Data

I’m hacking the Plotly example from this webpage to visualise the UK pay gap data.

For each region I want to display the university (Case 1) vs. non-university (Case 0) data, matching the 0/1 flag used above. I’m going to display them as scaled markers of two different colours.

I’ll start by defining an empty list for my plotting data in each Case, a list of colours corresponding to each Case, and a list of labels for each Case:

cases = []
colors = ['rgb(239,243,255)','rgb(107,174,214)']
sectors = {0:'Non-University',1:'University'}

I’m then going to loop through my Cases (backwards) adding all the plotting data for each region into the cases list:

for i in range(0,2)[::-1]:

    cases.append(go.Scattergeo(
        lon = df[ df['Uni'] == i ]['Lon'],
        lat = df[ df['Uni'] == i ]['Lat'],
        text = df[ df['Uni'] == i ]['MeanDiff'],
        name = sectors[i],
        marker = dict(
            size = df[ df['Uni'] == i ]['MeanDiff']*2,
            color = colors[i],
            line = dict(width = 0)
        ),
    ) )
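One thing to watch with this scaling is that a negative mean pay gap would give a negative marker size, which isn’t meaningful for a marker. A minimal sketch of one way round that is to scale the absolute value instead. Here marker_sizes is a hypothetical helper and the factor of 2 simply mirrors the code above.

```python
def marker_sizes(mean_diffs, scale=2.0):
    """Scale pay gaps (in percent) to non-negative marker sizes."""
    return [abs(value) * scale for value in mean_diffs]

print(marker_sizes([12.5, -3.0, 0.0]))  # prints [25.0, 6.0, 0.0]
```

The sign of the gap is lost this way, so it would need showing by some other means (the hover text, say) if any regional average turned out to be negative.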

When the data is displayed it will just show the markers. If we want to label each marker with (e.g.) the name of the region, we could add something like this:

cases[0]['text'] = df[ df['Uni'] == 1 ]['Region']
cases[0]['mode'] = 'markers+text'
cases[0]['textposition'] = 'bottom center'

However… because of the distribution of markers these labels don’t actually look very sensible. I left them out in the end.

Finally we just need to define the layout. I’ve chosen a hammer projection and I’ve zoomed in since the data I have is largely for England.

layout = go.Layout(
    title = 'UK Pay Gap Data 2017',
    geo = dict(
        resolution = 50,
        scope = 'europe',  # 'uk' is not one of Plotly's geo scopes; the axis ranges below do the zooming
        showframe = False,
        showcoastlines = True,
        showland = True,
        landcolor = "rgb(229, 229, 229)",
        countrycolor = "rgb(255, 255, 255)" ,
        coastlinecolor = "rgb(255, 255, 255)",
        projection = dict(
            type = 'hammer'
        ),
        lonaxis = dict( range= [ -5.0, +2.0 ] ),
        lataxis = dict( range= [ 50.0, 57.0 ] ),
        domain = dict(
            x = [ 0, 1 ],
            y = [ 0, 1 ]
        )
    )
)

…and make a plot:

fig = go.Figure(layout=layout, data=cases)
py.iplot(fig, validate=False, filename='UK Pay Gap Data 2017')

Which produces a plot like this:

paygap_plot

Basically, if there is a dark ring around a light ring then the university sector pay gap is larger than the non-university sector average, and if there is a light ring around a dark ring then vice-versa.

The current status…

It seems that the pay gap overall is smallest in Wales, Scotland and the West Midlands, and that the university sector is better than average in those areas. However, there is currently only one university sector return for each of Wales and Scotland, so it’s probably not a good idea to read too much into any regional comparison from those numbers at the moment.

The university sector gender pay gap seems to be comparatively worse in the East of England, the North West and the North East, with other regions being about even between the university sector and other sectors – apart from London where the pay gap is smaller for universities than other sectors [see caveats].

There are currently no returns from university sector employers in Northern Ireland.

This is the current status because of a number of caveats…

Caveats:

  1. I haven’t analysed the regional distribution of companies that have unrecognised/deprecated postcodes. These account for ~2% of the overall dataset.
  2. Some universities do not have the word “university” in their company name. Examples include Imperial College London, Royal Holloway College and the Oxbridge Colleges. I’ll include something to catch this in the final update.
  3. Not all employers have returned their data yet.

I will be updating the method and the results as more data appear.

 
