Classification is not something I do a lot of day to day, so when I was asked to give a lecture on random forest classification using
scikit-learn I went looking for a random example to help me play around.
In my lecture I (of course) started off with the canonical
scikit-learn classification example using Fisher’s Iris Data, but I also wanted to include something new – but simple enough for students to follow.
But where to go for a dataset? Well, I’ve always wanted to start using the data from the UK government open data portal but haven’t really had a reason to. Bam, reason.
I was doing things on rather short notice (i.e. I’d left it a bit late) so I didn’t trawl the website for ages. I went straight for something easy: weather data!
The site hosts historical data from weather stations around the UK that, in some cases, goes back to 1853. I was sold.
I wanted a simple example, so I just picked 4 out of the 37 available datasets. I deliberately tried to pick quite well separated stations and these are the ones I ended up with:
- Lerwick (Shetland)
- Eastbourne (East Sussex)
- Camborne (Cornwall)
- Cwmystwyth (Ceredigion)
These are pretty simple datasets, with structured numerical data. You have to be a bit careful though as some datasets have missing values (see the little exclamation mark next to Cwmystwyth above?) and there’s a little bit of cleaning required. Otherwise they look like this:
I just saved each one as a text file. I prefixed each line of the header information (the descriptive bit at the top) with a "#" so it's easy to skip over when reading the data in.
Right, let’s get cracking. For starters we’re probably going to need some libraries.
First the really standard stuff:
import numpy as np   # for array stuff
import pylab as pl   # for plotting stuff

Then we want the random forest classifier from scikit-learn:
from sklearn.ensemble import RandomForestClassifier as RFC
and finally we need the pandas library, because we’ll use it to format our data:
import pandas as pd # for data formatting
We can then specify the names of all of those data files from the weather stations:
datfile1 = './DATA/Lerwick.dat'
datfile2 = './DATA/Eastbourne.dat'
datfile3 = './DATA/Camborne.dat'
datfile4 = './DATA/Cwmystwyth.dat'
and define a function to read them. Note that when I was cleaning the data up I replaced missing data values with
"---" in my text files.
def read_data(datafile):

    # open the file as read only
    infile = open(datafile, 'r')

    # initialise some empty lists for the different parameters:
    year = []; mm = []; tmax = []; tmin = []; af = []; rain = []; sun = []

    # loop through the lines in the file:
    while True:

        # read each line
        # if there's no line to read (end of file), exit
        line = infile.readline()
        if not line:
            break

        # remove timestamps with missing data:
        if '---' not in line:

            # split each line into pieces:
            items = line.split()

            # remove header info:
            if items[0] != '#':
                year.append(float(items[0]))

                # remove the qualifiers ('#' and '*') from measurements:
                for vals, item in zip((mm, tmax, tmin, af, rain, sun), items[1:7]):
                    if (item[-1] != '#') and (item[-1] != '*'):
                        vals.append(float(item))
                    else:
                        vals.append(float(item[:-1]))

    infile.close()

    # convert the lists into numpy arrays:
    year = np.array(year)
    mm = np.array(mm)
    tmax = np.array(tmax)
    tmin = np.array(tmin)
    af = np.array(af)
    rain = np.array(rain)
    sun = np.array(sun)

    # stack all the arrays of data into one big array
    data = np.vstack((year, mm, tmax, tmin, af, rain, sun))

    return data
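The per-column logic in that function boils down to one small rule: measurements flagged as qualified carry a trailing "*" or "#" character, and the number itself is still valid once the flag is dropped. A minimal sketch of just that rule (the name strip_qualifier is mine, not from the original code):

```python
def strip_qualifier(token):
    # drop a trailing '*' or '#' qualifier before converting to float
    if token[-1] in ('#', '*'):
        return float(token[:-1])
    return float(token)

print(strip_qualifier('12.3'))    # plain value
print(strip_qualifier('12.3*'))   # qualified value
print(strip_qualifier('170.1#'))  # qualified value
```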
We can then use this function to read all the data:
# Lerwick:
data1 = read_data(datfile1)
# Eastbourne:
data2 = read_data(datfile2)
# Camborne:
data3 = read_data(datfile3)
# Cwmystwyth:
data4 = read_data(datfile4)
We also need to specify which dataset comes from which target, i.e. location. These are numerical labels and my order is:

1. Lerwick
2. Eastbourne
3. Camborne
4. Cwmystwyth

(except Python is zero-indexed, so they become 0, 1, 2, 3 rather than 1, 2, 3, 4)
# make an array of zeros, one label per sample (samples are along axis 1):
target1 = np.zeros(data1.shape[1])
# make an array of ones:
target2 = np.zeros(data2.shape[1])
target2 += 1
# make an array of twos:
target3 = np.zeros(data3.shape[1])
target3 += 2
# make an array of threes:
target4 = np.zeros(data4.shape[1])
target4 += 3
Then we can stack all of these into a neat master array of data and a neat master array of targets:
# Stack all the data into a single array and
# make sure it's the right way around...
data = np.hstack((data1, data2, data3, data4))
data = data.transpose()
target = np.hstack((target1, target2, target3, target4))
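The transpose matters: read_data returns arrays with features as rows, but scikit-learn expects one sample per row. A small sketch with toy arrays (the shapes here are my own invention) shows what the stacking and transposing do:

```python
import numpy as np

# toy stand-ins for two stations: 7 parameters by N monthly samples
a = np.ones((7, 5))
b = np.ones((7, 3))

# hstack keeps features as rows: shape (7, 8)
stacked = np.hstack((a, b))

# transpose gives samples as rows, which is what scikit-learn expects: (8, 7)
data_toy = stacked.transpose()

print(stacked.shape, data_toy.shape)
```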
Once we’ve got all the data into an array we must remember to specify what each thing is, i.e. what are our features (variables) and what are our targets (classes):
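The code for this step didn't survive into the text above, but given the column order in read_data and the target order listed earlier, the names presumably looked something like this (the exact label strings are my reconstruction):

```python
import numpy as np

# feature (column) names, matching the order in read_data:
feature_names = ['year', 'mm', 'tmax', 'tmin', 'af', 'rain', 'sun']

# target (class) names as a numpy array, so it can be indexed
# by an array of predicted codes later on:
target_names = np.array(['Lerwick', 'Eastbourne', 'Camborne', 'Cwmystwyth'])
```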
We can then take a look at our data if we like:
# choose parameter pair:
p1 = 1; p2 = 2

# plot the data for Lerwick (red):
pl.scatter(data[np.where(target==0), p1], data[np.where(target==0), p2],
           s=10, label=target_names[0], c='r')
# plot the data for Eastbourne (blue):
pl.scatter(data[np.where(target==1), p1], data[np.where(target==1), p2],
           s=10, label=target_names[1], c='b')
# plot the data for Camborne (green):
pl.scatter(data[np.where(target==2), p1], data[np.where(target==2), p2],
           s=10, label=target_names[2], c='g')
# plot the data for Cwmystwyth (yellow):
pl.scatter(data[np.where(target==3), p1], data[np.where(target==3), p2],
           s=10, label=target_names[3], c='y')

# Label the axes:
pl.xlabel(feature_names[p1])
pl.ylabel(feature_names[p2])

# Add a legend
pl.legend(loc='best')

# display the plot:
pl.show()
and it might look something like this:
To use the scikit-learn random forest classifier we need to put our data into a pandas dataframe, but that’s pretty simple.
df = pd.DataFrame(data, columns=feature_names)
At this point you might be wondering: what exactly are we going to classify? Well, I wanted this example to work in a similar way to the Fisher’s Iris Classification, so what I’m going to do is select a fraction of the dataset as training data and use the classifier trained on those data to predict the locations of all the other measurements.
So let’s specify a random subset of our data as our training data points.
We can specify a particular fraction (“frac”) of the data to be the training data set, e.g. “frac=0.75” means we use 75% of the data; “frac=0.5” means we use 50% of the data.
frac = 0.75
df['is_train'] = np.random.uniform(0, 1, len(df)) <= frac
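This draws a uniform random number per row and flags the row as training data when it falls below frac, so we get *roughly* (not exactly) 75% training data. A quick self-contained check on a toy frame (the seed and frame contents are my own, just to make the sketch repeatable):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # only so this sketch is repeatable
df_toy = pd.DataFrame({'x': np.arange(1000)})
df_toy['is_train'] = np.random.uniform(0, 1, len(df_toy)) <= 0.75

# roughly 75% of rows end up flagged as training data
print(df_toy['is_train'].mean())
```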
The pandas data frame also allows us to specify the class associated with each measurement as its actual target name, i.e. it takes the list of numerical values in target and makes a list of strings by matching the numerical indices with their corresponding names in target_names.
df['places'] = pd.Categorical.from_codes(target.astype(int), target_names)
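(Note the astype(int): from_codes expects integer codes, and our target array was built from floats.) The code-to-name mapping is positional, which a tiny standalone example makes clear:

```python
import pandas as pd

# integer codes map to names by position in the categories list:
codes = [0, 1, 2, 3, 0]
names = ['Lerwick', 'Eastbourne', 'Camborne', 'Cwmystwyth']
places = pd.Categorical.from_codes(codes, names)

print(list(places))
```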
We can unpack the training data points and the test data points into their own data frames:
train, test = df[df['is_train']==True], df[df['is_train']==False]
and specify the “features” (the variables):
features = df.columns[0:7]
After that’s done we are ready to set up our random forest classifier. We can specify how many jobs to run in parallel (“n_jobs”) and the number of decision trees in our forest (“n_estimators”).
forest = RFC(n_jobs=2,n_estimators=100)
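The weather data itself isn't to hand in a standalone snippet, but the classifier can be sanity-checked end to end on toy data. Everything here (the two Gaussian clusters, the random_state) is my own invention, not the original example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy, trivially separable data: class 0 near (0,0), class 1 near (10,10)
rng = np.random.RandomState(42)
X = np.vstack((rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))))
y = np.array([0]*50 + [1]*50)

forest = RandomForestClassifier(n_jobs=2, n_estimators=100, random_state=42)
forest.fit(X, y)

# points at the two cluster centres should classify cleanly
preds = forest.predict([[0, 0], [10, 10]])
print(preds)
```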
We now need to convert our list of class names back into a list of integers. Why do we need to do this? Good question.
y, _ = pd.factorize(train['places'])
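What factorize does is assign an integer code to each distinct label, in order of first appearance, and hand back both the codes and the unique labels. A small standalone example (the label list is my own):

```python
import pandas as pd

labels = ['Camborne', 'Lerwick', 'Camborne', 'Eastbourne']
codes, uniques = pd.factorize(labels)

# codes are assigned in order of first appearance:
print(list(codes), list(uniques))
```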
We’re now ready to fit to / learn from our training data:
forest.fit(train[features], y)
and once that’s done we can predict what class our test data points should be:
preds = target_names[forest.predict(test[features])]
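This line works because indexing a numpy array with an array of integer codes returns the matching names in one go; it's also why target_names needs to be a numpy array rather than a plain list. A tiny illustration (the codes here are arbitrary):

```python
import numpy as np

target_names = np.array(['Lerwick', 'Eastbourne', 'Camborne', 'Cwmystwyth'])
codes = np.array([2, 0, 3])

# fancy indexing: one name per code, in order
print(list(target_names[codes]))
```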
So… let’s see how we did. We can use pandas to cross-match the original classes of the test data with the predicted classifications made using the random forest.
print(pd.crosstab(index=test['places'], columns=preds,
                  rownames=['actual'], colnames=['preds']))
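The result is a confusion matrix: rows are the true locations, columns the predicted ones, and off-diagonal counts are misclassifications. A toy version with made-up labels shows the shape of the output:

```python
import pandas as pd

# toy actual vs predicted labels (my own invented values):
actual = pd.Series(['Lerwick', 'Lerwick', 'Camborne', 'Camborne'])
predicted = pd.Series(['Lerwick', 'Camborne', 'Camborne', 'Camborne'])

table = pd.crosstab(index=actual, columns=predicted,
                    rownames=['actual'], colnames=['preds'])
print(table)
```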
One thing that might be useful to know would be which variables (“features”) were most important for the classifier. Let’s extract that information:
importances = forest.feature_importances_
indices = np.argsort(importances)
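The argsort gives the indices that would sort the importances in ascending order, so the least important feature plots first (at the bottom of the bar chart). A quick standalone check with invented importance values:

```python
import numpy as np

# argsort returns indices that would sort the array, ascending:
importances_toy = np.array([0.30, 0.05, 0.45, 0.20])
indices_toy = np.argsort(importances_toy)

print(list(indices_toy))  # least important feature first
```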
…and let’s plot it so we can see easily what was crucial.
pl.figure(1)
pl.title('Feature Importances')
pl.barh(range(len(indices)), importances[indices], color='b', align='center')
pl.yticks(range(len(indices)), features[indices])
pl.xlabel('Relative Importance')
pl.show()
It seems that pretty much everything – except perhaps air frost – was important here.
So what have we learned? Probably not much.
- That the weather in Camborne is rather ambiguous.
- That if you don’t want someone to know where you are from your weather station measurements, you’re better off living in Camborne than Lerwick.
- That you can classify things based on really simplistic data.