# Gaussian Process Modelling in Python

Non-linear regression is pretty central to a lot of machine learning applications. However, when you don’t know enough/anything about the actual physical parametric dependencies of a function it can be a bit of a show-stopper.

But… what if you could predict the value of a function at any point based only on its value at other points?

I’m not talking about fitting a physically motivated model. I’m talking about assuming that the behaviour of a function depends solely on the covariance between measurements at different separations.

[Note: this post follows on from a previous post.]

### Gaussian Process Modelling

If we make that assumption, then we can invert the probability statement

$p(y_i) = N (\mu_i, K_{ij})$

to calculate the value of a hypothetical measurement at a test position, $x_{\ast}$, based on existing measurements, y, at positions, x.

Key to this are two things:

1. Knowing the form of the covariance a priori; or, having sufficient example data points to calculate it;
2. Having a sufficient number of example data points to provide fixed points in the function a priori. These are our training data.

Normally if you can satisfy the second of these requirements, the first follows naturally.

Inverting the above probability statement allows us to write the following equations. These give us the posterior mean ($m_{\ast}$) and the posterior variance ($C_{\ast}$) at that point, i.e. the value and the uncertainty on that value:

$\mathbf{m_{\ast}} = \mathbf{ K(x_{\ast},x)^T K(x,x)^{-1} y } \\ \mathbf{C_{\ast}} = \mathbf{ K(x_{\ast},x_{\ast}) - K(x_{\ast},x)^T K(x,x)^{-1} K(x_{\ast},x) }$

This is more simply written as:

$\mathbf{m_{\ast}} = \mathbf{ k^T_{\ast} K^{-1} y } \\ \mathbf{C_{\ast}} = \mathbf{ k(x_{\ast},x_{\ast}) - k^T_{\ast} K^{-1} k_{\ast} }$

assuming that the original measurements have zero mean. If they don’t it becomes:

$\mathbf{m_{\ast}} = \mu_{\ast} + \mathbf{ k^T_{\ast} K^{-1} (y - \mu) } \\ \mathbf{C_{\ast}} = \mathbf{ k(x_{\ast},x_{\ast}) - k^T_{\ast} K^{-1} k_{\ast} }.$

To understand where this comes from in more detail you can read Chapter 2 of Rasmussen & Williams.

So, how do we actually practically do this in Python?

### Prediction in Python

In a previous post I went through the steps for simulating covariate data in Python.

We ended up with some covariate data that looked like this:

If we take the final realization from these data, which has $\lambda = 5$, and select 5 points from it as our training data then we can calculate the posterior mean and variance at any other point based on those five training points.

[Note: This code continues directly on from the code in the previous post]

First off, let’s randomly select our training points and allocate all the data positions in our realisation as either training or test:

# set number of training points
nx_training = 5

# randomly select the training points:
tmp = np.random.uniform(low=0.0, high=2000.0, size=nx_training)
tmp = tmp.astype(int)

condition = np.zeros_like(x)
for i in tmp: condition[i] = 1.0

y_train = y[np.where(condition==1.0)]
x_train = x[np.where(condition==1.0)]
y_test = y[np.where(condition==0.0)]
x_test = x[np.where(condition==0.0)]


Next we can use those 5 training data points to make a covariance matrix using the function we defined in the previous post:


# define the covariance matrix:
K = make_K(x_train,h,lam)



To calculate the posterior mean and variance we’re going to need to calculate the inverse of our covariance matrix. For a small matrix like we have here, we can do this using the numpy library linear algebra functionality:


# take the inverse:
iK = np.linalg.inv(K)


However, be careful inverting larger matrices in Python. The numerical stability of the numpy linear algebra inversion is pretty poor. The scipy linear algebra inversion is better because it always uses BLAS/LAPACK, but whenever possible don’t invert a matrix at all.

And now we’re ready to calculate the posterior mean and variance (or standard deviation, which is the square-root of the variance):


m=[];sig=[]
for xx in x_test:

# find the 1d covariance matrix:
K_x = cov_kernel(xx, x_train, h, lam)

# find the kernel for (xx,xx):
k_xx = cov_kernel(xx, xx, h, lam)

# calculate the posterior mean and variance:
m_xx = np.dot(K_x.T,np.dot(iK,y_train))
sig_xx = k_xx - np.dot(K_x.T,np.dot(iK,K_x))

m.append(m_xx)
sig.append(np.sqrt(np.abs(sig_xx))) # note sqrt to get stdev from variance



Let’s see how we did. Here I’m going to plot the training data points, as well as the original realisation (dashed line) that we drew them from, on the left. On the right I’m going to plot the same data plus the predicted posterior mean (solid line) and the standard deviation (shaded area).


# m and sig are currently lists - turn them into numpy arrays:
m=np.array(m);sig=np.array(sig)

# make some plots:

# left-hand plot
ax = pl.subplot(121)
pl.scatter(x_train,y_train)  # plot the training points
pl.plot(x,y,ls=':')        # plot the original data they were drawn from
pl.title("Input")

# right-hand plot
ax = pl.subplot(122)
pl.plot(x_test,m,ls='-')     # plot the predicted values
pl.plot(x_test,y_test,ls=':') # plot the original values

# shade in the area inside a one standard deviation bound:
ax.fill_between(x_test,m-sig,m+sig,facecolor='lightgrey', lw=0, interpolate=True)
pl.title("Predicted")

pl.scatter(x_train,y_train)  # plot the training points

# display the plot:
pl.show()



We can see that the prediction is pretty good. It’s not exactly the same as the original realisation, but then again it would be pretty surprising if we could predict things perfectly based on incomplete data.

The shaded regions of uncertainty are also useful because they tell us how much accuracy we should expect in any given range of positions. Note that the further away from our training data we go, the larger the uncertainty becomes. On the right hand side it explodes because we’ve run out of training data points to constrain the posterior mean.

Statistically we should not expect our original realisation to lie within the shaded region all the time: it’s a one sigma limit, i.e. we should expect the values of the original realisation to lie within one sigma of the posterior mean ~68% of the time.