Spinning Stars II : Célérité

Gaussian Process Modelling is all very well, but when you've got lots of data points – and potentially lots of hyper-parameters – optimizing the values of those hyper-parameters can be very computationally expensive, i.e. slow.

In my previous post I was using GPM to extract the rotation period of stars from Kepler data. For that case, I just used the scipy direct gradient optimization routine to fit the hyper-parameters; however, if we’d tried to map out the posterior probability space using MCMC it would have taken a couple of hours just to run the burn-in.

Fortunately, the new celerite GPM library has been formulated precisely to allow fast evaluation of the GP likelihood function. Hence the name.

This example follows on directly from my previous post and uses exactly the same dataset. The code below picks up from the point where we first plotted the data, which should look like this:

figure_1
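
Throughout the rest of this post I'm assuming the same setup as the previous post carries over; in particular, the names below are the ones used in every code snippet that follows. This block is just a reminder of those assumptions, not code from either post:

# assumed to carry over from the previous post:
import numpy as np
import matplotlib.pyplot as pl

# time  : 1-d array of observation times [days]
# value : 1-d array of relative fluxes [ppt]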

Celerite

In this example we're not going to use the george GPM library; we're going to use the new celerite one. So we need to import it, along with the celerite equivalent of george's kernels, which are called terms.

import celerite
from celerite import terms

The paper suggests two kernels for modelling the covariance of the Kepler data (Eqs. 55 & 56). The authors fit Eq. 56 and we are going to do the same (if you want to fit Eq. 55, see my previous post):

k(\tau) = \frac{B}{2+C}\exp\left(-\frac{\tau}{L}\right) \left[ \cos{\left( \frac{2\pi\tau}{P} \right)} + (1+C) \right]

This is the same as the CustomTerm described in the celerite documentation here.

There is one small discrepancy though: the exponential decay is parameterised differently. The paper writes it as \exp(-\tau/L), whereas the CustomTerm works with \exp(-c\tau), i.e. c = 1/L.

This doesn't mean we need to change anything in the kernel itself… except for our prior bounds. We're going to apply those bounds in logarithmic space, so the bounds on \ln L become bounds on \ln c with a minus sign in front (and the limits swapped), since:

\ln(1/L) = -\ln(L).
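
As a concrete example, the bounds of (-5.0, -1.5) on \ln c that we'll use in bnds further down correspond to bounds of (1.5, 5.0) on \ln L:

# converting a prior range on ln(L) into the equivalent range on ln(c) = -ln(L):
lnL_lo, lnL_hi = 1.5, 5.0
lnc_lo, lnc_hi = -lnL_hi, -lnL_lo   # gives (-5.0, -1.5)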

We can just copy the definition of the CustomTerm kernel exactly as it appears in the documentation:

import autograd.numpy as np

class CustomTerm(terms.Term):
    parameter_names = ("log_a", "log_b", "log_c", "log_P")

    def get_real_coefficients(self, params):
        log_a, log_b, log_c, log_P = params
        b = np.exp(log_b)
        return (
            np.exp(log_a) * (1.0 + b) / (2.0 + b), np.exp(log_c),
        )

    def get_complex_coefficients(self, params):
        log_a, log_b, log_c, log_P = params
        b = np.exp(log_b)
        return (
            np.exp(log_a) / (2.0 + b), 0.0,
            np.exp(log_c), 2*np.pi*np.exp(-log_P),
        )

We need to pick some first-guess hyper-parameters. Because I'm lazy, I'll just start by setting them all to unity (i.e. all of the log-parameters to zero):

log_a = 0.0; log_b = 0.0; log_c = 0.0; log_P = 0.0
kernel = CustomTerm(log_a, log_b, log_c, log_P)

We then need to instantiate a celerite Gaussian process with this kernel:

gp = celerite.GP(kernel, mean=0.0)

…and compute the covariance matrix for our input times:

gp.compute(time)

print("Initial log-likelihood: {0}".format(gp.log_likelihood(value)))

Initial log-likelihood: 1691.35372509

Let's see what the predicted posterior mean and variance look like:

t = np.arange(np.min(time),np.max(time),0.1)

# calculate expectation and variance at each point:
mu, cov = gp.predict(value, t)
std = np.sqrt(np.diag(cov))

…and plot them up:

ax = pl.subplot(111)
pl.plot(t,mu)
ax.fill_between(t,mu-std,mu+std,facecolor='lightblue', lw=0, interpolate=True)
pl.scatter(time,value,s=2)
pl.axis([0.,60.,-1.,1.])
pl.ylabel("Relative flux [ppt]")
pl.xlabel("Time [days]")
pl.show()

figure_4

Hyper-parameter Optimization

Now we’re ready to optimise the values of the hyper-parameters.

The paper says:

As with the earlier examples, we start by estimating the MAP parameters using L-BFGS-B

So let's do that. We'll use the scipy optimiser. Since it minimises rather than maximises, we need to define a function returning the negative log(likelihood), plus a function returning its gradient:

def nll(p, y, gp):

    # Update the kernel parameters:
    gp.set_parameter_vector(p)

    #  Compute the loglikelihood:
    ll = gp.log_likelihood(y)

    # The scipy optimizer doesn't play well with infinities:
    return -ll if np.isfinite(ll) else 1e25

def grad_nll(p, y, gp):

    # Update the kernel parameters:
    gp.set_parameter_vector(p)

    #  Compute the gradient of the loglikelihood:
    gll = gp.grad_log_likelihood(y)[1]

    return -gll
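
Before handing these to the optimiser it's worth checking that the analytic gradient agrees with a finite-difference estimate. This is just an optional sketch using scipy's check_grad, not part of the original workflow:

# optional sanity check: the returned value should be small
from scipy.optimize import check_grad
p_check = gp.get_parameter_vector()
print(check_grad(nll, grad_nll, p_check, value, gp))
gp.set_parameter_vector(p_check)   # restore, since nll/grad_nll update gp in place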

I'm going to set bounds on the available parameter space, i.e. our prior volume, using the ranges taken from Table 4 of this paper.

import scipy.optimize as op

# extract our initial guess at parameters
# from the celerite kernel and put it in a
# vector:
p0 = gp.get_parameter_vector()

# set prior ranges
# Note that these are in *logarithmic* space
bnds = ((-10.,0.),(-5.,5.),(-5.,-1.5),(-3.,5.))

# run optimization:
results = op.minimize(nll, p0, method='L-BFGS-B', jac=grad_nll, bounds=bnds, args=(value, gp))
print np.exp(results.x)
print("Final log-likelihood: {0}".format(-results.fun))

[ 0.24136991 0.00673795 0.22313016 4.2429933 ]
Final log-likelihood: 4421.07974654

The key parameter here is the period, which is the fourth number along (remember we've exponentiated the parameter vector, so this is P in days). We expect it to be about 3.9 days and… we're getting 4.24, so not a million miles off.

From the paper:

This star has a published rotation period of 3.88 ± 0.58 days, measured using traditional periodogram and autocorrelation function approaches applied to Kepler data from Quarters 0–16 (Mathur et al. 2014), covering about four years.

Let’s now pass these optimised parameters to celerite and recompute our prediction:

# pass the optimised parameters to the celerite kernel:
gp.set_parameter_vector(results.x)
t = np.arange(np.min(time),np.max(time),0.1)

# calculate expectation and variance at each point:
mu, cov = gp.predict(value, t)
std = np.sqrt(np.diag(cov))
ax = pl.subplot(111)
pl.plot(t,mu)
ax.fill_between(t,mu-std,mu+std,facecolor='lightblue', lw=0, interpolate=True)
pl.scatter(time,value,s=2)
pl.axis([0.,60.,-1.,1.])
pl.ylabel("Relative flux [ppt]")
pl.xlabel("Time [days]")
pl.show()

figure_5

MCMC Optimization

I’m now going to use MCMC to optimise the hyper-parameters in a more thorough manner by making sure we properly explore the posterior probability distribution. I’ll be using the emcee library.

From now on we need to be more careful about how we deal with that slight difference between the CustomTerm and Eq. 56 from the paper. Keep your eye on it.
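
To make that mapping explicit, here is the conversion between the vector we sample over and the vector celerite stores internally. These two helper functions are purely illustrative (they're hypothetical; the code below applies the same conversion inline):

# sampled vector:  (lnB, lnC, lnL, P)
# celerite vector: (log_a, log_b, log_c, log_P) = (lnB, lnC, -lnL, ln P)
def to_celerite(p):
    return np.array([p[0], p[1], -1.*p[2], np.log(p[3])])

def from_celerite(q):
    return np.array([q[0], q[1], -1.*q[2], np.exp(q[3])])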

import emcee

# we need to define three functions:
# a log likelihood, a log prior & a log posterior.

First we need to define a log(likelihood). We'll use the one implemented in the celerite library, which evaluates:

\ln L = -\frac{1}{2}(y - \mu)^{\rm T} C^{-1}(y - \mu) - \frac{1}{2}\ln |C\,| - \frac{N}{2}\ln 2\pi

(see Eq. 5 in this paper).
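
For comparison, this is what that expression looks like evaluated the slow, brute-force way with a dense covariance matrix C (a sketch only, assuming a zero mean function as above). This O(N^3) calculation is exactly what celerite's solver avoids:

# brute-force Gaussian log-likelihood via a Cholesky factorisation of C:
from scipy.linalg import cho_factor, cho_solve

def slow_log_likelihood(y, C):
    factor = cho_factor(C)
    logdet = 2.0*np.sum(np.log(np.diag(factor[0])))
    return -0.5*(np.dot(y, cho_solve(factor, y)) + logdet + len(y)*np.log(2.0*np.pi))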

# set the loglikelihood:
def lnlike(p, x, y):

    ln_a = p[0]
    ln_b = p[1]
    ln_c = -1.*p[2]      # we pass ln(c) to the CustomTerm, and ln(c) = -ln(L)
    ln_p = np.log(p[3])  # we're sampling linearly from P so we need to log it

    p0 = np.array([ln_a,ln_b,ln_c,ln_p])

    # update kernel parameters:
    gp.set_parameter_vector(p0)

    # calculate the likelihood:
    ll = gp.log_likelihood(y)

    # return
    return ll if np.isfinite(ll) else -np.inf

We also need to specify our parameter priors. I started off with priors the same as those specified in Table 3 of the paper:

table3

But I ended up expanding the prior on \ln C to be U(-15.0,5.0) because I wanted to see if the posterior peaked anywhere (it didn’t). I also amended the prior on \ln L to be U(0.0,5.0) because it seemed to favour smaller values. This may not be a sensible thing to do physically…

 

# set the logprior
def lnprior(p):

    # These ranges are adapted from Table 3
    # of https://arxiv.org/pdf/1703.09710.pdf

    lnB = p[0]
    lnC = p[1]
    lnL = p[2]
    lnP = np.log(p[3])

    # adapted prior from paper
    if (-10.<lnB<0.) and (-15.<lnC<5.) and (0.0<lnL<5.) and (-3.<lnP<5.):
        return 0.0

    return -np.inf
    #return gp.log_prior()

We then need to combine our log likelihood and our log prior into an (unnormalised) log posterior:

# set the logposterior:
def lnprob(p, x, y):

    lp = lnprior(p)

    return lp + lnlike(p, x, y) if np.isfinite(lp) else -np.inf

OK, now that we have our probability functions set up, we can run the MCMC. We'll start by explicitly specifying our Kepler data as our training data:

x_train = time
y_train = value

The paper then says:

initialize 32 walkers by sampling from an isotropic Gaussian with a standard deviation of 10^{-5} centred on the MAP parameters.

So, let’s do that:

# put the training data into a single tuple:
data = (x_train,y_train)

# set your initial guess parameters
# as the output from the scipy optimiser
# remember celerite keeps these in ln() form!

# we will sample from:
# lnB, lnC, lnL, P
p = gp.get_parameter_vector()
initial = np.array([p[0],p[1],-1.*p[2],np.exp(p[3])])
print "Initial guesses: ",initial

# initial log(likelihood):
init_logL = gp.log_likelihood(y_train)

# set the dimension of the prior volume
# (i.e. how many parameters do you have?)
ndim = len(initial)
print "Number of parameters: ",ndim

# The number of walkers needs to be more than twice
# the dimension of your parameter space.
nwalkers = 32

# perturb your inital guess parameters very slightly (10^-5)
# to get your starting values:
p0 = [np.array(initial) + 1e-5 * np.random.randn(ndim)
      for i in xrange(nwalkers)]

Initial guesses: [-1.37305713 -4.60605199 1.55092649 4.31624974]
Number of parameters: 4

We can then use these inputs to initiate our sampler:

# initalise the sampler:
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=data)

The paper says:

We run 500 steps of burn-in, followed by 5000 steps of MCMC using emcee.

First let’s run the burn-in:

# run a few samples as a burn-in:
print("Running burn-in")
p0, lnp, _ = sampler.run_mcmc(p0, 500)
sampler.reset()

print "Finished"

Now let’s run the production MCMC:

# take the highest-probability point from the burn-in as a
# starting point and now begin your production run:
print("Running production")
p = p0[np.argmax(lnp)]
p0 = [p + 1e-5 * np.random.randn(ndim) for i in xrange(nwalkers)]
p0, lnp, _ = sampler.run_mcmc(p0, 5000)

print "Finished"
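
A quick diagnostic worth printing at this point (my addition, not in the original run) is the mean acceptance fraction, which for emcee should typically land somewhere around 0.2 to 0.5:

# sanity check on the sampling itself:
print("Mean acceptance fraction: {0}".format(np.mean(sampler.acceptance_fraction)))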

Let’s see what the posterior probability distributions look like. For this I’m using the corner library, which is pip installable.

[Note: normally I would remove the first 20\tau samples from my chains (where \tau is the integrated autocorrelation time), to make sure I'm only keeping converged samples, but it doesn't make much difference here so I'm leaving it out for now.]
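
If you did want to do that, a sketch would look something like this (assuming emcee's built-in autocorrelation-time estimate behaves itself for these chains):

# estimate the integrated autocorrelation time for each parameter,
# drop the first 20*tau steps from every walker, then flatten:
tau = sampler.get_autocorr_time()
burn = int(20*np.max(tau))
samples = sampler.chain[:, burn:, :].reshape((-1, ndim))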

I’m also going to extract the maximum likelihood values of the hyper-parameters and see how the likelihood value compares to the one from the direct optimization.

import corner

# Find the maximum likelihood values:
ml = p0[np.argmax(lnp)]
print "Maximum likelihood parameters: ",ml

MLlnB = ml[0]
MLlnC = ml[1]
MLlnL = ml[2]
MLlnP = np.log(ml[3])

p = np.array([MLlnB,MLlnC,MLlnL,MLlnP])
gp.set_parameter_vector(p)
ml_logL = gp.log_likelihood(y_train)
print "ML logL:", ml_logL

# flatten the chains into a single array of posterior samples:
samples = sampler.flatchain

# Plot it.
figure = corner.corner(samples, labels=[r"$\ln B$", r"$\ln C$", r"$\ln L$", r"$P$"],
                         truths=ml,
                         quantiles=[0.16,0.5,0.84],
                         levels=[0.39,0.86,0.99],
                         #levels=[0.68,0.95,0.99],
                         title="KIC 1430163",
                         show_titles=True, title_args={"fontsize": 12})

Maximum likelihood parameters: [-2.14710517 -6.13290444 0.76680865 4.02901795]
ML logL: 4426.0958393

 

figure_7
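
Since the corner plot already marks the 16th, 50th and 84th percentiles, we can pull a headline rotation period straight out of the chains too (P is the fourth column of samples). This is just a convenience snippet of my own:

# summarise the rotation period as the median with 16th/84th percentile errors:
P_lo, P_med, P_hi = np.percentile(samples[:, 3], [16, 50, 84])
print("P = {0:.2f} +{1:.2f} -{2:.2f} days".format(P_med, P_hi - P_med, P_med - P_lo))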

Well, I’m not getting *exactly* the same rotation period as in the paper (they’re consistent), but then again I’m also not using as much input data. Maybe that’s the reason for the discrepancy, maybe not, but this is already a loooong post so…

