Friends of Fiends

It’s not a typo. Well it is, but this time it’s deliberate. I mistype friends-of-friends so often that I’ve decided to just give in and call my algorithm “friends-of-fiends” (FOF) instead.

Problem statement:  Student X is analysing data towards a galaxy cluster that doesn’t have a known redshift, i.e. a known distance away from us. How do we use existing optical data to establish this value?

Solution: The SDSS survey has provided a fantastic database of individual galaxies with redshift information. If we can identify which of these galaxies are in our cluster then we will know the redshift.

The only thing standing in our way is working out how to identify which galaxies are in our cluster.

When I was an undergraduate I did a project using a FOF algorithm to work out which galaxies were in the Coma cluster. It’s been a while, but I decided to take a little trip down memory lane and revisit the method.

Getting Data

You can query the SDSS database server using SQL. For example, if we wanted to extract the Right Ascension, Declination and redshift for all the galaxies within 20 arcminutes of the centre of the well studied galaxy cluster Abell 795, we could use the query:

-- This query does a table JOIN between the galaxy catalogue (Galaxy) and
-- photometric redshift (Photoz) tables to extract all objects that are in both
-- catalogues in a radius of 20 arcminutes around position RA:141.010 Dec:14.168
-- with redshifts between z=0 and z=0.6 and
-- with a photometric redshift classification between -1 and +3
SELECT gal.ra, gal.dec, pz.z, pz.zErr
from Galaxy as gal, Photoz as pz,
dbo.fGetNearbyObjEq(141.010,14.168,20) as N
where N.objID =gal.objID and gal.objID=pz.objID
and pz.z>=0.0 and pz.z<=0.6
and photoerrorclass<=3 and photoerrorclass>=-1

We can store the output in a simple text file (mine is called A795_sdss.dat) and then analyse it in Python.


First we’ll need some libraries…

import numpy as np              # for array manipulation
import matplotlib.pyplot as pl  # for plotting stuff
from mayavi import mlab         # optional
from scipy import stats         # optional

To read the data we extracted from the SDSS database, something along these lines is probably ok:

def read_sdss(filename):

    datafile = open(filename,'r')

    while True:
        if not line: break

        # remove header info:
        if line[0]!='#':

            # ignore objects with no redshift info:
            if items[3]!='-9999':


    sdss_data = np.vstack((srcid,ra,dec,z))

    return sdss_data

To run this on our output data file:

sdss_data = read_sdss("A795_sdss.dat")
print "Number of objects: ",sdss_data.shape[1]

This should tell you that there are 1913 objects in our catalogue.

Viewing Data

There are a bunch of different ways that you could view these data in Python. Here I’ve chosen to play with the mayavi library. (Here are my notes on installing mayavi.)

xyz = sdss_data

x = sdss_data[1,:]
y = sdss_data[2,:]
z = sdss_data[3,:]

# find the local density of points:
kde = stats.gaussian_kde(xyz)
density = kde(xyz)

# Plot 3D scatter with mayavi
figure = mlab.figure('DensityPlot')

# use the local density as a colour scale:
pts = mlab.points3d(x, y, z, density, scale_mode='none', scale_factor=0.01)




This is all very well, but doesn’t really tell us which galaxies are associated into a particular group. That’s where FOF comes in.

FOF uses a quantity called a linking length to determine if objects are associated. It’s not specific to galaxies. There is a Python library available called pyfof which allows you to run a basic FOF algorithm, but it wasn’t suitable for this application. (However, if you want to try it, you might be interested in my notes on installing pyfof.)

For a set of data FOF takes an initial data point and then looks to see if there are other data points within a specified distance (linking length) of that initial point. In Step 1 above the red datum is the initial data point and the dashed circle represents the area enclosed by a particular radial linking length. The two blue data points are within the linking length, so they are considered to be associated to the initial point. We then repeat the process with the two blue points. Any other data point that is within a linking length distance of those points is also associated into the same group (coloured red), and so on (Step 2). Eventually we’ll reach a point where there are no further data points within a linking length of any group members (Step 3). We could then re-start the process using the black data point, in order to find a second group.

Real Data

For the galaxies in our SDSS catalogue we have three axes, and a further problem is that separations in each dimension do not map linearly to each other because of (1) projection in astronomical co-ordinate systems & (2) the fact that the third axis is really a velocity rather than a distance. So the single linking length as described above – which is implemented in pyfof – won’t cut it for these data. Instead we need to define a 3-dimensional linking volume.

Linking Volumes

Fortunately someone has already done the work for us on this. In fact several people – and they don’t always agree. There are a range of different linking lengths reviewed here, but I’m choosing to use the value from this paper, which I’ve tested out on SDSS data for a handful of my favourite galaxy clusters and seems to work pretty reliably.

This linking length is redshift dependent, so it varies depending on where you are in your data volume:

d_{\rm LL} = 0.25\,\left[1 + \arctan{(z/0.05)} \right]~h^{-1}\,{\rm Mpc}

This value, d_{\rm LL}, represents the mean diameter of a cylinder containing a galaxy and it’s nearest neighbour:

Left: Volume view showing radial offset. Right: Plane of sky view showing transverse offset.

It’s described in more detail here. However, it’s unclear to me from the text of the paper whether we should consider galaxies within the linking length or within half the linking length as being associated. I’ve gone for within a full linking length based on this comment:

“The smallest cylinder that contains two galaxies defines the minimum distance between the two galaxies”

and the fact that the paper doesn’t state that the initial data point / galaxy must be at the centre of the cylinder face. You might want to add in a factor of two if you disagree.

Coding it up in Python looks something like this:

def calc_tempel(sdss1):

    Average distance to nearest neighbour
    in units of h^-1 Mpc

    sdss1 is an array [srcid, ra, dec, z]

    # Tempel et al. 2012:
    d_ll0 = 0.25   # h^-1 Mpc
    a_star = 1.0
    z_star = 0.05

    # extract redshift of galaxy:
    z = sdss1[3]

    # calculate expected average nearest neighbour distance
    # at this redshift:
    d_ll = d_ll0*(1 + a_star*np.arctan(z/z_star))

    return d_ll

OK, now we need to go through all the pairs of galaxies in our catalogue and check whether they are within our linking volume. To do this we need to calculate their transverse (in the plane of the sky) separation and their radial (along the line of sight) separation.

Because of the equatorial projection we need to calculate the separation in Right Ascension as

\Delta RA_{ij} = \cos{\delta_i} \left(RA_i - RA_j \right)

and the total separation as

\Delta \theta_{ij} = \sqrt{(\Delta RA_{ij})^2+(\Delta \delta_{ij})^2}.

The physical separation is then given by \Delta \theta_{ij} \times D_{\rm A}, where D_{\rm A} is the cosmological angular diameter distance, which is a function of redshift.

Or if we want to be super fancy we can use the expression from this paper:

d_{\rm \perp} = \frac{c}{H_0}(z_i + z_j) \sin\left( \frac{\Delta \theta_{ij}}{2} \right).

Here (and everywhere else) I’m going to take:

\frac{c}{H_0} = 3000\,h^{-1}\,{\rm Mpc},

which is the Hubble distance.

Coding this up in Python looks like this:

def calc_trans(sdss1,sdss2):

    returns plane of sky separation
    in units of h^-1 Mpc

    sdss1 is an array [srcid, ra, dec, z]

    const_cH0 = 3e3  # c/H0: h^-1 Mpc
    deg2rad = np.pi/180.

    z1 = sdss1[3]; z2 = sdss2[3]
    delta_dec = sdss1[2] - sdss2[2]
    delta_ra = np.cos(sdss1[2]*deg2rad)*(sdss1[1] - sdss2[1])
    delta_theta = np.sqrt(delta_dec**2+delta_ra**2)

    sep_trans = (const_cH0)*(z1+z2)*np.sin(0.5*delta_theta*deg2rad)

    return sep_trans

In the radial (line of sight) direction, we can use a simple linear approximation to cosmological distance at low (<0.5) redshifts:

d_{\rm los} = \frac{c}{H_0}z

def calc_radial(sdss1,sdss2):

    returns line of sight separation
    in units of h^-1 Mpc

    sdss1 is an array [srcid, ra, dec, z]

    const_cH0 = 3e3 # c/H0: h^-1 Mpc    

    z1 = sdss1[3]; z2 = sdss2[3]

    sep_radial = (const_cH0)*np.abs(z1-z2)

    return sep_radial

Putting it Together

Right, now we’ve got all the pieces, let’s put them to use. I’m not going to split this up much but hopefully the comments make sense.

The one key thing to notice is the different linking length constraint in the transverse (\perp; plane of sky) dimension and in the radial (line of sight) dimension. Due to an effect known as the finger of god, groups and clusters of galaxies are elongated in the radial direction. Exactly what the proportionality of this elongation is is a matter of debate, but I’m using a pretty standard value of 10, i.e. they are elongated in the radial direction 10 times more than in the transverse direction.

# don't overwrite the input data...
tmp_sdss = sdss_data

# initialise a list for the output
# and start a counter for the number of groups found:
grp = 0

# initialise the maximum number of galaxies found in a group:
n_max = 0

# set constraints in terms of linking length
# 1 for the transverse direction; 10 for the radial direction:
b_trans = 1.
b_los = 10.

# start the loop:
while True:

    # check we haven't got zero data:
    if (tmp_sdss.shape[1]<1): break

    # pick an initial data point:
    sdss1 = tmp_sdss[:,0]

    # calculate the separation of the initial point
    # from every other data point in both dimensions:
    sep_sky = calc_trans(sdss1,tmp_sdss)
    sep_los = calc_radial(sdss1,tmp_sdss)

    # find the linking length at the redshift of
    # the initial data point:
    d_ll_sky = calc_tempel(sdss1)
    d_ll_los = d_ll_sky

    # normalise the separation:
    tmp_sep1 = sep_sky/d_ll_sky
    tmp_sep2 = sep_los/d_ll_los

    # identify data points within the linking length:
    grp_indices = np.where( np.logical_and( tmp_sep1<=b_trans, tmp_sep2<=b_los))     # identify data points outside the linking length:     other_indices = np.where( np.logical_or( tmp_sep1>b_trans, tmp_sep2>b_los) )

    # put the group members in their own array:
    tmp_grp = np.squeeze(tmp_sdss[:,grp_indices],axis=(1,))
    # remove the group members from the rest of the data:
    tmp_sdss = np.squeeze(tmp_sdss[:,other_indices],axis=(1,))

    # initialise this new group:
    gpt = 0
    group = tmp_grp 

    # loop through the linking length catchment of all
    # group members:
    while True:

        # how many group members initially:
        n_mem = group.shape[1]

        # step through them all updating the group:
        for gal in range(gpt,n_mem):

            sep_sky = calc_trans(group[:,gal],tmp_sdss)
            sep_los = calc_radial(group[:,gal],tmp_sdss)

            d_ll_sky = calc_tempel(group[:,gal])
            d_ll_los = d_ll_sky

            tmp_sep1 = sep_sky/d_ll_sky
            tmp_sep2 = sep_los/d_ll_los

            grp_indices = np.where( np.logical_and( tmp_sep1<=b_trans, tmp_sep2<=b_los) )             other_indices = np.where( np.logical_or( tmp_sep1>b_trans, tmp_sep2>b_los) )

            tmp_grp = np.squeeze(tmp_sdss[:,grp_indices],axis=(1,))
            tmp_sdss = np.squeeze(tmp_sdss[:,other_indices],axis=(1,))

            if (tmp_grp.shape[1]>0):

                #if there are more updates add them in:
                group = np.hstack((group,tmp_grp))

                #if there are no new members do nothing:

        if (group.shape[1]>n_mem):
            # if we've added new members then we need to check
            # their catchment volumes too:
            gpt = n_mem
            # if we've stopped adding new members then we can stop:
            if (group.shape[1]>10):
                # print out when we find a big group:
                print "Group ",grp," has", group.shape[1]," members"

            if (group.shape[1]>n_max):
                # check if this is the biggest group so far...
                n_max = group.shape[1]

            for i in range(0,group.shape[1]):
                # add the completed group into our master list
                # of groups:

            # update the group counter:

print "There are ",grp," groups"
print "The largest group is ", big_grp," with ",n_max," members"

Let’s see where these galaxies are in our data:

a_grp = big_grp

ax = pl.subplot(111)



…and let’s calculate the mean position of the cluster in all dimensions:

ra_cen = np.mean(groups[np.where(groups[:,4]==a_grp),1])
dec_cen = np.mean(groups[np.where(groups[:,4]==a_grp),2])
z_cen = np.mean(groups[np.where(groups[:,4]==a_grp),3])

print "Mean redshift: ",z_cen

Mean redshift: 0.142555448718

So how did we do? Well, the galaxy population of Abell 795 has been studied in detail here. In this paper they use a more accurate method of obtaining redshifts (spectroscopic) than we have in the survey data (photometric).

They found a redshift for Abell 795 of z = 0.1374, so we’re pretty close.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: