HTRU1 – Creating the PyTorch Dataset

In my previous post I talked about how to use random forest classification to separate true pulsar candidates from RFI. That classification used numerical features extracted from the processed data.

Ultimately it would be interesting to just be able to use the data itself, rather than extracted features. To help me do that I’ve created a PyTorch torchvision dataset in the same format as the well-known CIFAR10 dataset, so that I can easily load and manipulate those data for use with image-based classifiers (e.g. CNNs).

Here I’m going to explain how I did that, just in case anyone else wants to do the same. If you want to use my dataset you can find it here.

The Set-Up

I have two classes: pulsar and non-pulsar. I’ve got 1196 PNG files of pulsar data and 58804 PNG files of non-pulsar data, which I’ve split between two different directories.

Just like CIFAR10 I’m specifying that I want 6 batches of data: 5 for training and 1 for testing, each of which will contain 10000 samples. So I’ll have 60000 samples in total.

The data array for each PNG is a flattened array of length npix x npix x nchan, where npix is the number of pixels on the side of an image (my images are 32 x 32) and nchan is the number of channels in the image. For an RGB file this is 3.
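To make that layout concrete, here is a small sketch of the flattening and how to undo it. The image is a synthetic stand-in rather than a real candidate, and the variable names are only for illustration.

```python
import numpy as np

npix, nchan = 32, 3

# Synthetic stand-in for np.array(Image.open(...)): shape (npix, npix, nchan).
im = (np.arange(npix * npix * nchan) % 256).astype(np.uint8)
im = im.reshape(npix, npix, nchan)

# Channel-first flattening: all red pixels, then all green, then all blue.
flat = im.transpose(2, 0, 1).reshape(-1)

# Undo it to get back an (npix, npix, nchan) image, e.g. for plotting.
restored = flat.reshape(nchan, npix, npix).transpose(1, 2, 0)
```

The first 1024 entries of flat are the red channel read row by row, which is the ordering CIFAR10 uses.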

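# ------------------------------------
# Imports used by the snippets in this post:

import io
import os
import pickle

import numpy as np
from PIL import Image

# ------------------------------------
# User defined variables: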

class0dir = "/path/to/HTRU1/PYTORCH/JPG/pulsar/"
class1dir = "/path/to/HTRU1/PYTORCH/JPG/nonpulsar/"

bindir = "/path/to/HTRU1/PYTORCH/JPG/htru1-batches-py/"

# this number should include the training batches PLUS the test batch:
nbatch = 6

# total number of samples per batch:
pbatch = 10000

# label names:
label_names = ['pulsar','nonpulsar']

# length of data arrays [npix x npix x rgb = 32 x 32 x 3 = 3072]
nvis = 3072

I’m using the os.listdir function to get a list of all the filenames in each class directory:

# ------------------------------------
# get lists of files for different classes:

cl0_files = np.array(os.listdir(class0dir))
cl1_files = np.array(os.listdir(class1dir))

On macOS that list will also contain the hidden .DS_Store folder attributes file, which we don’t want to pass to the classifier, so let’s remove it from the list:

# ------------------------------------
# remove folder attributes file from list:

# a boolean mask keeps every entry except .DS_Store
# (and is a no-op if the file isn't there):
cl0_files = cl0_files[cl0_files != '.DS_Store']
cl1_files = cl1_files[cl1_files != '.DS_Store']
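An alternative to removing unwanted entries one at a time is to keep only the files with the extension you expect. This is a sketch under that assumption; list_image_files is a hypothetical helper, not part of the original script, and the temporary directory just stands in for a class directory.

```python
import os
import tempfile


def list_image_files(dirname, extensions=(".png",)):
    """Return the sorted filenames in dirname that end with one of extensions."""
    return sorted(
        name for name in os.listdir(dirname)
        if name.lower().endswith(extensions)
    )


# Demonstrate on a throwaway directory containing two images and two stray files.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ("cand_b.png", "cand_a.png", ".DS_Store", "notes.txt"):
        open(os.path.join(tmpdir, name), "w").close()
    found = list_image_files(tmpdir)
```

Sorting the names is optional, but os.listdir returns entries in arbitrary order, so sorting makes the batch contents reproducible between runs.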

Now a quick check to see how many samples/files we have in each class:

# ------------------------------------
# count up number of files:

n_cl0 = len(cl0_files)
n_cl1 = len(cl1_files)

We’re going to use this to make sure that we have sufficient data available to fill all the batches we’ve specified:

# ------------------------------------
# check that there are enough samples for the batches:

assert ((n_cl0+n_cl1)>=(pbatch*nbatch)),'Not enough samples available to fill '+str(nbatch)+' batches of '+str(pbatch)

I actually have way more data samples than I need to fill all the batches, but I also have a big class imbalance in those data. I want to make sure that my batched dataset includes all the examples of the minority class and I’ll fill up the remaining slots with the majority class:

# ------------------------------------
# check for the minority class and make sure all the data is included:

if (n_cl1>n_cl0):
    n_batch_cl0 = int(np.floor(n_cl0/nbatch))
    n_batch_cl1 = pbatch - n_batch_cl0
else:
    n_batch_cl1 = int(np.floor(n_cl1/nbatch))
    n_batch_cl0 = pbatch - n_batch_cl1
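As a worked example of that split, the same arithmetic can be wrapped in a function and run with the file counts quoted in the set-up. The function name per_batch_counts is only illustrative.

```python
def per_batch_counts(n_cl0, n_cl1, nbatch, pbatch):
    """Spread the minority class evenly over the batches and fill
    the rest of each batch with the majority class."""
    if n_cl1 > n_cl0:
        n_batch_cl0 = n_cl0 // nbatch
        n_batch_cl1 = pbatch - n_batch_cl0
    else:
        n_batch_cl1 = n_cl1 // nbatch
        n_batch_cl0 = pbatch - n_batch_cl1
    return n_batch_cl0, n_batch_cl1


nbatch, pbatch = 6, 10000
n_batch_cl0, n_batch_cl1 = per_batch_counts(1196, 58804, nbatch, pbatch)

# Total number of files each class contributes across all batches.
used_cl0 = nbatch * n_batch_cl0
used_cl1 = nbatch * n_batch_cl1
```

With these inputs each batch holds 199 pulsars and 9801 non-pulsars. Rounding down means 6 x 199 = 1194 pulsar files are used, so two are left out, and the batches draw 6 x 9801 = 58806 non-pulsar files between them, so it is worth confirming that the majority-class directory holds at least that many before running the loop.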

We can then loop through and fill the separate batches. I’m randomising the samples in each batch using a little function I wrote called randomise_by_index. This isn’t strictly necessary, because you can shuffle the order using the PyTorch DataLoader when you read the data into your classifier.
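The randomise_by_index function itself isn’t shown in this post, so the following is only a guess at what a helper like that might look like: it assumes the function reorders a list according to a list of indices, and that the indices are shuffled once and reused so that labels, data and filenames stay aligned. The toy lists and the fixed seed are just for illustration.

```python
import random


def randomise_by_index(values, idx_list):
    """Return a new list whose k-th element is values[idx_list[k]]."""
    return [values[i] for i in idx_list]


# Shuffle the indices once, then apply the same order to every list.
idx_list = list(range(5))
random.Random(0).shuffle(idx_list)

labels = [0, 0, 1, 1, 1]
filenames = ["p0.png", "p1.png", "n0.png", "n1.png", "n2.png"]

shuffled_labels = randomise_by_index(labels, idx_list)
shuffled_filenames = randomise_by_index(filenames, idx_list)
```

Because both lists are reordered with the same idx_list, each filename still sits at the same position as its label after shuffling.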

# ------------------------------------
# loop through and fill the batches:

for batch in range(nbatch):

    if (batch==(nbatch-1)):
        # the last batch is the test batch:
        oname = "test_batch"
        batch_label = 'testing batch 1 of 1'
    else:
        # everything else is a training batch:
        oname = "data_batch_"+str(batch+1)
        batch_label = 'training batch '+str(batch+1)+' of '+str(nbatch-1)

    # create empty arrays for the batches:
    labels=[];filedata=[];data=[];filenames=[]

    # get pulsars:
    for i in range(int(batch*n_batch_cl0),int((batch+1)*n_batch_cl0)):

        filename = cl0_files[i]
        filenames.append(filename)

        labels.append(0)

        im = Image.open(class0dir+filename)
        im = (np.array(im))
        r = im[:,:,0].flatten()
        g = im[:,:,1].flatten()
        b = im[:,:,2].flatten()
        filedata = np.array(list(r) + list(g) + list(b),np.uint8)
        data.append(filedata)

    # get non-pulsars:
    for i in range(int(batch*n_batch_cl1),int((batch+1)*n_batch_cl1)):

        filename = cl1_files[i]
        filenames.append(filename)

        labels.append(1)

        im = Image.open(class1dir+filename)
        im = (np.array(im))
        r = im[:,:,0].flatten()
        g = im[:,:,1].flatten()
        b = im[:,:,2].flatten()
        filedata = np.array(list(r) + list(g) + list(b),np.uint8)
        data.append(filedata)

    # randomise data in batch:
    idx_list = range(0,pbatch)
    labels = randomise_by_index(labels,idx_list)
    data = randomise_by_index(data,idx_list)
    filenames = randomise_by_index(filenames,idx_list)

    # create dictionary of batch:
    # create dictionary of batch (named so it doesn't shadow the built-in dict):
    batch_dict = {
            'batch_label':batch_label,
            'labels':labels,
            'data':data,
            'filenames':filenames
            }

    # write pickled output:
    with io.open(bindir+oname, 'wb') as f:
        pickle.dump(batch_dict, f)

# end batch loop
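Before packaging anything up, it can be worth reading a batch back in to check its contents. The sketch below writes a tiny stand-in batch with the same four keys to a temporary directory (so it runs without the real files), unpickles it, and stacks the flattened rows into an (N, 3, 32, 32) array, which is the shape a CIFAR10-style loader works with.

```python
import os
import pickle
import tempfile

import numpy as np

# Tiny stand-in batch: four "images", each a flat array of 3072 values.
nvis = 3072
example_batch = {
    'batch_label': 'training batch 1 of 5',
    'labels': [0, 1, 1, 0],
    'data': [np.full(nvis, i, dtype=np.uint8) for i in range(4)],
    'filenames': ['a.png', 'b.png', 'c.png', 'd.png'],
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data_batch_1')
    with open(path, 'wb') as f:
        pickle.dump(example_batch, f)
    with open(path, 'rb') as f:
        entry = pickle.load(f)

# Stack the flat rows and reshape to (N, channels, height, width).
images = np.vstack(entry['data']).reshape(-1, 3, 32, 32)
```

For a real batch, checking that images.shape[0] matches len(entry['labels']) and the expected samples per batch is a quick way to catch a miscounted class.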

Once this has run, you’ll need to tar and gzip the output directory:

> tar -cvzf htru1-batches-py.tar.gz htru1-batches-py

You’ll then need to put the compressed file somewhere that it can be publicly accessed online and make a note of the URL.

Making a torchvision class

Because the dataset is in the same format as the CIFAR10 dataset, I can create a dataset class that mimics torchvision.datasets.CIFAR10. The only thing we need to change is the set of class attributes at the top of the class definition:


base_folder = 'htru1-batches-py'
url = "http://www.mywebsite.co.uk/path/to/htru1-batches-py.tar.gz"
filename = "htru1-batches-py.tar.gz"
tgz_md5 = 'b0669225acbe64b04912e6c7cc0d32a4'
train_list = [
        ['data_batch_1', '69274724bcae2ae4d3a6282085cfb90d'],
        ['data_batch_2', 'ab2c2a2e4d581699acba662d40b8e8f3'],
        ['data_batch_3', 'f3f43b3a3409ebf3fbdda7bc38a1f63b'],
        ['data_batch_4', '4670b7c26353669ee8db746b968878a5'],
        ['data_batch_5', '25cda46c373001e9c68e1a7d77ed34ef'],
             ]

test_list = [
        ['test_batch', '1ee87ef6a9379ec544b57f2d890d5132'],
            ]

meta = {
        'filename': 'batches.meta',
        'key': 'label_names',
        'md5': '869286cd7ec19f56632e7d6a11742248',
       }
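The meta entry above points at a batches.meta file, which isn’t produced by the batch loop earlier in this post. A minimal sketch of writing one is below, assuming it should follow the CIFAR10 layout of a pickled dictionary; only the 'label_names' key is named in the class attributes, and the other two keys are my assumption based on the CIFAR10 file. The temporary directory stands in for bindir, and the file would need to be in the batch directory before it is tarred.

```python
import os
import pickle
import tempfile

# Values taken from the user-defined variables in the set-up.
label_names = ['pulsar', 'nonpulsar']
pbatch = 10000
nvis = 3072

# 'label_names' is the key the dataset class looks up;
# the other two entries mirror CIFAR10 and are an assumption here.
meta_dict = {
    'label_names': label_names,
    'num_cases_per_batch': pbatch,
    'num_vis': nvis,
}

# A temporary directory stands in for the real batch directory.
with tempfile.TemporaryDirectory() as tmp_bindir:
    meta_path = os.path.join(tmp_bindir, 'batches.meta')
    with open(meta_path, 'wb') as f:
        pickle.dump(meta_dict, f)
    with open(meta_path, 'rb') as f:
        loaded_meta = pickle.load(f)
```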

You can see my full HTRU1 class file here.

The long numbers after the file names are the MD5 checksums. They are there to verify that the data files have been downloaded correctly and haven’t been corrupted. You can generate these on the command line using md5sum (on macOS the equivalent command is md5), e.g.:

> md5sum htru1-batches-py.tar.gz
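If you’d rather stay in Python, the same checksum can be computed with the standard hashlib module. This is a small sketch; md5_of_file is an illustrative name, and the temporary file stands in for a real batch file or tarball.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


# Demonstrate on a small throwaway file.
with tempfile.TemporaryDirectory() as tmpdir:
    example_path = os.path.join(tmpdir, 'example.bin')
    with open(example_path, 'wb') as f:
        f.write(b'hello')
    checksum = md5_of_file(example_path)
```

Reading in chunks keeps memory use low for large tarballs. The resulting string is what goes next to each file name in train_list, test_list and meta, and in tgz_md5 for the archive itself.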

…and that’s it. You should now be able to import the new dataset directly into a PyTorch classifier. For HTRU1 you can find the instructions and an example classification script on GitHub here.
