Night at the Museum: Translation in Python

I love Taipei. I also love Open Data. So I was very happy to read that the National Palace Museum in Taipei had an open data project.

According to the article, the museum has put images and meta-data for 70,000 items online. So what do you get if you download the information on a particular item?

In the Box

For most items the download is a directory for the catalogue item, which contains four files:

1. A .jpg image file
2. A .txt meta-data file
3. photothumb.db
4. Thumbs.db

The last two of these are weird Windows things: leftover thumbnail cache files. I’m going to ignore them and just discuss the first two.
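If you end up downloading a lot of items, skipping the Windows leftovers is easy to automate. Here's a minimal sketch in Python 3. The function name pick_item_files and the example .jpg file name are my own inventions for illustration, and it assumes each directory only holds the kinds of files listed above.

```python
import os

def pick_item_files(filenames):
    """Split a directory listing into image files and meta-data files.

    Anything else (e.g. photothumb.db, Thumbs.db) is ignored.
    """
    images, metadata = [], []
    for name in sorted(filenames):
        ext = os.path.splitext(name)[1].lower()
        if ext in (".jpg", ".jpeg"):
            images.append(name)
        elif ext in (".txt", ".json"):
            metadata.append(name)
    return images, metadata

# Hypothetical listing; for a real download use os.listdir(path_to_directory).
listing = ["K1D001175N000000000PAB.jpg", "K1D001175N000000000.txt",
           "photothumb.db", "Thumbs.db"]
images, metadata = pick_item_files(listing)
print(images, metadata)
```

Matching on the file extension rather than on the exact names means the same function should cope with the JSON variant of the download too.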

[Note: for some items the download is a directory containing a JSON meta-data file and an image instead of a .txt file and an image. The same basic principles apply as below.]

The .jpg file should be obvious – it’s an image of the catalogue item. It will look something like this:

[Image: K1D001175N000000000PAB. The Collection of the National Palace Museum]

The .txt file is the meta-data to go with the image. If you open it up in a text editor, the contents will look something like this:


品名
黑陶繭式壺
朝代
西漢 206B.C.E. ~A.D.9
作者

尺寸
高32公分,腹圍98.8公分


Ah ha.

Well, if you’re as useless as me at languages this is probably also a bit of a hurdle for you. Fortunately Python (and the internet) are pretty good at languages.

Reading the Meta-data

First things first though, we’re not only talking about human languages here. We also have to deal with the encoding of the Chinese characters.

The characters in this file are encoded in UTF-16. So if we just use open() to read the file we’ll get garbled crap. Instead we can use the codecs library to open the file.

It’s part of the Python standard library, so there’s nothing to install. We just need to tell it to read the UTF-16 encoded characters:

import codecs

filename = './5-03/K1D001175N000000000.txt'
ifile = codecs.open(filename,'r','utf-16')
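In Python 3 the built-in open() takes an encoding argument, so open(filename, encoding='utf-16') does the same job. The sketch below is written for Python 3 and works on in-memory bytes so that it runs without the download. It also pairs each label with its value, on the assumption that the file strictly alternates label and value lines (with a blank line where no value is recorded), which is what the sample above suggests. The name parse_metadata is hypothetical.

```python
from itertools import zip_longest

def parse_metadata(raw_bytes, encoding="utf-16"):
    """Decode the raw file contents and pair each label with the line after it."""
    text = raw_bytes.decode(encoding)
    lines = [line.strip() for line in text.splitlines()]
    # Drop blank lines at the start and end only; a blank line in the middle
    # is an empty value (e.g. no recorded author) and must be kept.
    while lines and not lines[0]:
        lines.pop(0)
    while lines and not lines[-1]:
        lines.pop()
    # Assumption: lines alternate label, value, label, value, ...
    # A trailing label with no value is paired with an empty string.
    return list(zip_longest(lines[0::2], lines[1::2], fillvalue=""))

# Stand-in for the bytes of the .txt file, built from the sample shown earlier.
sample = "品名\n黑陶繭式壺\n朝代\n西漢 206B.C.E. ~A.D.9\n作者\n\n尺寸\n高32公分,腹圍98.8公分\n"
raw = sample.encode("utf-16")

for label, value in parse_metadata(raw):
    print(label, "->", value)

# Decoding with the wrong codec is what produces the garbled output.
print(repr(raw.decode("latin-1")[:8]))
```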

But that only gets me part of the way. I now need to translate the Chinese into English, and when I want to translate things I generally turn to Google. Unfortunately there is no official free API for Google Translate, Python or otherwise. But that doesn’t mean that people haven’t found workarounds.

Translation in Python

I’m using the mtranslate library. It’s basically a wrapper around urllib that queries the Google Translate web form, and it seems to work really well. It’s also pip installable.

It has one function, translate(), which takes three arguments: (text_to_translate, to_language, from_language). Either the “to” language or the “from” language can be set to “auto”: for the “to” language this defaults to English, while for the “from” language it detects the language automatically. Otherwise you need to specify them using short language codes, mostly two-letter abbreviations, e.g. “en” = English, “fr” = French, “es” = Spanish. Simplified Chinese has the code "zh-CN", and there’s a full list of abbreviations here.
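Since every call goes off to Google, it seems polite to avoid asking for the same thing twice. Here's a rough Python 3 sketch of one way to do that: a small helper that takes the translation function as a parameter, skips blank lines and remembers results it has already seen. The names translate_lines and fake_translate are hypothetical; the fake is only there so the sketch runs offline, and I'm assuming that with mtranslate installed you could pass its translate() function in instead.

```python
def translate_lines(lines, translate_fn, to_language="en", from_language="auto"):
    """Translate each non-blank line, querying only once per distinct text."""
    cache = {}
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            # Nothing to translate; keep the blank so the output stays aligned.
            results.append((text, ""))
            continue
        if text not in cache:
            # Same argument order as described above: text, "to", "from".
            cache[text] = translate_fn(text, to_language, from_language)
        results.append((text, cache[text]))
    return results

# Hypothetical stand-in so the sketch runs without a network connection.
calls = []
def fake_translate(text, to_language, from_language):
    calls.append(text)
    return "[%s->%s] %s" % (from_language, to_language, text)

for original, translated in translate_lines(["品名", "", "品名"], fake_translate):
    print(original, translated)
```

Labels like 品名 turn up in every meta-data file, so the cache should pay off quickly once you translate more than one item.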

The only thing to be careful of is that it requires the input to be UTF-8 encoded, so when we call the translate() function we need to re-encode the data from the meta-data file. Running it should look something like this:

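import mtranslate as mt

# Translate the file line by line, printing each line and its translation
for line in ifile:
    # mtranslate expects UTF-8 encoded input, so re-encode each line
    line_en = mt.translate(line.encode('utf-8'), "en", "auto")
    print line, line_en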


品名
Name
黑陶繭式壺
Black pot cocoon pot
朝代
dynasty
西漢 206B.C.E. ~A.D.9
Western Han Dynasty 206B.C.E. ~ A.D.9
作者
The author

尺寸
size
高32公分,腹圍98.8公分
High 32 cm, abdominal circumference 98.8 cm


Simples.
