According to the article, the museum has put images and meta-data for 70,000 items online. So what do you get if you download the information on a particular item?
In the Box
For most items the download is a directory for the catalogue item, which contains four files:
1. A .jpg image file
2. A .txt meta-data file
The last two of these are weird Windows things associated with webpage thumbnail images. I’m going to ignore them for now and just discuss the first two.
[Note: for some items the download is a directory containing a JSON meta-data file and an image instead of a .txt file and an image. The same basic principles apply as below.]
The .jpg file should be obvious – it’s an image of the catalogue item. It will look something like this:
The .txt file is the meta-data to go with the image. If you open it up in a text editor, the contents will look something like this:
西漢 206B.C.E. ~A.D.9
Well, if you’re as useless as me at languages this is probably also a bit of a hurdle for you. Fortunately Python (and the internet) are pretty good at languages.
Reading the Meta-data
First things first though, we’re not only talking about human languages here. We also have to deal with the encoding of the Chinese characters.
The characters in this file are encoded in UTF-16, so if we just use open() to read the file we’ll get garbled crap. Instead we can use the codecs library to open the file. It’s part of the Python standard library, so there’s nothing to install, and we can tell it to read the UTF-16 encoded characters:
import codecs

filename = './5-03/K1D001175N000000000.txt'
ifile = codecs.open(filename, 'r', 'utf-16')
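On Python 3 the same read can be done without codecs, because the built-in open() accepts an encoding argument. The following is only a sketch: the temporary file name and its one-line content are made up so that the example runs on its own, and the only detail taken from the museum download is the UTF-16 encoding.

```python
import os
import tempfile


def read_utf16_lines(path):
    """Return the non-empty lines of a UTF-16 encoded text file."""
    with open(path, "r", encoding="utf-16") as f:
        return [line.strip() for line in f if line.strip()]


# Build a throwaway UTF-16 file (hypothetical name and content) so the
# sketch does not depend on having a real catalogue download to hand.
sample = "西漢 206B.C.E. ~A.D.9\n"
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_metadata.txt")
with open(tmp_path, "w", encoding="utf-16") as f:
    f.write(sample)

lines = read_utf16_lines(tmp_path)
print(len(lines), "line(s) read")
```

Pointing read_utf16_lines() at a real meta-data file should give a list of Unicode strings, one per non-empty line.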
But that only gets me part of the way. I now need to translate the Chinese into English, and when I want to translate things I generally turn to Google. Unfortunately there is no official free API for Google Translate, Python or otherwise. But that doesn’t mean that people haven’t found workarounds.
Translation in Python
I’m using the mtranslate library. It’s basically a wrapper around urllib to query the Google Translate web form and it seems to work really well. It’s pip installable.
It has one function, translate(), which takes three arguments: (text_to_translate, to_language, from_language). Either the “to” language or the “from” language can be set to “auto”: this defaults to English for the “to” language and automatically detects the “from” language. Otherwise you need to specify them using two-letter abbreviations, e.g. “en” = English, “fr” = French, “es” = Spanish. Simplified Chinese has the code "zh-CN", and there’s a full list of abbreviations here.
The only thing to be careful of is that it requires the input to be UTF-8 encoded, so when we call the translate() function we need to re-encode the data from the meta-data file. Running it should look something like this:
import mtranslate as mt

while True:
    line = ifile.readline()
    if not line:
        break
    line_en = mt.translate(line.encode('utf-8'), "en", "auto")
    print line, line_en
Black pot cocoon pot
西漢 206B.C.E. ~A.D.9
Western Han Dynasty 206B.C.E. ~ A.D.9
High 32 cm, abdominal circumference 98.8 cm
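If you want to keep each original line next to its translation rather than just printing them, the loop can be wrapped in a small helper. This is a rough Python 3 sketch, not part of mtranslate: translate_lines() and fake_translator() are hypothetical names, and the stand-in translator exists only so the example runs offline. The helper assumes the translator takes arguments in the (text, to_language, from_language) order described above.

```python
def translate_lines(lines, translator, to_language="en", from_language="auto"):
    """Pair each non-empty line with its translation.

    `translator` is any callable taking (text, to_language, from_language).
    """
    pairs = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines rather than sending them off
        pairs.append((text, translator(text, to_language, from_language)))
    return pairs


# Stand-in translator so the sketch runs without a network connection.
# For real use, pass mtranslate's translate function instead (assumption:
# mtranslate is installed and the translation service is reachable).
def fake_translator(text, to_language, from_language):
    lookup = {"西漢 206B.C.E. ~A.D.9": "Western Han Dynasty 206B.C.E. ~ A.D.9"}
    return lookup.get(text, text)


pairs = translate_lines(["西漢 206B.C.E. ~A.D.9\n", "\n"], fake_translator)
for original, english in pairs:
    print(len(original), len(english))
```

Passing the translator in as an argument keeps the loop testable, and makes it easy to swap in a different service if the web form ever changes.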