Wherein I fail to make Python read medieval Chinese calligraphy correctly, but I learn a lot about Chinese calligraphy and optical character recognition.
It seems sensible that if we can read Chinese in our Jupyter notebook, we should try to translate the writing that’s actually on the National Palace Museum exhibits and not just limit ourselves to the meta-data information.
(If you’re already lost, take a look at my previous post to find out what’s going on.)
Extracting writing from images like this is known as Optical Character Recognition (OCR), and the most commonly used library for doing OCR is the tesseract library. [Insert Thor jokes here]
OCR works on the premise of matching the polygonal outlines of objects in an image to templates: if the object is a letter, it should match a letter template. This is obviously a bit limited because text can appear in many different forms (typed, handwritten, embossed, etc.). However, in many cases it works pretty well.
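To make the template idea concrete, here is a toy sketch of it. It is not how tesseract is implemented; the three tiny glyph grids and the function names are made up for illustration, and the "match" is nothing more than the fraction of pixels that agree.

```python
# Toy template matching: each glyph is a small grid of '#' (ink) and '.' (paper).
# The templates below are invented for this example.
TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "T": ["###",
          ".#.",
          ".#.",
          ".#.",
          ".#."],
}

def similarity(glyph, template):
    """Fraction of pixels on which the glyph and the template agree."""
    pairs = [(g, t) for grow, trow in zip(glyph, template)
             for g, t in zip(grow, trow)]
    return sum(g == t for g, t in pairs) / float(len(pairs))

def classify(glyph, templates=TEMPLATES):
    """Return the best-matching label and its similarity score."""
    scores = dict((label, similarity(glyph, t)) for label, t in templates.items())
    best = max(scores, key=scores.get)
    return best, scores[best]

# A slightly damaged "T": one ink pixel is missing from the top bar.
noisy_t = ["##.",
           ".#.",
           ".#.",
           ".#.",
           ".#."]
label, score = classify(noisy_t)
print("{} {:.2f}".format(label, score))
```

The damaged glyph still lands on "T" because it differs from that template by a single pixel. The same logic hints at why a brush-written character, which can differ from any printed template in most of its pixels, is a much harder case.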
Tesseract in Python
Tesseract itself is not a Python library, but there is a Python binding available – in fact there are several: pytesseract, pytesser, tesserocr …
For all of them you need to have the tesseract library itself installed already, as well as its dependencies (obviously).
I’m working on MacOSX so I’ll use MacPorts to install it:
Install the tesseract library:
port install tesseract
Install the English tesseract language data:
port install tesseract-eng
Install the simplified Chinese tesseract language data:
port install tesseract-chi-sim
Install the traditional Chinese tesseract language data:
port install tesseract-chi-tra
(You should also install whatever other language data you want. These are the ones I’ll be using here.)
Install the Python binding to tesseract (I’m using the tesserocr library):
pip install tesserocr
import tesserocr
# print tesseract-ocr version:
print tesserocr.tesseract_version()
libgif 4.2.3 : libjpeg 9b : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.1.2
# print tessdata path and list of available languages:
print tesserocr.get_languages()
(u'/opt/local/share/tessdata/', [u'chi_sim', u'chi_tra', u'eng'])
The tessdata path in this output is pretty useful because if, like me, you’ve installed tesseract using MacPorts then the default path to the tessdata folder will not be correct. It looks for /usr/local/, so you need to specify the path when you initiate the API.
Getting going in Python
from tesserocr import PyTessBaseAPI
api = PyTessBaseAPI(path='/opt/local/share/')
To run tesseract you have to specify the image you want to run it on and then identify and extract text from it. Here’s my test image:
imagefile = 'ocr-test.png'
api.SetImageFile(imagefile)
print api.GetUTF8Text()
The result is good, but not perfect. (But then again we don’t have all the language data for this image.)
Note that for normal font sizes tesseract only really works effectively with images that are 300 dpi or more.
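As a rough sketch of what that resolution requirement means in pixels, the helper below works out the dimensions an image would need if it were resampled to 300 dpi while keeping the same physical size. The function name and the 360 x 144 example image are hypothetical; only the 300 dpi target comes from the note above.

```python
def upscale_size(width_px, height_px, current_dpi, target_dpi=300):
    """Pixel dimensions needed to keep the same physical size at target_dpi."""
    scale = float(target_dpi) / current_dpi
    new_width = int(round(width_px * scale))
    new_height = int(round(height_px * scale))
    return new_width, new_height, scale

# Hypothetical 360 x 144 pixel image saved at 72 dpi.
w, h, scale = upscale_size(360, 144, current_dpi=72)
print("{} x {} (scale factor {:.2f})".format(w, h, scale))
```

Resampling like this cannot add detail that was never captured; at best it brings the character size closer to what the engine expects.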
Reading other languages
What about reading Chinese? We need to alter the API call a little:
api = PyTessBaseAPI(path='/opt/local/share/',lang='eng+chi_sim+chi_tra')
For some reason we seem to need 'eng+chi_sim' specified; neither 'chi_sim+eng' nor just 'chi_sim' seems to work…
We will also need an input file, e.g. this random image of text that I pulled off the internet as a test:
imagefile = 'caritas.png'
api.SetImageFile(imagefile)
print api.GetUTF8Text()
明爱 (伦敦〉 学院 “英国中餐文化”
Well, that seems to have worked quite nicely.
For the non-Chinese speakers, i.e. me, we can then translate the text using the method I described in an earlier post:
import mtranslate as mt
line = api.GetUTF8Text()
line_en = mt.translate(line.encode('utf-8'),"en","auto")
print line,line_en
明爱 (伦敦〉 学院 “英国中餐文化”
Caritas (London) College "British Chinese Culture" Interview with Mr. Huang Shunyou
Reading Calligraphy from the Museum
The exhibit I’ve picked has this meta-data information:
import codecs
filename = './2-12/K2B000079N000000000.txt'
ifile = codecs.open(filename,'r','utf-16')
while True:
    line=ifile.readline()
    if not line: break
    line_en = mt.translate(line.encode('utf-8'),"en","auto")
    print line,line_en
Book Baotu Spring Poems
卷 紙本 縱：33.1公分 橫：83.3公分
Volume paper vertical: 33.1 cm horizontal: 83.3 cm
I’m going to extract an excerpt from the full image.
I admit – the first time I cut out an excerpt I messed up completely. I had to go and ask some visiting summer students from Beijing University to put me on the right track. First things first, calligraphy like this reads from right to left. The three characters on the far right of the image above are the name of the poem: Baotu Spring (趵突泉). The poem then progresses from right to left in phrases of seven characters. I’m using the first phrase as my excerpt.
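Since the reading order tripped me up, here is a small sketch of it. The layout is an assumption for illustration (one seven-character phrase per column, with the title on the far right), and `column_bounds` assumes equal-width columns, which real calligraphy will not have; both function names are made up.

```python
# -*- coding: utf-8 -*-

def reading_order(columns_left_to_right):
    """Reorder columns for reading: rightmost column first."""
    return list(reversed(columns_left_to_right))

def column_bounds(image_width, n_columns):
    """(left, right) pixel bounds of equal-width columns, rightmost first."""
    edges = [int(round(i * image_width / float(n_columns)))
             for i in range(n_columns + 1)]
    bounds = list(zip(edges[:-1], edges[1:]))
    return bounds[::-1]

# Columns as they sit on the page from left to right (assumed layout).
page = [u"灤水發源天下無", u"趵突泉"]
ordered = reading_order(page)
print(u" / ".join(ordered))

# Crop bounds for a hypothetical 1000-pixel-wide image with 4 columns.
print(column_bounds(1000, 4))
```

The first entry of `column_bounds` is the strip on the far right, which is where an excerpt should start.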
I cut out my excerpt using Preview. This kind of screenshot defaults to a resolution of 72 dpi, so I manually reset that to 300 dpi in Preview. Here’s my excerpt and the binarized version of it (see below):
A key difference between this new input image of text and the previous test case is that the excerpt is an RGB image, whereas what we really need is a binary (black/white) image. There are two steps we need to make in order to binarize the calligraphy:
1. Convert RGB to greyscale;
2. Threshold the greyscale.
from skimage import io as skio
from skimage import color, util
# specify the input image:
excerpt = 'excerpt.png'
# read the input image file:
img = skio.imread(excerpt)
# convert from RGB to greyscale (rgb2gray is the spelling that works across versions):
img = color.rgb2gray(img)
# invert the greyscale (optional):
img = util.invert(img)
Now we need to binarize the image with a threshold. For these data that’s quite straightforward:
import numpy as np
thresh = 0.5
# pixels above the threshold become 1, everything else becomes 0:
img = np.where(img > thresh, 1., 0.)
If we wanted or needed to do something more sophisticated, there are a bunch of different approaches to thresholding. The Otsu method seems pretty popular for OCR applications.
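For the curious, here is a minimal pure-Python sketch of the Otsu idea: try every grey level as a threshold and keep the one that maximises the variance between the two resulting classes. It assumes integer grey levels from 0 to 255 rather than the 0 to 1 floats used above, and the pixel values are made up to give an obviously bimodal example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises between-class variance,
    splitting pixels into background (<= t) and foreground (> t)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]           # number of pixels at or below t
        sum0 += t * hist[t]     # sum of their grey levels
        if w0 == 0:
            continue            # no background pixels yet
        w1 = total - w0
        if w1 == 0:
            break               # no foreground pixels left
        mean0 = sum0 / float(w0)
        mean1 = (sum_all - sum0) / float(w1)
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Made-up bimodal data: dark paper pixels and bright (inverted) ink pixels.
pixels = [20, 25, 30, 35] * 10 + [200, 210, 220, 230] * 10
t = otsu_threshold(pixels)
binary = [1 if p > t else 0 for p in pixels]
print("threshold = {}, foreground pixels = {}".format(t, sum(binary)))
```

In practice there is no need to hand-roll this: scikit-image provides `skimage.filters.threshold_otsu`, which returns a threshold that could be used in place of the fixed 0.5.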
We can then save the binarized image and pass it to tesseract:
outfile = 'excerpt_bin.png'
skio.imsave(outfile,img)
api.SetImageFile(outfile)
line = api.GetUTF8Text()
line_en = mt.translate(line.encode('utf-8'),"en","auto")
print line,line_en
Wei water mixed with the line Pa Gou Bian ginseng
Huh. That sounds unlikely to be correct.
Well, we can check because this is a well-known poem. The actual characters of this sentence should be: 灤水發源天下無 (traditional) or 滦水发源天下无 (simplified).
The only character that has been correctly identified is:
We should probably have expected this. Really we need to train a classifier on Zhao Mengfu’s calligraphy directly, and even then I’m not sure that artistic variations from poem to poem wouldn’t cause problems. Something to ponder another time.
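To put a number on a result like this, one simple option is a position-by-position comparison against the known text. The OCR output itself isn’t reproduced here, so this sketch compares the traditional and simplified forms of the line instead; the function name is made up, and the comparison assumes both strings have the same length (insertions or deletions would need something like an edit distance).

```python
# -*- coding: utf-8 -*-

def matching_characters(recognised, expected):
    """Return (position, character) pairs where the two strings agree."""
    return [(i, r) for i, (r, e) in enumerate(zip(recognised, expected))
            if r == e]

traditional = u"灤水發源天下無"
simplified = u"滦水发源天下无"

matches = matching_characters(traditional, simplified)
accuracy = len(matches) / float(len(traditional))
print(u"{} of {} characters match ({:.2f})".format(
    len(matches), len(traditional), accuracy))
```

Only four of the seven characters are shared between the two scripts, so any scoring of the OCR output needs to use the same script as the language data that produced it.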
As for the translation… well, this was always going to be a tricky one. Zhao Mengfu lived from 1254 to 1322, so his writing is roughly the equivalent of Chaucerian English. Even when you know the correct characters in Chinese, a properly interpreted translation is going to be difficult.
For those who are interested, the full poem reads:
and the Google translation (I’ve highlighted the excerpt sentence):
poem = 'baotu_spring.txt'
ifile = codecs.open(poem,'r','utf-8')
while True:
    line=ifile.readline()
    if not line: break
    line_en = mt.translate(line.encode('utf-8'),"en","zh-TW")
    print line_en
Baotu Spring. Luanhe water generation in the world without, flat out of white pot. Valley virtual long fear of gas vent, the old drought do not worry about the East China Sea dry, cloud Run steamed steam is not injected, waves of sound Zhenhu Lake. When the spring on the toilet dust, snow full of Qingxingguan. Right two questions are Jinan near Guo Jia Department, Gong Jin home Qi also, then for the book this. Meng Fu.
I can’t find a properly interpreted translation of the poem online, which is a pity.
Basically this attempt was stupidly ambitious. There’s a lot of more thorough and better thought-out work on OCR for calligraphy around; the Chinese Text Project is one example.
Then for the blog this.