Night at the Museum II: OCR in Python

Wherein I fail to make Python read medieval Chinese calligraphy correctly, but I learn a lot about Chinese calligraphy and optical character recognition.

The Plan

It seems sensible that if we can read Chinese in our Jupyter notebook, we should try to translate the writing that’s actually on the National Palace Museum exhibits, and not just limit ourselves to the metadata.

(If you’re already lost, you need to take a look at my previous post to find out what’s going on)

Extracting writing from images like this is known as Optical Character Recognition (OCR) and the most commonly used library for doing OCR is the tesseract library. [Insert Thor jokes here]

Optical Character Recognition (OCR) works on the premise of matching polygonal outlines of objects in images to templates. If the object is a letter then it should match a letter template. This is obviously a bit limited because text can appear in many different forms: typed, handwritten, embossed and so on. However, in many cases it works pretty well.
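As a toy illustration of that premise (this is just the matching idea, not tesseract's actual algorithm), here's a sketch that classifies a 5×5 binary glyph against two made-up letter templates by counting mismatching pixels:

```python
import numpy as np

# two hypothetical 5x5 binary "letter templates"
TEMPLATES = {
    "T": np.array([[1, 1, 1, 1, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0]]),
    "L": np.array([[1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1]]),
}

def classify(glyph):
    """Return the template label with the fewest mismatching pixels."""
    scores = {label: np.sum(glyph != tmpl) for label, tmpl in TEMPLATES.items()}
    return min(scores, key=scores.get)

# a slightly noisy "T" (one pixel flipped) still matches the "T" template
noisy_t = TEMPLATES["T"].copy()
noisy_t[4, 4] = 1
label = classify(noisy_t)  # "T"
```

Real OCR engines work with extracted outlines and trained classifiers rather than raw pixel comparison, but the mismatch-counting idea is the same in spirit.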

Tesseract in Python

Tesseract itself is not a Python library, but there is a Python binding available – in fact there are several: pytesseract, pytesser and tesserocr.

For all of them you need to have the tesseract library itself installed already, as well as its dependencies (obviously).

I’m working on macOS, so I’ll use MacPorts to install it:

Install the tesseract library:

port install tesseract

Install the English tesseract language data:

port install tesseract-eng

Install the simplified Chinese tesseract language data:

port install tesseract-chi-sim

Install the traditional Chinese tesseract language data:

port install tesseract-chi-tra

(You should also install whatever other language data you want. These are the ones I’ll be using here.)

Install the Python binding to tesseract (I’m using the tesserocr library):

pip install tesserocr

Then, in Python:

import tesserocr
# print tesseract-ocr version:
print tesserocr.tesseract_version()

tesseract 3.04.01
leptonica-1.74.1
libgif 4.2.3 : libjpeg 9b : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.1.2

# print tessdata path and list of available languages:
print tesserocr.get_languages()

(u'/opt/local/share/tessdata/', [u'chi_sim', u'chi_tra', u'eng'])

The path in that last output is worth checking because if, like me, you’ve installed tesseract using MacPorts then the default path to the tessdata folder will not be correct – tesseract looks under /usr/local/ – so you need to specify the path when you initialize the tesseract API.

Getting going in Python

from tesserocr import PyTessBaseAPI

api = PyTessBaseAPI(path='/opt/local/share/')

To run tesseract you point the API at the image you want to read and then extract the text from it. Here’s my test image:

[Image: ocr-test]

imagefile = 'ocr-test.png'

api.SetImageFile(imagefile)
print api.GetUTF8Text()

The result is good, but not perfect. (But then again we don’t have all the language data for this image.)

Note that for normal font sizes tesseract only really works effectively with images that are 300 dpi or more.
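If your input is a low-resolution screenshot, you don't have to fix it by hand; a sketch using Pillow (not part of the setup above – the sizes and scale factor here are arbitrary, and the image is built in memory rather than loaded from disk) upscales the image and tags it as 300 dpi on save:

```python
from io import BytesIO
from PIL import Image

# a stand-in for a 72 dpi screenshot: a hypothetical 10x10 white square
img = Image.new("RGB", (10, 10), "white")

# upscale so the glyphs are physically larger for tesseract
big = img.resize((img.width * 4, img.height * 4), Image.LANCZOS)

# save with 300 dpi metadata (here to an in-memory buffer)
buf = BytesIO()
big.save(buf, format="PNG", dpi=(300, 300))

# re-open and check the dpi tag survived the round trip
buf.seek(0)
dpi_roundtrip = Image.open(buf).info["dpi"]
```

Note that upscaling can't invent detail that a 72 dpi capture never had – taking the screenshot at a larger zoom level is still the better fix.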

Reading other languages

What about reading Chinese? We need to alter the API call a little:

api = PyTessBaseAPI(path='/opt/local/share/',lang='eng+chi_sim+chi_tra')

For some reason we seem to need 'eng+chi_sim' specified – neither 'chi_sim+eng' nor just 'chi_sim' seems to work…

We will also need an input file, e.g. this random image of text that I pulled off the internet as a test:


[Image: caritas]


imagefile = 'caritas.png'
api.SetImageFile(imagefile)
print api.GetUTF8Text()


明爱 (伦敦〉 学院 “英国中餐文化”
口迷历史采访摘要: 黄顺有先生

Well, that seems to have worked quite nicely.

For the non-Chinese speakers, i.e. me, we can then translate the text using the method I described in an earlier post:

import mtranslate as mt

line = api.GetUTF8Text()
line_en = mt.translate(line.encode('utf-8'),"en","auto")
print line,line_en


明爱 (伦敦〉 学院 “英国中餐文化”
口迷历史采访摘要: 黄顺有先生

Caritas (London) College "British Chinese Culture" Interview with Mr. Huang Shunyou

Reading Calligraphy from the Museum

To start with, let’s select a piece of calligraphy. I’ve selected this poem on the Baotu Spring by Zhao Mengfu which is part of the Taipei National Palace Museum Open Data project.

[Image: K2B000079N000000000PAB]
The Collection of the National Palace Museum

Which has this meta-data information:

import codecs

filename = './2-12/K2B000079N000000000.txt'
ifile = codecs.open(filename,'r','utf-16')

while True:
    line=ifile.readline()
    if not line: break

    line_en = mt.translate(line.encode('utf-8'),"en","auto")
    print line,line_en


品名
Name
書趵突泉詩
Book Baotu Spring Poems
朝代
dynasty
元
yuan
作者
The author
趙孟頫
Zhao Mengfu
尺寸
size
卷 紙本 縱:33.1公分 橫:83.3公分
Volume paper vertical: 33.1 cm horizontal: 83.3 cm

I’m going to extract an excerpt from the full image.

I admit – the first time I cut out an excerpt I messed up completely. I had to go and ask some visiting summer students from Beijing University to put me on the right track.  First things first, calligraphy like this reads from right to left. The three characters on the far right of the image above are the name of the poem: Baotu Spring (趵突泉). The poem then progresses from right to left in phrases of seven characters. I’m using the first phrase as my excerpt.

I cut out my excerpt using Preview. This kind of screenshot defaults to a resolution of 72 dpi, so I manually reset that to 300 dpi in Preview. Here’s my excerpt and the binarized version of it (see below):

 

A key difference between this new input image of text and the previous test case is that the excerpt is an RGB image, whereas what we really need is a binary (black/white) image. There are two steps we need to make in order to binarize the calligraphy:

1. Convert RGB to greyscale;
2. Threshold the greyscale.
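Those two steps are just array arithmetic. A plain-NumPy sketch on a tiny synthetic image (the luma weights below are the Rec. 709 coefficients; scikit-image's rgb2grey uses essentially the same values):

```python
import numpy as np

# a tiny synthetic "RGB image": 2x2 pixels, channel values in [0, 1]
rgb = np.array([[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
                [[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]]])

# 1. RGB -> greyscale: weighted sum over the colour channels
grey = rgb @ np.array([0.2126, 0.7152, 0.0722])

# 2. threshold the greyscale into a binary (0/1) image
binary = (grey > 0.5).astype(float)
```

The light pixels end up as 1.0 and the dark pixels as 0.0, which is exactly the black/white input that tesseract wants.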

There are various ways to do this. Using the OpenCV library is one option, but just for variety I’m going to use the scikit-image library. Both are pip installable.

from skimage import io as skio
from skimage import util
import skimage

# specify the input image:
excerpt = 'excerpt.png'

# read the input image file:
img = skio.imread(excerpt)

# convert from RGB to greyscale:
img = skimage.color.rgb2grey(img)

# invert the greyscale (optional):
img = util.invert(img)

Now we need to binarize the image with a threshold. For these data that’s quite straightforward:

import numpy as np

thresh = 0.5
img[np.where(img>=thresh)]=1.
img[np.where(img<thresh)]=0.

If we wanted or needed to do something more sophisticated, there are a bunch of different approaches to thresholding. The Otsu method seems pretty popular for OCR applications.
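For the curious, a minimal sketch of Otsu's method (pick the threshold that maximizes the between-class variance of the resulting dark/light split), run here on synthetic bimodal data rather than on the calligraphy itself:

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Return the threshold that maximizes between-class variance."""
    hist, edges = np.histogram(np.ravel(img), bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    hist = hist.astype(float)

    # cumulative pixel counts and mean intensities for the two classes
    w1 = np.cumsum(hist)                 # pixels in bins 0..i
    w2 = np.cumsum(hist[::-1])[::-1]     # pixels in bins i..end
    m1 = np.cumsum(hist * centers) / np.maximum(w1, 1)
    m2 = (np.cumsum((hist * centers)[::-1]) / np.maximum(w2[::-1], 1))[::-1]

    # between-class variance for every candidate split point
    variance = w1[:-1] * w2[1:] * (m1[:-1] - m2[1:]) ** 2
    return centers[np.argmax(variance)]

# synthetic bimodal "image": dark ink pixels on a light paper background
rng = np.random.default_rng(0)
pixels = np.concatenate([rng.normal(0.2, 0.05, 500),   # ink
                         rng.normal(0.8, 0.05, 500)])  # paper
t = otsu_threshold(pixels)  # lands in the gap between the two clusters
```

In practice you'd just call skimage.filters.threshold_otsu rather than rolling your own, but the logic above is the whole trick.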

We can then save the binarized image and pass it to tesseract and mtranslate:

outfile = 'excerpt_bin.png'
skio.imsave(outfile,img)

api.SetImageFile(outfile)
line = api.GetUTF8Text()
line_en = mt.translate(line.encode('utf-8'),"en","auto")
print line,line_en


濰水摻線腮禾卞蔘

Wei water mixed with the line Pa Gou Bian ginseng


Huh. That sounds unlikely to be correct.

Well, we can check because this is a well known poem. The actual characters of this sentence should be: 灤水發源天下無 (traditional) 滦水发源天下无 (simplified).

The only character that has been correctly identified is the second one: 水 (water).

We should probably have expected this. Really we would need to train a classifier on Zhao Mengfu’s calligraphy directly, and even then I’m not sure that artistic variation from poem to poem wouldn’t cause problems. Something to ponder another time.

As for the translation… well, this was always going to be a tricky one. Zhao Mengfu lived from 1254 to 1322, so his writing is roughly the equivalent of Chaucerian English. Even when you know the correct characters in Chinese, a properly interpreted translation is going to be difficult.

For those who are interested, the full poem reads:

趵突泉。灤水發源天下無,平地湧出白玉壺。谷虛久恐元氣洩,歲旱不愁東海枯,雲霧潤蒸華不注,波瀾聲震大名湖。時來泉上濯塵土,冰雪滿懷清興孤。右二題皆濟南近郭佳處,公瑾家故齊也,遂為書此。孟頫。

and the Google translation (the excerpt sentence is the one beginning “Luanhe water”):

poem = 'baotu_spring.txt'
ifile = codecs.open(poem,'r','utf-8') 

while True:
    line=ifile.readline()
    if not line: break 

    line_en = mt.translate(line.encode('utf-8'),"en","zh-TW")
    print line_en

Baotu Spring. Luanhe water generation in the world without, flat out of white pot. Valley virtual long fear of gas vent, the old drought do not worry about the East China Sea dry, cloud Run steamed steam is not injected, waves of sound Zhenhu Lake. When the spring on the toilet dust, snow full of Qingxingguan. Right two questions are Jinan near Guo Jia Department, Gong Jin home Qi also, then for the book this. Meng Fu.

I can’t find a properly interpreted translation of the poem online, which is a pity.

Basically this attempt was stupidly ambitious. There’s a lot of more thorough and well thought out work on OCR for calligraphy around. The Chinese Text Project is one example.

Then for the blog this.
