I suck at Chinese. If someone asks me to read a Chinese story, I can’t do it unless it’s extremely easy or has pinyin on top of it. So what do you do when you find a block of text that you can’t read?

“Ah, just copy that into Google Translate! It shows you the pinyin.”

What if you have a textbook with non-selectable Chinese characters? What about images taken on your phone?

“Well, you could draw characters into Google Translate one-by-one.”

But this is the year 2020 and we’re not savages drawing cave art on the walls. Thankfully, there’s a neat solution: an open-source OCR engine, sponsored by Google for years, built for exactly this type of task. Meet tesseract.

Let’s get started.

Installation

If you’re on Windows, tesseract is available through Cygwin, or as a prebuilt installer from UB-Mannheim. Either works, but quite honestly I’d skip Cygwin because it’s bulky to download and install.

If you’re on a Mac, it’s pretty easy! Run:

brew install tesseract

You do have Homebrew installed… right? Well, if you don’t, it’s a one-liner install!
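For reference, the Homebrew install one-liner looks something like this at the time of writing (grab the current version from brew.sh rather than copying this blindly):

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"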

You’ll also need to download language packs, because tesseract doesn’t recognize Chinese out of the box. But fear not! Homebrew has, once again, the solution:

brew install tesseract-lang

On Linux, most package managers carry tesseract. If you’re on Ubuntu, it’s literally:

sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-chi-sim

If you are on any other platform, you will have to install tesseract manually, and then copy over the language packs from here. Again, you might want to check out the full instructions from the official website.
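If you do end up placing language packs by hand, the idea is: grab chi_sim.traineddata from the tesseract-ocr/tessdata repository on GitHub, drop it into a folder, and point tesseract at that folder. A rough sketch (the paths here are just examples):

# put the downloaded traineddata file somewhere tesseract can find it
mkdir -p ~/tessdata
cp ~/Downloads/chi_sim.traineddata ~/tessdata/
# tell tesseract where to look for language data
tesseract input.png output --tessdata-dir ~/tessdata -l chi_sim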

Done? All right! Now onto the recognition!

Recognition

Here is a sample file:

[image: sample file]

It can be a simple screenshot. It doesn’t matter. The point is, the characters must be crystal-clear, or tesseract may fail to recognize them. The best way of obtaining crystal-clear screenshots is to just clip from the textbook PDF.
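If the textbook is a PDF, you don’t even need a screenshot tool: poppler’s pdftoppm will render a page as a high-resolution image you can crop. A quick sketch, assuming poppler is installed and the passage is on page 42:

# render page 42 of the PDF as a 300 DPI PNG (produces something like page-42.png)
pdftoppm -png -r 300 -f 42 -l 42 textbook.pdf page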

Note that OCR software works best with high-contrast, black-and-white images (I know, the algorithm is incredibly racist). Just kidding. Color data in an image introduces variance that recognition algorithms have a hard time with. For best results, pop the image into Photoshop and drag the saturation all the way down so that it appears grayscale.
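No Photoshop? ImageMagick will happily do the same desaturation from the command line. A minimal sketch, assuming ImageMagick is installed and your image is input.png:

# convert the image to grayscale before feeding it to tesseract
convert input.png -colorspace Gray input-gray.png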

Now, run:

tesseract input.png output -l chi_sim

Here, we give it the image input.png and an output base name; the result is always a plain text file. -l chi_sim is the language flag, and swapping it changes the language recognized. For example, if I wanted Korean, I would use -l kor.
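One more trick: tesseract can combine languages with a plus sign, which helps when a page mixes Chinese and English (like the heading in the sample):

tesseract input.png output -l chi_sim+eng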

Either way, tesseract will think for a second and then produce the output in a text file named output.txt:

[强调肯定 _Emphasizing an affirmation]
A: 今年的花儿没有去年开得好。
B: 今年的花儿是没有去年开得好。可能你浇水浇多了。
A: 我是浇得多了点儿。可是君子兰开花不是开得很好吗?

Cool! Now we just paste it into Google Translate and…

[image: Google Translate showing the text with pinyin]

That does look quite nice to read now. 😀 Hope this helps!

Update – I’ve found this site that helps with pinyin display. Check it out!