I suck at Chinese. If someone asks me to read a Chinese story, I can’t do it unless it’s extremely easy or has pinyin on top of it. So what do you do when you find a block of text that you can’t read?
“Ah, just copy that into Google Translate! It shows you the pinyin.”
What if you have a textbook with non-selectable Chinese characters? What about images taken on your phone?
“Well, you could draw characters into Google Translate one-by-one.”
But this is the year 2020 and we’re not savages drawing cave art on the walls. Thankfully, Google has a neat solution: an image OCR library designed for this type of task. Meet tesseract.
Let’s get started.
If you’re on Windows, tesseract is available in Cygwin, or as a compiled executable from UB-Mannheim. You choose, but quite honestly I don’t like Cygwin because it is quite bulky to download and install.
If you’re on a Mac, it’s pretty easy! Just install it with Homebrew.
You do have Homebrew installed… right? Well, if you don’t, it’s a one-liner install!
You may also need to download language packs. Yes, tesseract doesn’t detect Chinese out of the box. But fear not! Homebrew has the solution once again:
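A minimal sketch of the Homebrew route, assuming the formula names tesseract (the engine) and tesseract-lang (the extra language data, Chinese included). The wrapper install_tesseract_macos is only an illustrative name; running the two brew lines by hand does the same thing.

```shell
# Install tesseract and its extra language data with Homebrew.
# Assumption: the formulas are called "tesseract" and "tesseract-lang".
install_tesseract_macos() {
    brew install tesseract         # the OCR engine itself
    brew install tesseract-lang    # additional languages, Chinese included
}

# Call it once Homebrew is set up:
# install_tesseract_macos
```

The language data is a separate, fairly large download, which is presumably why it is not bundled with the engine.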
On Linux, most package managers should carry tesseract. If you’re on Ubuntu, it’s literally:
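Something along these lines should do it, assuming the usual Ubuntu package names: tesseract-ocr for the engine and tesseract-ocr-chi-sim for Simplified Chinese. The function name install_tesseract_ubuntu is made up for illustration.

```shell
# Install tesseract and the Simplified Chinese data on Ubuntu.
# Assumption: packages are named tesseract-ocr and tesseract-ocr-chi-sim.
install_tesseract_ubuntu() {
    sudo apt install -y tesseract-ocr tesseract-ocr-chi-sim
}

# Run it when you are ready to install:
# install_tesseract_ubuntu
```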
If you are on any other platform, you will have to install tesseract manually, and then copy over the language packs from here. Again, you might want to check out the full instructions from the official website.
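Whichever route you took, it may be worth checking that Chinese actually got installed. A small sketch using tesseract's --list-langs flag; check_tesseract_lang is a hypothetical helper name, and chi_sim is the Simplified Chinese code used later in this post.

```shell
# Report whether tesseract is installed and whether a language pack is present.
# Usage: check_tesseract_lang chi_sim
check_tesseract_lang() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH"
        return 1
    fi
    # --list-langs prints one installed language code per line
    # (some older builds print to stderr, hence 2>&1).
    if tesseract --list-langs 2>&1 | grep -qx "$1"; then
        echo "$1 is installed"
    else
        echo "$1 is missing"
        return 2
    fi
}

check_tesseract_lang chi_sim || true
```

If it reports the language as missing, the language pack step above is the one to revisit.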
Done? All right! Now onto the recognition!
Here is a sample file:
It can be a simple screenshot. It doesn’t matter. The point is, the characters must be crystal-clear, or tesseract may fail to recognize them. The best way of obtaining crystal-clear screenshots is to just clip from the textbook PDF.
Note that image recognition software works best with black-and-white images (I know, the algorithm is incredibly racist). Just kidding. Color data in an image introduces variance that image recognition algorithms have a hard time with. For best results, pop into Photoshop and turn the saturation all the way down so that the image appears grayscale.
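With a clean, grayscale image ready, the recognition itself is a single tesseract call. A minimal sketch, where ocr is a hypothetical wrapper name and input.img and output are placeholder file names; the language defaults to chi_sim if you leave it off.

```shell
# ocr IMAGE OUTPUT_NAME [LANG]: run tesseract and print the recognized text.
ocr() {
    # Same shape as typing: tesseract input.img output -l chi_sim
    # tesseract appends .txt to the output name on its own.
    tesseract "$1" "$2" -l "${3:-chi_sim}" && cat "$2.txt"
}

# Example (file names are placeholders):
# ocr input.img output chi_sim
```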
Here, we give it the image as input.img, and specify the output filename. It will always be a text file.
-l chi_sim is the language option. You can change this to change the language recognized. For example, if I wanted Korean, I would swap chi_sim for tesseract’s Korean language code.
tesseract will think for a second and then produce the output in a text file named after the output filename you gave it.
Cool! Now we just paste it into Google Translate and…
That does look quite nice to read now. 😀 Hope this helps!
Update – I’ve found this site that helps with pinyin display. Check it out!