(c) 2006 Edward H. Trager


Web-based Language Assistants

Delusional Obsessions with Machine Translation

Have you ever used a machine-based translation service such as Babelfish or World Lingo to assist you in understanding a foreign-language web page?

If you have, which of the following occurred when you saw the translation?:

I'll bet you dollars to donuts the answer was not D.

If you have never used such a service before, here's a good example of what happens. Below is a short paragraph in Chinese about Pu Songling, a famous writer who lived during the transitional period at the end of the Ming and the beginning of the Qing dynasties in China [蒲松龄与《聊斋志异》 (www.ccnt.com.cn)]. Pu Songling's magnum opus is 聊齋誌異 (liáo zhāi zhì yì), Strange Stories from a Chinese Studio. This paragraph sketches his life from birth in a small land-holding merchant family, to his lack of success in the civil service system, to his retreat in his studio where he wrote his famous work:

蒲松龄(1640~1715),字留仙,山东淄川人。生活在民族矛盾和阶级矛盾空前尖锐的明末清初。出身小地主小商人家庭,在科举场中很不得意,满腹实学,屡不中举,到了71岁,才考得了贡生。他牢骚满腹,便在聊斋写他的志异。

The paragraph is written in modern standard Mandarin, and there is nothing abstruse or difficult about it if you know Chinese. Now here's the mess that Babelfish makes of it:

蒲 the loose age (1,640 ~ 1,715), the character keeps the immortal, the Shandong 淄 Sichuan person. Life in national contradiction and class contradictions unprecedented incisive Ming Moqing at the beginning of. The family background young landlord small-scale merchant family, is not very self-satisfied in the imperial civil service examination field, has mind filled with the practical knowledge, repeatedly is not selected, to 71 years old, has only then tested the tribute student. His discontent has mind filled with, then is chatting the room to write his will differently.

This would be funny if it were not so exceedingly sad that the company behind this work, Systran, actually makes money from the software that produced it. Notice that the software did not even get the man's name (the first three characters of the text) correct. And by the way, he's from Shandong Province, not Sichuan: 淄川 (Zichuan) is a place in Shandong. This is from the self-proclaimed “leader mondial des logiciels de traduction.”

The problem is that translation is an exceedingly difficult task for machines. Human languages are complex and organic. They exhibit fluid grammatical and lexical rules that change depending not only on the context and subject matter, but also on the relationships between speakers and listeners, or in the case of written language, writers and readers. And they are constantly evolving. It's a lot easier to design software when the rule sets are fully specified and don't change over time. That is difficult to do for human languages. As a result, machine translation (MT) continues to fail miserably on most fronts.

In Talking to Strangers [Wired Magazine, May 2000], Steve Silberman provides a nice overview of mankind's delusional obsession with machine translation and some of the amusing historical predictions of when this technology would be ready. Silberman has a great quote from Emile Delavenay's Introduction to Machine Translation, published in 1960:

"Will the machine translate poetry? To this there is only one possible reply - why not?"

Finding A Simpler Solution

I have a different idea. A very simple idea. An idea that, unlike machine translation, will really work for hundreds of millions of people browsing the web every day.

The idea is based on the observation that huge numbers of people the world over are already literate or partially literate in more than one language. Even right here in America. In the past, “Monolingual America could stand in splendid isolation from the Tower of Babel across the water” [A Multilingual America: The Continuing Challenge (Robert Streeter, ADFL Bulletin, 1973)]. That is certainly no longer the case in America today where the fraction of people who speak a language other than English at home is now approaching one in five [Multilingual America (William H. Frey, findarticles.com, July, 2002)].

So what many people, myself included, need most of the time is not some crazy full-blown machine translation, but instead just a little help with vocabulary words in those second or third languages that we happen to know. Depending on who you are and where you live, that second or third language might be English, French, Arabic, Chinese, Japanese, or something else.

To prove my point, let's take a look at an AJAX-based vocabulary assistance service I wrote to help with Chinese. I call this Cicada Assistant because it is based on my AJAX-ified Cicada Chinese-English Dictionary (知了汉英词典). Here's a screenshot of the assistant with the web page about Pu Songling loaded:

[Screenshot: Cicada Assistant]

All one has to do is to highlight a word in the web page with the mouse and —voila!— the assistant performs a lookup in a dictionary back on the server. You might want to try it yourself.

As shown in the screenshot above, Pu Songling's name in the text is highlighted. His name just happens to be in the database, and thus an entry appears almost instantaneously in the assistant window on the left.

The names of many other famous people you might come across are missing from the database. Of course, the quality of service will improve as more names are added to the database. But that's not really an issue. Where the service really shines is in allowing the user to quickly highlight and look up just those Chinese characters and character compounds (i.e., words) he or she doesn't know well.

Any reader of Chinese will instantly recognize that the first three characters are in fact a person's name. And, in the event that the user is not sure how to pronounce that person's name, he or she can always highlight each character individually to find out. Pronunciations are provided in both pinyin and zhuyin fuhao to satisfy people of all educational upbringings.

Notice also that the assistant provides the entry in both simplified and traditional Chinese characters. As a result, this service is useful not only to foreign students of the language, but also to native speakers. The Pu Songling page above is from mainland China and is therefore written in simplified characters. A reader from Taiwan or Hong Kong who is unfamiliar with some of the simplified characters used in the mainland, although undoubtedly able to guess quite a lot from context, may still find it convenient to confirm the identity of certain unfamiliar simplified characters. And of course the reverse would be true for a mainland reader perusing a traditional character text from Taiwan or Hong Kong.
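To make the idea of a dual-script entry concrete, here is a minimal Python sketch of how one entry might be stored so that it can be found from either script. The `Entry` fields, the two sample entries, and the `lookup` function are all assumptions made for illustration; this is not the actual Cicada database schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    """One dictionary entry (hypothetical layout, for illustration only)."""
    simplified: str
    traditional: str
    pinyin: str
    zhuyin: str
    gloss: str


# Hypothetical sample data; a real service would read this from a database.
ENTRIES = [
    Entry("蒲松龄", "蒲松齡", "Pú Sōnglíng", "ㄆㄨˊ ㄙㄨㄥ ㄌㄧㄥˊ",
          "Pu Songling, author of Strange Stories from a Chinese Studio"),
    Entry("龄", "齡", "líng", "ㄌㄧㄥˊ", "age"),
]

# Index every entry under both written forms, so a reader of either
# script reaches the same record.
INDEX: Dict[str, Entry] = {}
for entry in ENTRIES:
    INDEX[entry.simplified] = entry
    INDEX[entry.traditional] = entry


def lookup(selection: str) -> Optional[Entry]:
    """Return the entry for the highlighted text, or None if it is unknown."""
    return INDEX.get(selection.strip())
```

With this layout, `lookup("蒲松龄")` and `lookup("蒲松齡")` return the same entry, and a reader unsure of one character can highlight just `龄` or `齡` to see its pronunciation.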

(A little disclaimer: Although the Cicada Assistant may seem fairly polished, in reality this is still just a demonstration service. The JavaScript remains quite rough and the service may not yet work properly on every web site you visit.)

Note that a simple service like this can be done for any language —not just Chinese. And in fact similar services already exist on the web. One notable service is Pattara Kiatisevi's Longdo dictionary service designed for Thai users. Here's a screenshot of the Longdo service with a page from CNN loaded:

[Screenshot: Longdo service]

From a user's perspective, this service is quite similar to the Cicada service I wrote. As shown in the screenshot, simply hovering over a word ("safety pin") with the mouse brings up an English-to-Thai dictionary entry. Japanese, German, and French-to-Thai services are also available.

From a technical perspective, Pattara's service and my demonstration Cicada service differ a bit. Pattara's service is quite a bit more complicated. On the backend, the Longdo service parses all the words out of the text stream, performs database lookups to find the definitions, and then sends the complete dictionary definition result set (hidden in JavaScript) along with the content of the original web page back to the client for viewing.
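The up-front approach described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Longdo's actual code: the toy dictionary, its placeholder definitions, the function names, and the `DEFINITIONS` variable are all hypothetical.

```python
import json
import re

# Hypothetical toy dictionary; the definition strings are placeholders.
TOY_DICTIONARY = {
    "safety": "(definition of 'safety')",
    "pin": "(definition of 'pin')",
}


def precompute_definitions(page_text, dictionary):
    """Pull every run of Latin letters out of the text and look each one up in advance."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", page_text)}
    return {w: dictionary[w] for w in sorted(words) if w in dictionary}


def embed_definitions(html, definitions):
    """Hide the whole result set in a script block appended to the page."""
    # Escape "</" so a definition cannot close the script element early.
    payload = json.dumps(definitions, ensure_ascii=False).replace("</", "<\\/")
    script = "<script>var DEFINITIONS = " + payload + ";</script>"
    if "</body>" in html:
        return html.replace("</body>", script + "</body>", 1)
    return html + script


definitions = precompute_definitions("A safety pin is not a Pin cushion.", TOY_DICTIONARY)
page = embed_definitions("<html><body>hi</body></html>", definitions)
```

Note that every word on the page is looked up and shipped to the client whether or not the reader ever hovers over it, which is where the extra bandwidth and server time go.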

In the case of the Cicada Assistant, a lot less work is done up front. On the backend, the server simply reads the content of the original web page, adds the small amount of JavaScript code needed for the Assistant, and sends it all back to the client for viewing. No advance dictionary lookups are required because queries are performed in a “just in time” manner via AJAX only when actually required.
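For comparison, here is a minimal Python sketch of the "just in time" server side: one function that adds a script reference to the fetched page, and one that answers a single lookup request. The `/assistant.js` path, the function names, and the JSON response shape are assumptions made for illustration, not the real Cicada code.

```python
import json
import re

# Hypothetical script reference; the real assistant code lives elsewhere.
ASSISTANT_TAG = '<script src="/assistant.js"></script>'
BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def add_assistant(html):
    """Insert the assistant script just before the last closing body tag."""
    matches = list(BODY_CLOSE.finditer(html))
    if not matches:
        # No body tag found: fall back to appending the script at the end.
        return html + ASSISTANT_TAG
    cut = matches[-1].start()
    return html[:cut] + ASSISTANT_TAG + html[cut:]


def handle_lookup(selection, dictionary):
    """Answer one AJAX request: look up only the text the user highlighted."""
    key = selection.strip()
    entry = dictionary.get(key)
    return json.dumps(
        {"query": key, "found": entry is not None, "entry": entry},
        ensure_ascii=False,
    )
```

The page itself is passed through almost untouched, and the dictionary is consulted only once per highlighted selection.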

The approach used in the Cicada Assistant is superior not only because it uses less bandwidth and computational time, but for other reasons as well. Principally, the text does not need to be parsed into words. While finding words in English, French, or German is facilitated by the spaces between the words, parsing out words in other languages like Chinese and Thai is much more difficult. Word segmentation in many of the languages of Southeast Asia (Thai, Lao, Khmer, Myanmar (Burmese)) is especially difficult as these languages are written using scripts that customarily do not have spaces between words.
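A small Python sketch shows why segmentation is the hard part. Splitting on spaces works for English but returns Chinese text as one undivided chunk. The `max_match` function below is a greedy longest-match segmenter, a classic baseline technique that I am using purely as an illustration (neither service is claimed to use it); the tiny lexicon is hypothetical. Its output is only as good as its word list, which is exactly the burden the highlight-and-look-up approach avoids.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation against a word list."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word is
        # found; a single character is always accepted as a last resort.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


english = "a safety pin"
chinese = "他牢骚满腹"  # taken from the Pu Songling paragraph above

print(english.split())   # spaces do the work: ['a', 'safety', 'pin']
print(chinese.split())   # no spaces, so one chunk: ['他牢骚满腹']

# Hypothetical two-word lexicon; a real one needs many thousands of entries.
lexicon = {"牢骚", "满腹"}
print(max_match(chinese, lexicon))  # ['他', '牢骚', '满腹']
```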

The approach of highlighting a word by clicking and dragging the mouse across it thus works for all of these languages. In many cases it is not even necessary to highlight the entire word. Discarding grammatical prefixes and suffixes and highlighting just the root of a word should be sufficient in most cases. Here again we rely on the fact that a human user most often knows how to parse not only the words, but also the grammatical suffixes or prefixes of a given language much better than a machine.

Conclusion

Perhaps a lot of rocket computer science has been done on the grammatical parsing of languages like English, French, and Arabic. For these languages, perhaps sophisticated machine parsing algorithms will become available in a few years and thus open up opportunities for sophisticated machine translation services. But even if such systems become available for English, French, and Arabic, it seems unlikely that speakers of Thai or Khmer will be able to benefit much from that work. Simple web-based "just in time" vocabulary assistants as described here are more likely to help.

In life, all most of us ever really need is a little help from our friends. Let this be the case when using computers as well.

- Ed Trager 2007.02.27