“简体字”不简单 Jiǎn tǐ zì bù jiǎn dān: The Complexity of Simplified Chinese - Part II

Introduction

In Part I of this essay we examined how the new reality of online culture has exacerbated the problem of having two orthographies for Chinese. Readers educated in mainland China, where schools now teach simplified characters, are naturally most comfortable reading texts written in simplified Chinese. Older readers educated in the mainland before the orthographic reforms, and younger readers educated in Taiwan and Hong Kong, where traditional characters continue to be used, are naturally most comfortable reading texts written in traditional characters. Because in many cases a single simplified character stands for two or more traditional characters, mapping between the two orthographies is non-trivial and requires parsing texts at the level of words, not individual characters.
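To make the word-level point concrete, here is a minimal sketch in Python. It assumes a tiny hand-made dictionary: the names WORD_TABLE, CHAR_TABLE, to_traditional and char_by_char are invented for this illustration, and a real converter would need many thousands of entries plus proper word segmentation. The simplified character 发 stands for both 發 (as in 发展, "develop") and 髮 (as in 头发, "hair"), so a character-by-character lookup has to guess.

```python
# Hypothetical, hand-picked entries for illustration only.
WORD_TABLE = {
    "头发": "頭髮",  # "hair": here 发 must become 髮
    "发展": "發展",  # "develop": here 发 must become 發
}
CHAR_TABLE = {
    "头": "頭",
    "发": "發",  # fallback guess used when no dictionary word matches
}


def char_by_char(text):
    """Naive conversion: look up each character in isolation."""
    return "".join(CHAR_TABLE.get(ch, ch) for ch in text)


def to_traditional(text, max_word_len=4):
    """Greedy longest-match conversion: prefer whole words, then single characters."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate word starting at position i first.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in WORD_TABLE:
                out.append(WORD_TABLE[chunk])
                i += length
                break
        else:
            # No multi-character word matched; fall back to the character table.
            ch = text[i]
            out.append(CHAR_TABLE.get(ch, ch))
            i += 1
    return "".join(out)
```

With these tables, char_by_char("头发") picks the wrong character for "hair", while to_traditional("头发") finds the whole word first and gets it right.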

How well are these new realities being handled on today's world-wide web? For example, how do search engines handle semantically equivalent queries written in one orthography or the other? Are both simplified and traditional results presented? And how do content providers handle the issue of orthographic conversion?

Although it is not possible to treat these questions exhaustively, in this part we will take a look at one representative example in order to get a rough idea of how things stand.

Let's Do Some Research ...

Suppose you are a Chinese college student. Your professor has asked you to write an essay about a famous Chinese writer of your choice. You like scary movies and ghost stories. You hear about a book called Strange Stories from a Chinese Studio (聊齋誌異 liáo zhāi zhì yì in Chinese) by Pu Songling 蒲松龄. It sounds like fun, so you decide to write your essay on this writer.

Like most young people today, your first choice is probably not running across campus to the library. Your first choice is more likely a little search on the internet. Today you decide to check out the Chinese edition of Wikipedia first.

Typing the writer's name into Wikipedia (“蒲松龄” in simplified characters) produces the following results:

蒲松龄 关联度:100.0% - -
蒲松齡 关联度:2.0% - -
蒙古人名 关联度:1.7% - -
泉城广场 关联度:1.4% - -
...

Oops: here's where we encounter the first problem. Look at the first two entries. The first entry is in simplified characters. The second entry is in traditional characters —only the third character has changed to a more complicated but recognizably similar character. The second entry is, in fact, the very same article. It differs only in being presented in traditional characters instead of simplified characters.

But take a look at the “relevancy” column —“关联度” in Chinese. The second entry —2%— is wrong. Not just a little wrong. Completely wrong. The traditional character article entitled “蒲松齡” is most definitely all about Pu Songling. As surely as the first article is. The two articles are word-for-word 100% identical. They only differ in orthography.
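One way a search index could avoid this is to fold both the query and the article titles into a single orthography before comparing them. The sketch below is only an illustration of that idea, not a description of how Wikipedia's search actually works. The table TRAD_TO_SIMP and the functions fold and same_title are names made up for this example, and the table lists only the one character needed here. Folding toward simplified is the easier direction because it is largely many-to-one.

```python
# Hypothetical folding table; a real one would cover thousands of characters.
TRAD_TO_SIMP = {"齡": "龄"}


def fold(text):
    """Replace every traditional character we know about with its simplified form."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def same_title(query, title):
    """Compare two strings after folding both into one orthography."""
    return fold(query) == fold(title)
```

Under this scheme same_title("蒲松龄", "蒲松齡") is true, so both entries would be treated as equally relevant, while an unrelated title such as “泉城广场” still fails to match.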

Well, that's the first problem. But, either way, we have now found an article about the writer. So let's take a look at it by clicking on the first link so we can look at the simplified character text.

Oops again. At the very top of the page, before we even get to the article itself, a notice:

目前繁簡轉換系統出現異常,部份字詞可能會轉換錯誤。敬請留意! 中國大陸用戶很可能無法訪問維基百科。若您能瀏覽無礙,請登入後至狀況回報。(注意:若未註冊或登入,您的IP地址會被顯露。)

I won't bother translating the whole thing, just the most interesting first part which says:

The traditional-simplified Chinese conversion system is currently malfunctioning. Some characters and words may be converted incorrectly. We respectfully ask you to be aware of this ...

You got that right! Even the warning message itself is displayed almost completely in traditional characters.

And when we move on to the text, we see:

蒲松齡
维基百科,自由的百科全书
(重定向自蒲松龄)
跳转到: 导航, 搜索

蒲松龄(1640年—1715年),生於明朝崇禎十三年,卒於清朝康熙五十四年。字留仙,一字剑臣,别号柳泉居士。山东淄川(今山东淄博市淄川区)人,蒙古族(有爭議)。世称“聊斋先生”,更有“世界短篇小说之王”的美誉。

生平
蒲松龄生活在明末清初,出身小地主小商人家庭,蒲氏為淄川世家,熱中功名。父親蒲槃,此時家道已漸中落,曾娶妻孫氏、董氏、李氏,松齡為董氏子,庶出,地位地落。年少時,正處改朝易鼎之間,張獻忠、李自成軍隊流竄天下,烽火動盪。19岁时参加县府的考试,縣、府、道試均夺得第一名,取中秀才。然而他在之后科举场中很不得意,满腹实学,鄉試屡不中举,只有在46歲時被補為廩膳生,到了71岁時,才被補為贡生而已。平日除微薄田產外,僅能以教書、幕僚維生。

... a mix of traditional and simplified characters.

At this point, however, we notice a menu at the top of the page:

不转换

This menu conveniently provides orthography display choices. The items in this menu are:

Don't Convert
Mainland Simplified
Taiwan Traditional
Malaysia/Singapore Simplified
Hong Kong Traditional

... and it appears that Don't Convert is the default selection. Is this the right default? I continue to ponder the reasoning behind this choice.

When we investigate the revision history for this article, it all begins to make sense: some contributors have typed in traditional characters, and some have typed in simplified characters. The resulting document is therefore a mix of both orthographies. OK, at least now I understand how we got this mixed-orthography document.
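A mixed document like this is easy to detect mechanically. The following is a rough sketch, assuming a small hand-picked list of character pairs that exist in only one of the two orthographies. PAIRS, SIMPLIFIED_ONLY, TRADITIONAL_ONLY and classify are all names invented for this example, and a real check would need a much fuller table.

```python
# Hypothetical sample of (simplified, traditional) pairs, for illustration only.
PAIRS = [
    ("龄", "齡"),
    ("东", "東"),
    ("说", "說"),
    ("剑", "劍"),
    ("议", "議"),
    ("争", "爭"),
]
SIMPLIFIED_ONLY = {simp for simp, _ in PAIRS}
TRADITIONAL_ONLY = {trad for _, trad in PAIRS}


def classify(text):
    """Report which orthography, or both, the text appears to use."""
    has_simp = any(ch in SIMPLIFIED_ONLY for ch in text)
    has_trad = any(ch in TRADITIONAL_ONLY for ch in text)
    if has_simp and has_trad:
        return "mixed"
    if has_simp:
        return "simplified"
    if has_trad:
        return "traditional"
    # Many characters are identical in both orthographies.
    return "undetermined"
```

Run on a fragment of the article's first paragraph, “山东淄川人,蒙古族(有爭議)”, this reports "mixed": 东 is simplified, while 爭 and 議 are traditional.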

And when we click on the “大陆简体” choice for Mainland Simplified, it does appear that we get a mainly simplified text, except for that message at the top, which curiously remains in traditional characters!

Conclusion

This Chinese Wikipedia “work in progress” is just one of many examples we might have chosen to get at least a rough idea of the current state of affairs on the Chinese world-wide web. If nothing else, this single example from the very well-known Wikipedia project shows that opportunities to improve the Chinese user's experience remain wide open.