Sunday, January 3, 2010

Solving the Gaiji Problem Portably

Unicode codifies many Chinese ideographic characters, but ideographic characters are really doodles in nature, and their compositions can be fairly arbitrary. Over the years, uncommon character variants have been used in place names and personal names that have no Unicode code point. These are called Gaiji (in Japanese) or Zaozi (in Chinese). Government and other agencies need a system to deal with them.

Traditionally, one would compose one's own glyph and allocate a code point from the Private Use Area (PUA) in Unicode, but everyone assigns code points to glyphs and meanings differently. Even if each agency keeps a centralized repository to track its agency-wide PUA code points (as it would have to), it would still have problems exchanging information with another agency or an outside party, who cannot make sense of those private assignments.

Most ideographic characters can be decomposed, and Gaiji are no exception. The decomposed components are themselves existing characters. To take as examples some characters that Unicode now encodes (they were formerly Gaiji in the BIG-5 encoding), 堃 decomposes to 方方土, and 喆 decomposes to 吉吉.

For the purpose of representing Gaiji in Unicode, I advocate the following solution based on certain Unicode control characters.
  • Since a Gaiji has no Unicode representation, it should use the U+FFFD code point (replacement character) to signify this fact.
  • However, the replacement character can be annotated, in the manner of Ruby Text, with the interlinear annotation characters U+FFF9 (interlinear annotation anchor), U+FFFA (interlinear annotation separator), and U+FFFB (interlinear annotation terminator) to describe how the Gaiji can be decomposed.
A typical Gaiji sequence would consist of U+FFF9, U+FFFD, and U+FFFA, followed by an optional ideographic description character (IDC), the sequence of decomposed components, and a closing U+FFFB. Essentially, we're marking the Gaiji decomposition as Ruby Text on the replacement character. Software that doesn't understand this scheme can still scan and index the decomposed components. And as long as there is a standard way to normalize decompositions, the Gaiji Unicode sequence is interchangeable between parties. The IDC gives the renderer a hint for displaying the Gaiji graphically using glyphs from an existing font.
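
To make the sequence concrete, here is a minimal Python sketch of the scheme described above. The function names (encode_gaiji, extract_gaiji, strip_idc) are hypothetical and only for illustration, and the IDC range is assumed to be U+2FF0 through U+2FFB, the original Ideographic Description Characters block. It does not attempt the normalization of decompositions, which would still need to be agreed on separately.

```python
ANCHOR = "\uFFF9"       # interlinear annotation anchor
SEPARATOR = "\uFFFA"    # interlinear annotation separator
TERMINATOR = "\uFFFB"   # interlinear annotation terminator
REPLACEMENT = "\uFFFD"  # replacement character, stands in for the Gaiji


def encode_gaiji(decomposition):
    """Wrap a decomposition (optionally containing IDCs) as an annotated Gaiji."""
    return ANCHOR + REPLACEMENT + SEPARATOR + decomposition + TERMINATOR


def extract_gaiji(text):
    """Return the decomposition of every Gaiji sequence found in text."""
    prefix = ANCHOR + REPLACEMENT + SEPARATOR
    found = []
    start = text.find(prefix)
    while start != -1:
        body_start = start + len(prefix)
        end = text.find(TERMINATOR, body_start)
        if end == -1:
            break  # unterminated sequence; stop scanning
        found.append(text[body_start:end])
        start = text.find(prefix, end + 1)
    return found


def strip_idc(decomposition):
    """Drop IDCs (assumed range U+2FF0..U+2FFB), keeping only the components."""
    return "".join(ch for ch in decomposition if not "\u2FF0" <= ch <= "\u2FFB")


# Treat 堃 as if it were still a Gaiji: "above to below" (U+2FF1) of
# "left to right" (U+2FF0) 方方, over 土.
sequence = encode_gaiji("\u2FF1\u2FF0方方土")
```

Here extract_gaiji("Name: " + sequence) recovers the annotated decomposition, and strip_idc reduces it to the plain components 方方土, which is what an indexer unaware of IDCs would effectively see.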
