Updated: Fri, 2006-08-25 10:28
This page is designed as a guide for those who are experts neither in
internationalization in general nor in Unicode in particular, but who want to
use correct terminology and be persuasive when discussing character sets and
the like. The purpose of this page is to equip the reader to cut through the
fear, doubt, and inaccuracy which often spring up when internationalization
rears its head. This page does not contain an actual description of the various
relevant standards.
Briefest possible guide to Unicode: NEVER CONFUSE A CHARACTER SET WITH AN ENCODING SYSTEM!! Obey this simple rule and people will be very very grateful.
If you desire a slightly longer introduction than that, please read on.
Contents:
- Unicode Terms -- read them first!
- Unicode FAQs -- the issues at a glance!
- Unicode Mistakes -- don't make them!
- Unicode Links -- better pages than this one!
Unicode Terms
The terms in this section are vital in any discussion of text representation, not just Unicode.
Character
A character is a small, indivisible unit of text, and text is composed of a string of characters. A character is not the binary representation of a text unit on disk; that would be determined by encoding. It is not the shape that appears on the screen; that's the glyph.
It is not a 'letter' either -- for historical reasons, many things are considered characters which are not letter-like entities at all. For instance, even in ASCII, the simplest and most common character set, 'bell' and 'linefeed' are characters -- not because they deserve to be but because it was once thought convenient.
Character Set
Also called a character repertoire, this is a set of characters. It is not a way of representing characters on a page or in a file. Rather, it defines the range of characters that can be so represented. Unicode is technically not so much a character set as a coded character set (see below).
Abuse of the term 'character set' is probably the biggest single problem in this discussion -- see this page for details.
Encoding
The encoding is the system by which the characters in a set are represented in binary form in a file. The Unicode set may be represented using three encodings: UTF-8, UTF-16 and UTF-32. UCS-2 is an older fixed-width relative of UTF-16, from ISO/IEC 10646, which does not handle 'surrogate characters', the less-used code points that require more than two bytes to represent. Java originally used UCS-2 as its native encoding. UCS-4 is the ISO/IEC 10646 counterpart of UTF-32 and is near-identical to it. Technically, you shouldn't say 'Java uses UCS-2 characters internally', you should say 'Java uses UCS-2 representations of Unicode characters internally'. The distinction can be important sometimes, honest.
Glyph
The glyph is the visible shape that represents a character. For something like a percent sign '%', there is quite a simple relationship between the character and the glyph. In many languages this relationship is not at all simple -- for instance, in many Sanskrit-derived languages it takes more than one character to specify a glyph. Worse yet, because people have solved typographical problems in different ways over the years, there are many characters that have nothing to do with glyphs, such as the ASCII 'bell' character mentioned above.
A character set does not specify what glyph to represent characters with; that depends on whatever is displaying the characters. Thus the same character in the same character set encoded in the same way can be represented with a totally different glyph on different occasions. Theoretically, Unicode is a set of characters, not of glyphs. In practice, however, many glyph-based distinctions have entered Unicode, so that different code points are sometimes used for the same character (see below).
Code Point
A character set defines a set of characters, but does not specify any way to refer to them. You have to assign them numbers so that they can be referred to, and those numbers are termed code points. ASCII has 128 code points, in the range 0 - 127, and the letters commonly used in computing are assigned to various code points in that range. Unicode has 0x110000 code points, in the range 0 - 0x10FFFF, most of which do not yet have characters assigned to them.
Confusingly, there can be more than one code point assigned to a character. For instance, many existing character sets contain a letter 'C', and when those character sets were assimilated into Unicode the various letter 'C's were not always unified. There are thus several code points that are assigned to 'C' (at least, there's the ASCII C, the Roman-numeral-100 C, and the full-width east Asian C). Sometimes, several code points are given to a character to represent different glyphs being used (Unicode hasn't always been terribly good about the glyph/character distinction) -- so there's also a C-in-a-circle code point, to represent the special glyph used for C when there's a circle around it.
It is the responsibility of software that uses a character set to have correct algorithms for sorting and comparing strings that may have these complex properties.
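To make the several-code-points-for-'C' situation concrete, here is a minimal Python sketch. It assumes Python 3 and its standard `unicodedata` module; the names `C_VARIANTS` and `fold_for_comparison` are invented for this illustration, and NFKC compatibility normalization is only one possible way for software to treat these code points as the same letter, not necessarily the right one for every application.

```python
import unicodedata

# Four of the code points that all amount to a letter 'C':
# ASCII C, Roman numeral one hundred, fullwidth C, circled C.
C_VARIANTS = ["\u0043", "\u216D", "\uFF23", "\u24B8"]


def fold_for_comparison(text):
    """Illustrative helper: apply NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


for ch in C_VARIANTS:
    # Show the code point, its official name, and what it folds to.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  -> {fold_for_comparison(ch)!r}")

# As raw strings the four are all different...
assert len(set(C_VARIANTS)) == 4
# ...but after folding they all become the plain ASCII 'C'.
assert {fold_for_comparison(ch) for ch in C_VARIANTS} == {"C"}
```

A naive `==` on the raw strings treats all four as unrelated, which is exactly why comparison and sorting have to be done by software that knows about these properties.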
Coded Character Set
A character set which has code points matched with characters so that the characters can be referred to is called a coded character set or CCS. Unicode is a CCS, and so are most other published character sets.
Surrogate
This term does not apply to character sets in general, only to Unicode in particular. The Unicode code space is divided (like that of some other sets) into 'planes'. Plane 0, the Basic Multilingual Plane, contains the vast majority of characters in everyday use. Characters from other planes (the other planes are mostly empty at the moment) are represented in UTF-16 by pairs of 'surrogate code points', which are little-used now but may become more common. The term 'surrogate character' is an error, as explained by the Unicode Consortium here, but is often used informally to mean a character outside of plane 0. The correct term would be 'supplementary character', although this isn't really everyday vocabulary.
Unicode FAQ
This FAQ is intended to answer (or at least to provide some explanation for) questions that are often asked by people who find they are going to have to worry about Unicode.
Q: What problems do people have with Unicode?
Not everyone likes Unicode, and some make quite a fuss about it. I personally neither like nor dislike it. Real problems with Unicode (as opposed to problems with particular encodings or with the human race in general) include:
- Not EVERY SINGLE character is represented in Unicode. Dead scripts have been slow to arrive (cuneiform was added only in Unicode 5.0), so if you work with dead languages Unicode may not cover what you need. The 'classical' characters of some languages, especially Chinese, that are no longer used but appear in important old books, are not always included, although most or all of the traditional Chinese literary canon can be represented in Unicode. Rare alphabets have lagged too (Tifinagh arrived only in Unicode 4.1), and the process of adding them still continues.
- CJK Unification. The most controversial part of Unicode by far. Remember how I said that the various letter 'C's that existed in pre-Unicode character sets were given their own different code points in Unicode? Well, with Chinese-derived ideograms, they did the opposite, merging them so that Chinese, Traditional Chinese, Japanese, and Korean versions of a character all share a code point. This was called 'Han Unification'. This has caused considerable political friction, and it also makes Unicode hard to use for some purposes in Asia. People disagree (on political lines or just according to how ornery they feel) on how hard it makes it.
- Inelegance. Because of the rules with which Unicode was made, it is not a particularly orthogonal or pleasing set of characters, containing many duplicates, and many confusing near-duplicates. In particular:
  - Unicode assimilated the existing popular character sets whole, keeping all their irregularities. This is why, for instance, the Indian scripts area contains many 'characters' that have nothing to do with language but were used as hints to display systems in the earlier days of Indian-language computing.
  - Compound characters such as a French e-acute can generally be represented either by one Unicode character or by two (a base and an accent mark), because some of the character sets that were assimilated into Unicode liked to have separate characters for accent marks and some didn't. This makes some tasks complicated.
  - Some areas other than Han ideographs have been unified (e.g. Runes).
  - Unicode contains characters that are never used, like Deseret; that are not really characters, like Terminal Control Codes; or that are just plain wacky, like Japanese cartographical icons. Yet it omits some groups of characters that are frequently used, such as i-Mode glyphs.
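The e-acute case above is easy to reproduce. The following is a small Python sketch, assuming Python 3 and the standard `unicodedata` module; `same_text` is a made-up helper name, and comparing after NFC normalization is one common approach rather than the only one.

```python
import unicodedata

composed = "\u00E9"     # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: 'e' + COMBINING ACUTE ACCENT


def same_text(a, b):
    """Illustrative helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both strings display as the same accented letter, yet:
print(composed == decomposed)           # naive code point comparison
print(len(composed), len(decomposed))   # number of code points in each
print(same_text(composed, decomposed))  # comparison after normalization
```

This is one of the tasks that gets complicated: searching, sorting, and even counting characters give different answers depending on which representation the text happens to use.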
Q: Where is Unicode used?
Technically, Unicode is used wherever the characters used are all drawn from the Unicode set -- in other words, just about everywhere. Systems that use ASCII are also using Unicode, since Unicode contains the ASCII set (and gives them the same code points they had in ASCII, too).
Systems that are oriented specifically toward Unicode notably include Java and Windows, which used to use UCS-2 as their native encoding. Windows XP now uses UTF-16 (and can therefore represent all Unicode characters, the vast majority of them in equal-length representations) while Java supports a kind of 'near-UTF-16'. Java also chose to use a 'near-UTF-8' encoding rather than regular UTF-8 in some circumstances. Oracle uses UCS-2, and therefore cannot understand surrogate characters. However, both Java and Oracle pass through surrogate characters unharmed, even if they do not recognise them as single characters.
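What a UCS-2-only system actually sees when a 'surrogate character' passes through can be sketched in Python. This is an illustration under stated assumptions, not a description of how Java, Windows, or Oracle implement anything: `utf16_surrogate_pair` is a hypothetical helper name, and the G clef character is just a convenient example from outside plane 0.

```python
def utf16_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go in the second unit
    return high, low


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a character outside plane 0
high, low = utf16_surrogate_pair(ord(clef))
encoded = clef.encode("utf-16-be")

print(f"{high:04X} {low:04X}", encoded.hex())
# Python's own UTF-16 encoder produces the same two units.
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Software that only understands UCS-2 treats those two 16-bit values as two separate, unknown characters. As long as it copies both without splitting or reordering them, the text survives the trip.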
Q: Why are there so many encoding systems?
Because encodings, even encodings of the same character set, have dramatically different properties. Some properties that people often look for in encodings are:
- ASCII transparency: When a text that contains only ASCII characters is encoded, the result should be identical to an ASCII string. This is a very useful property because systems that use such an encoding can process ASCII files without trouble. UTF-8 has this property.
- Uniform character length: It is much, much more efficient to process a string when the characters all take up the same number of bytes. Encodings like UCS-2 have this property, which is why Windows and Java adopted them; UTF-32 has it too. Most other encodings do not. A Unicode encoding that does have this property can never have ASCII transparency, because it needs more than one byte for every character, ASCII ones included.
- Completeness: Not all encodings can necessarily encode every character that appears in the set. UCS-2 (which has a flat 16 bits for each character) cannot encode surrogate Unicode characters (which are mindbogglingly rare at the moment). This might be a problem in the future. All the UTF-* encodings are complete.
The demands placed on a character encoding system vary so widely that it is unlikely there will ever be just one Unicode encoding in use.
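The three properties above can be checked directly. The sketch below assumes Python 3 and its built-in codecs; the helper names are invented for illustration, and because Python ships no UCS-2 codec, `fits_in_ucs2` is only a stand-in that tests whether every code point fits in 16 bits.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]


def is_ascii_transparent(encoding, sample="plain ASCII text"):
    """True if ASCII-only text encodes to exactly its ASCII bytes."""
    return sample.encode(encoding) == sample.encode("ascii")


def bytes_per_character(encoding, text):
    """Number of bytes each character of `text` takes in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]


def fits_in_ucs2(text):
    """Stand-in check: UCS-2 has a flat 16 bits, so nothing above U+FFFF fits."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# 'A', e-acute, a CJK ideograph, and a G clef from outside plane 0.
sample = "A\u00E9\u4E2D\U0001D11E"

for enc in ENCODINGS:
    print(enc, is_ascii_transparent(enc), bytes_per_character(enc, sample))
print("fits in UCS-2:", fits_in_ucs2(sample))
```

With this sample, UTF-8 is the only one of the three that is ASCII transparent, UTF-32 is the only one with a uniform length, and UTF-16 is uniform until the text leaves plane 0. No single encoding wins on every count, which is the point of the paragraph above.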
Q: What alternatives are there to Unicode?
Well, if you make sure to stay in America and to program in C on Unix machines and you're picky about who you meet, you have an excellent chance of never having to bother about a non-ASCII character in your life. Those who find this option difficult or distasteful can use one of the many character sets and encoding schemes that existed before Unicode for various languages. The Latin-1 character set contains the accent marks used in Western Europe, and its encoding (i.e. one byte per character) has the three virtues mentioned above. EUC-JP and EUC-KR were the standards for Unix in Japan and Korea respectively for many years, and other regions had and still have their own standards which will not vanish anytime soon.
Since Unicode appeared, there have been some efforts to create alternatives, such as the Japanese Mojikyo glyph set and the fascinating TRON. However, none of these looks like gaining the widespread acceptance that Unicode has gained.