Introduction
The purpose of this list is this: given the name of a character set, find out a little bit about it.
For each character set, the following information is stored:
- The 'main' name of the character set. Where possible, this is the name of the standard it is defined in.
- Other names, nicknames, and aliases by which the set is known.
- Whether the 'character set' is actually a character set or an encoding system, or both. In some cases the entry is for a family of character sets.
- Names of related character sets
- A description of the character set
There is a table of contents which contains 'real' names and aliases.
Issues
This document takes a very vague view of what constitutes a character set, and uses terms roughly according to normal average usage rather than according to correct usage. For instance, what is called a 'character set' throughout is actually almost always a 'coded character set', a distinction the ISO themselves manage to forget not infrequently.
For a (basic) discussion of character-set related issues and terms, my
Unicode tutorial could be useful.
This document presumably contains errors and leaves out countless character sets and encoding systems. Please
get in touch with me to correct my failings.
Contents:
All names and nicknames in alphabetical order:
| 7-bit ISO 2022 |
encoding | family |
Also called: ISO-2022
Languages: Any
See also:ISO-2022-JP ISO-2022-JP-2 ISO-2022-KR ISO-2022-CN ISO-2022-CN-EXT
As well as defining the EUC family of encodings for Unix, the ISO 2022 standard defines a set of 7-bit encodings presumably intended for mainframes. These include Japanese, Korean, and Chinese encodings.
7-bit ISO 2022 is an extensible type of encoding, of which certain specializations for particular languages are actually used (e.g. ISO-2022-JP for Japanese). ISO 2022 encoding has the following properties:
- The encoding can represent many character sets, which may be 1-byte or 2-byte
- Escape sequences mark the transition from one character set to another. If you need to use a given character set in a 7-bit ISO 2022 encoding, there must be an ISO escape sequence registered for that set.
- The encoding respects the space traditionally reserved for control characters, so there are 94 possible 1-byte characters and 94*94 possible 2-byte characters. Character sets larger than this cannot be used.
A large number of escape sequences is registered, but still not so many that every useful character set can be ISO 2022 encoded. Escape sequences should begin 'ESC (' for single-byte sets and 'ESC $ (' for double-byte sets, but this rule does not seem to be followed particularly closely.
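As a rough illustration of the escape-sequence mechanism, the sketch below uses Python's built-in iso2022_jp codec (one specialization of 7-bit ISO 2022) and lists the escape sequences it emits. The helper name and sample string are made up for this example, and the helper assumes every escape sequence is exactly three bytes long, which holds for plain ISO-2022-JP but not for ISO 2022 in general.

```python
ESC = 0x1B

def escape_sequences(data: bytes) -> list:
    """Collect the ESC sequences found in ISO-2022-JP encoded bytes.

    Assumption: each sequence is exactly three bytes (ESC plus two more),
    which is true for the sets used by plain ISO-2022-JP only.
    """
    found = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            found.append(data[i:i + 3])
            i += 3
        else:
            i += 1
    return found

# Roman text, then kanji, then Roman text again.
encoded = "abc 日本語 xyz".encode("iso2022_jp")
seqs = escape_sequences(encoded)

# Every byte stays below 0x80, as a 7-bit encoding requires.
seven_bit_clean = all(b < 0x80 for b in encoded)
```

With this codec the switch into the double-byte kanji set is written ESC $ B rather than ESC $ ( B, which is one instance of the prefix rule not being followed, and the switch back to the Roman set is ESC ( B.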
Designed for use in emails, the ISO-2022 family of encodings would be a good starting point for anyone wondering why the world needed Unicode.
Also called: PC-ISCII
Languages: Pan-Indian
See also:ISCII
ACII is a variant of ISCII designed for increased compatibility with PC 8-bit character sets that contain graphics characters. ACII includes some box-drawing characters at the expense of some of the less popular ISCII characters (e.g. digits).
Also called: ASCII,
ASCII-1968,
ECMA-6
Languages: Latin
See also:ASCII-1963
This is the good old ASCII we know and love. The ANSI X 3.4 standard specified not only the ASCII character set but a whole series of rules relating to representation on punched tape and so on, now mercifully forgotten.
ASCII made many improvements on ITA2 and FIELDATA, its predecessors. In particular, it included a large number of control codes, it tried to include a superset of all the characters available in the telegraphy character sets of the time, and it made at least some attempt to lay out punctuation in sensible blocks.
The original 1963 version of ASCII had no lowercase letters and a different array of control characters than the ASCII we are used to. The 1967 version of the standard created modern ASCII. Since most character sets since then have been designed with
ASCII somewhere in mind, the various quirks of 1967 ASCII have become quirks not only of most character sets, but of the way computer engineers think of characters. Some of these quirks are:
- ASCII contains not just characters in the linguistic sense, but characters which represent formatting information (e.g. vertical tab, linefeed)
- Even more strikingly, ASCII contains not just 'content' characters such as text and formatting, but 'connection' characters intended for error checking and teleprinter control. This is a result of the way ASCII was used at the time and of the teletype legacy.
- It was decided that accent marks would be represented in ASCII by designating certain characters as 'diacritical marks' which, when printed over a regular letter, would create a new letter, much as you might type a letter, go back over it and type an accent on a mechanical typewriter.
- Certain other characters were designated 'national use' characters which could be replaced when desired by accented characters. Thus, despite its small size, ASCII managed to have two totally separate systems for producing accented characters.
ASCII is a 7-bit coded character set. When used on 8-bit computers, there is always the question of what to do with the extra 128 code points that become available if the 8th bit is used. Most attempts to represent European languages on computers have focused on assigning various characters to the 128 upper code points of ASCII, creating various ASCII-compatible 8-bit character sets. These sets are mostly standardized in the ISO-8859 standard.
Note on etiquette: 8-bit character sets based on ASCII should, theoretically, avoid assigning 'printing characters' (i.e. characters that are actual language characters as opposed to control codes) to code points that are the 8-bit equivalent of ASCII control characters. This is for the benefit of 7-bit machines and systems that may strip the 8th bit
from a character. Some manufacturers have paid more attention to this rule than others, depending on how much they needed the extra code points and how much they cared about legacy 7-bit computer systems. Notably, most Microsoft encoding systems produce bytes that violate this rule, and many 8-bit character sets for languages with many accents (e.g. Vietnamese) assign code points in violation of this rule. Many would say that this is a good example of the point at which
backwards compatibility becomes not worth maintaining.
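The etiquette rule can be checked mechanically. The sketch below, using only Python's standard codecs, lists which bytes in the range 0x80 to 0x9f (the 8-bit counterparts of the ASCII control characters) decode to something other than a control character. The function name is invented for this example; latin-1 and cp1252 are used as one well-behaved and one rule-breaking set.

```python
import unicodedata

def printable_in_c1_range(codec: str) -> list:
    """Return the bytes 0x80-0x9F that the codec maps to non-control characters."""
    hits = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the codec
        if unicodedata.category(ch) != "Cc":  # "Cc" is the control category
            hits.append(byte)
    return hits

latin1_hits = printable_in_c1_range("latin-1")  # ISO-8859-1 keeps the range for controls
cp1252_hits = printable_in_c1_range("cp1252")   # Microsoft's Latin1 variant does not
```

Here latin1_hits comes back empty, while cp1252_hits includes 0x80, which CP1252 assigns to the euro sign.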
Also called: REACC,
EACC
Languages: CJK, Chinese, Korean, Japanese
See also:CCCII
This standard was created by the Research Libraries Group (a nonprofit academic organization) based on CCCII. The structure is the same as that of CCCII, but there are some corrections and tweaks:
- Some rare characters and variants were dropped
- Some characters are now considered mere variants
- The simplified layer is now reserved for officially 'simplified' forms rather than any variant that just happens to be simpler than the main form
- More kana, kokuji (Japan-only kanji) and hangul were added
Also called: Early ASCII
Languages: Latin
The original, 1963 version of ASCII specified only uppercase letters but was otherwise similar to modern ASCII. The range now occupied by lowercase letters
was undefined. There were various other differences, especially in control characters and in the inclusion of left and up (but not right and down!) arrow symbols.
Also called: Atari ASCII
Languages: Latin
Atari ASCII was used on Atari's line of 8-bit home computers and was thus reasonably widespread in the 80s. It is a fairly unconventional and difficult ASCII variant. In particular:
- With two exceptions, the upper 128 characters are simply the graphical inverse of the lower 128, and thus not really characters so much as alternate display forms. The exceptions are the end-of-line character and bell character, which both do have their high bit set.
- ATASCII uses the control character area of ASCII (below 0x20) to hold graphics characters, and stores control characters elsewhere.
As with PETSCII, the character set incorporated terminal control codes including cursor movement, and it was thus possible to create animations consisting of an ATASCII string.
| Adobe Standard Encoding |
charset |
Languages: Latin
This is an 8-bit character set used by Adobe in PostScript. The lower half is as ever the same as ASCII; the upper half contains a scattering of typographic and accented characters.
Also called: EBCDIC
Languages: Latin
See also:IBM EBCDIC
This is the name given to EBCDIC versions formed by taking the original EBCDIC dialect and filling in the other characters from ISO-8859-1. For some strange reason, all EBCDIC dialects formed in this way seem to leave code point 0x155 empty.
It appears that there are several Augmented EBCDIC dialects, one for each dialect of original EBCDIC. Presumably they differ in punctuation.
Languages: Chinese
Big5 is a character set originating in Taiwan, used to write traditional Chinese. The name comes from the five companies that collaborated to create it. It specifies just over 13,000 hanzi.
The term 'Big5' has been abused quite a lot over the years. The original Big5 character repertoire is no longer used and the name 'Big5' usually means one of the many extensions. The main extensions are Microsoft's CP950, Big5-ETen, and Big5-HKSCS.
Big5 is both a character set and an encoding. As an encoding, it is a DBCS encoding with lead bytes in the range 0xa1-0xf9 and trail bytes in the range 0x40-0xfe.
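A minimal sketch of how a reader of Big5 bytes might use those ranges, taking the lead-byte range as 0xa1-0xf9 and the trail-byte range as 0x40-0xfe. The function names are hypothetical, and the splitter makes no attempt to validate that a pair is actually assigned.

```python
def is_big5_lead(byte: int) -> bool:
    return 0xA1 <= byte <= 0xF9

def is_big5_trail(byte: int) -> bool:
    return 0x40 <= byte <= 0xFE

def split_big5(data: bytes) -> list:
    """Split Big5 bytes into single-byte and double-byte units."""
    units = []
    i = 0
    while i < len(data):
        has_next = i + 1 < len(data)
        if is_big5_lead(data[i]) and has_next and is_big5_trail(data[i + 1]):
            units.append(data[i:i + 2])  # lead byte plus trail byte
            i += 2
        else:
            units.append(data[i:i + 1])  # plain single byte
            i += 1
    return units

sample = "Big5 中文".encode("big5")
units = split_big5(sample)
```

Note that the trail range overlaps ASCII (0x40 is '@'), so a byte cannot be classified without knowing whether the byte before it was a lead byte.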
Languages: Chinese
Big5+ is, on paper, the largest extension of Big5, but it is not currently well supported (if indeed it is supported at all). Big5+ uses code points that clash with the GCCS and HKSCS Hong Kong-oriented Big5 variants.
| Big5-ETen |
charset | encoding |
Also called: Big5
Languages: Chinese
An extension to Big5 by ETen Information Systems, Big5-ETen has kana, cyrillic characters and circled digits, although not at the same code points as other Big5
extensions. Big5-ETen is a superset of CP950.
| Big5-HKSCS |
charset | encoding |
Also called: HK SCS-200
Languages: Chinese
This extension to Big5 is a superset of CP950. It adds characters used in Hong Kong. In 1999 it replaced the HK government's older GCCS extension to Big5.
Languages: CJK, Chinese, Korean, Japanese
See also:EACC
Developed in Taiwan, CCCII (stands for Chinese Character Code for Information Interchange) is a very comprehensive system for representing all characters found in all forms of Chinese, Japanese and Korean.
CCCII is composed of 94 planes, which are 94x94 code points. A 'layer' is made up of six planes. The layers are occupied as follows:
- 1: Symbols and traditional han characters
- 2: Simplified han characters
- 3-12: Variant han character forms
- 13: Japanese kana and kokuji (Japanese-only kanji)
- 14: Korean hangul
- 15: Reserved
- 16: Misc. Korean and Japanese characters
The first twelve layers have a special relationship. A given code point corresponds to a given han character, no matter which layer is being used. The code point on layer 1 is occupied by the traditional form, on layer 2 the same code point is occupied by the simplified form, and on higher layers the variants are stored.
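The layer relationship amounts to simple arithmetic on plane numbers. The sketch below is an assumption-laden model rather than a real CCCII implementation: it treats a code point as a 1-based (plane, row, cell) tuple, assumes layers are consecutive blocks of six planes starting at plane 1, and uses an arbitrary made-up position instead of a real character.

```python
PLANES_PER_LAYER = 6  # "A 'layer' is made up of six planes"

def layer_of(plane: int) -> int:
    """Return the 1-based layer that a 1-based plane number falls in."""
    return (plane - 1) // PLANES_PER_LAYER + 1

def same_character_on_layer(code: tuple, layer: int) -> tuple:
    """Move a (plane, row, cell) code to the matching position on another layer."""
    plane, row, cell = code
    offset_in_layer = (plane - 1) % PLANES_PER_LAYER
    new_plane = (layer - 1) * PLANES_PER_LAYER + offset_in_layer + 1
    return (new_plane, row, cell)

traditional = (1, 23, 45)  # hypothetical position on layer 1
simplified = same_character_on_layer(traditional, 2)
first_variant = same_character_on_layer(traditional, 3)
```

Under these assumptions the simplified form sits exactly six planes above the traditional form, and each further variant another six planes up.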
CCCII has some structure within the layers, as well. Within layer one, hanzi are divided into three groups based on rarity. Within layer two, there is a distinction between PRC-specific simplifications and generic simplifications.
CCCII is ideal for bibliographic and scholarly purposes but not much used elsewhere. It also has some problems, including many repeated characters and the fact that no commodity software can read it. EACC is a subset of CCCII that removes many problematic characters.
| CNS 11643 |
charset | encoding |
Languages: Chinese
The CNS (Chinese National Standard) character set and encoding system is an extremely comprehensive system for representing Traditional Chinese, encoded with 3 bytes per character.
The original 1986 standard defined 16 planes of 94x94 characters each. Planes 1 and 2 contained Big5 characters (but not in Big5 order), plane 14 contained user characters.
The 1992 standard filled in many other planes. 1 and 2 are still Big5, but the others are as follows:
- 3: About 6000 hanzi from the original plane 14
- 4: About 7000 rare hanzi, and some hanzi from Unicode that were not included in Big5
- 5: 8600 rare hanzi
- 6: 6388 variant forms with up to 14 strokes
- 7: 6539 variant forms with over 14 strokes, perhaps the most nightmarishly difficult set of han characters ever encoded
An eighth plane of about 7000 even more abstruse characters is thought to be under development.
Although CNS 11643 is the national standard of Taiwan, Big5 is much more common in practice.
Languages: Latin
Microsoft's extension of ISO-8859-2. Mysteriously, its number is lower than that of Microsoft's ISO-8859-1 variant.
Languages: Cyrillic
See also:KOI-8R
This is the Microsoft code page for Cyrillic. It is available in two flavors, Standard and Russian. Both focus on including the largest possible number of Cyrillic characters, even more than KOI-8 Unified. Both character ordering and graphics characters are sacrificed but the result
is the richest repertoire of Cyrillic characters in any 8-bit character set. The 'Russian' flavor includes accented characters.
CP1251 seems to have replaced KOI-8R as the most common Cyrillic character set.
Languages: Latin
Microsoft extension of ISO-8859-1 (Latin1). Has a euro symbol.
Also called: WinGreek
Languages: Greek
See also:ISO-8859-7
The Microsoft extension to ISO-8859-7 has some troubling incompatibilities; notably, the capital-alpha-with-tonos is on a different code point. As with many Microsoft code pages, code points in the 0x80 to 0x9f control character range have been assigned to printable characters. Tut.
Languages: Latin, Turkish
The Microsoft code page for Turkish. Based on ISO-8859-9.
Also called: WinArabic
Languages: Arabic
This is Microsoft's modified version of ISO-8859-6. There are considerable differences, in that Microsoft try to preserve ISO-8859-1 compatibility by putting
accented letters and symbols in their Latin1 positions and then filling in Arabic characters around them.
Languages: Latin
The Microsoft code page for Baltic languages. Based on ISO-8859-4.
Languages: Vietnamese
See also:VISCII
This is the Microsoft character set/encoding for Vietnamese. It is based on the TCVN5712 standard but with some minor changes, perhaps so as to be more compatible with Latin1.
Also called: DosLatinUS,
OEM437,
IBM CP 437
Languages: Latin
The character set used by American DOS versions, specified by IBM. It included a few accent marks and many, many graphics characters. In particular, it included graphics characters
below 0x20 (traditionally the control code area).
Also called: DosGreek
Languages: Greek
The DOS Greek 8-bit character set.
Also called: DosBaltRim
Languages: Latin
The DOS Baltic character set.
Also called: DosLatin1
Languages: Latin
Later DOS versions used CP850 instead of CP437. CP850 had the Latin1 (ISO-8859-1) repertoire, but positioned so as to be compatible with CP437.
Also called: DosLatin2
Languages: Latin
The Latin2 repertoire, ordered so as to be compatible with the original DOS character set, CP437.
Also called: DosCyrillic
Languages: Cyrillic
See also:KOI-8
The original cyrillic character set for DOS, ordered so as to be compatible with the original DOS character set, CP437. It was therefore not KOI-8 compatible.
Also called: DosTurkish
Languages: Latin
The DOS Turkish character set.
Also called: DOSPortuguese
Languages: Latin
The DOS Portuguese character set.
Also called: DOSIcelandic
Languages: Latin
The DOS Icelandic character set.
Also called: DOSHebrew
Languages: Hebrew
The DOS Hebrew character set.
Also called: DOSCanadaF
Languages: Latin
The DOS French Canadian character set.
Also called: DOSArabic
Languages: Arabic
The DOS Arabic character set.
Also called: DOSNordic
Languages: Latin
The DOS Scandinavian character set.
Also called: DosCyrillicRussian
Languages: Cyrillic
See also:KOI-8 Alternative CP855
A KOI-8 Alternativny based set of cyrillic characters used on DOS. It replaced the KOI-incompatible CP855.
Also called: DOSGreek2,
IBM Modern Greek
Languages: Greek
See also:CP737
This alternative DOS Greek standard replaced the earlier CP737 with a repertoire and ordering based on IBM usage.
Languages: Thai
Microsoft's CP874 code page is based on TIS-620, the usual 8-bit Thai set, but adds some extra characters in unused code points.
Also called: Shift-JIS,
SJIS
Languages: Japanese
CP932 is Microsoft's favored way of representing Japanese (at least up until the rise of XML and UTF-*). It is a combination of the JIS X 0201 and JIS X 0208 character sets together with an encoding system whereby all the 8-bit code points that do not represent half-width katakana are used as lead bytes for kanji.
Unlike EUC-JP, Shift-JIS is not ASCII compatible (code points that should be control codes are used as lead bytes), nor is it particularly simple to process. It is also impossible to represent the JIS X 0212 kanji in this encoding scheme.
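One practical side of that incompatibility is that the second byte of a Shift-JIS character can fall in the ASCII range. The sketch below uses Python's shift_jis and euc_jp codecs to find such bytes; the helper name and the sample word are chosen only for illustration.

```python
def ascii_range_bytes(text: str, codec: str) -> list:
    """Return (character, byte) pairs where a multi-byte character
    contains a byte below 0x80."""
    hits = []
    for ch in text:
        encoded = ch.encode(codec)
        if len(encoded) > 1:
            hits.extend((ch, b) for b in encoded if b < 0x80)
    return hits

sample = "ソフト"  # katakana for "soft"
sjis_hits = ascii_range_bytes(sample, "shift_jis")
eucjp_hits = ascii_range_bytes(sample, "euc_jp")
```

In Shift-JIS the trail byte of 'ソ' is 0x5c, the ASCII backslash, so byte-oriented software that treats backslash specially can corrupt the text. The same string in EUC-JP produces no bytes below 0x80.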
Languages: Chinese
This is Microsoft's favored Chinese encoding/character set combination. It is an extension of EUC-CN that covers all Unicode han characters.
Also called: UHC,
Unified Hangul Code
Languages: Korean
This is Microsoft's favored way of representing Korean. It is a derivative of EUC-KR, extended to include all johab precomposed hangul. Like other East Asian Microsoft encodings, it allows ASCII trail bytes, and lead bytes in the control code range, thus losing a form of ASCII compatibility.
Also called: Big5
Languages: Chinese
CP950 is Microsoft's version of Big5, usually referred to as 'Big5' in Microsoft environments. It incorporates various extensions to the original Big5 character set.
CP950 defines a block of characters in the range 0xF9D6-0xF9FE. All the other Big5 extensions keep this range of characters and are therefore supersets of CP950. This makes
CP950 a common choice when deciding what variant of Big5 to support, although the official Taiwanese standard would be CNS 11643 encoded as EUC-TW.
Also called: EBCDIC
Languages: Cyrillic
See also:IBM EBCDIC
Cyrillic EBCDIC abandons the lowercase Roman letters to make way for a rather abbreviated list of Cyrillic characters. Other characters are then added in the punctuation area, and the whole thing is structured so that when 'folded' in IBM punched card style,
the resulting character set contains upper case Cyrillic and some (but not all) punctuation.
Languages: Japanese
See also:Super DEC Kanji
This encoding system was developed by Digital Equipment Corporation to represent Japanese. It can encode the JIS X 0201 and JIS X 0208 character sets. DEC Kanji also allowed about 2000 user-defined characters.
It is obsolete compared to Super DEC Kanji.
See also:ISO-8859-1
DEC-MCS was the 'Multinational Character Set' used in DEC's vt220 terminals. It formed the basis of, and is a subset of, the more famous ISO-8859-1 set. The Latin letters eth and thorn, the international currency symbol, and a couple of other punctuation marks were added to make ISO-8859-1.
Also called: DGI
The DG-International character set was formerly used with the DG Interactive Cobol environment. It included ASCII and 69 extra characters.
Also called: KOI-8R,
RFC 1489
Languages: Cyrillic
See also:KOI-8
KOI-8R was a character set proposed in the 1980s by the Demos company. It was based on KOI-8 but replaced non-Russian characters with graphics characters, and added the 'dotted e' character which was missing from KOI-8. The former change was not terribly popular but the latter was necessary, so many KOI-8 variants and hacks were made that included a 'dotted e'. The term KOI-8R often seems to be used to mean 'KOI-8 with a dotted e'. KOI-8R was, and may still be, the most widely used Cyrillic character set.
Languages: Greek
See also:ISO-8859-7
This Greek character set was uppercase-only. It was superseded by ELOT-928, which in turn was standardized as ISO-8859-7.
Languages: CJK, Japanese, Chinese, Korean
See also:EUC-JP EUC-TW EUC-CN EUC-KR
The EUC encoding systems are a group of encodings for CJK character sets. They were defined in ISO-2022 for use in 8-bit systems (i.e. Unix as opposed to mainframes).
EUC stands for Extended Unix Code and the system has been primarily used on Unix. EUC encodings allow the use of four 'code sets', of which the first (listed as Set 1 in the entries for the individual EUC encodings) is always the local equivalent of ASCII (e.g. JIS X 0201 for Japanese encoding). The other three may be unused or may correspond to a particular character set that is being EUC encoded.
The four flavors of EUC encoding in use are EUC-JP, EUC-CN, EUC-KR, and EUC-TW.
Languages: Chinese
See also:EUC
This is the EUC encoding for simplified Chinese. The code sets are:
- Set 1: GB 1988
- Set 2: GB 2312
- Set 3: unused
- Set 4: unused
Languages: Japanese
See also:EUC
This is the EUC encoding for Japanese. The code sets are assigned as follows:
- Set 1: JIS X 0201 (i.e. Roman)
- Set 2: JIS X 0208
- Set 3: Half-width katakana
- Set 4: JIS X 0212
The presence of half-width katakana in this encoding (although not as part of any common character set) extends its repertoire to be equivalent to that of Microsoft's Shift-JIS. EUC, however, is better behaved in that it does not use control-character codes illegally.
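The code sets listed above can be told apart by the first byte of each encoded character. The sketch below relies on the usual EUC convention, not stated in this list, that half-width katakana are prefixed with the single-shift byte 0x8e and JIS X 0212 characters with 0x8f; the function name is made up, and Python's euc_jp codec does the encoding.

```python
def euc_jp_code_set(encoded: bytes) -> str:
    """Classify one EUC-JP encoded character by its first byte."""
    first = encoded[0]
    if first < 0x80:
        return "roman"                # one byte, ASCII-like
    if first == 0x8E:
        return "half-width katakana"  # shift byte followed by one byte
    if first == 0x8F:
        return "JIS X 0212"           # shift byte followed by two bytes
    return "JIS X 0208"               # two bytes, both with the high bit set

# One Roman letter, one hiragana, one half-width katakana.
kinds = {ch: euc_jp_code_set(ch.encode("euc_jp")) for ch in "Aあｱ"}
```

Because every non-Roman byte has its high bit set, ASCII bytes in an EUC-JP stream always mean ASCII characters.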
Languages: Korean
See also:EUC
This is the Korean EUC encoding. The code sets are:
- Set 1: KS C 5636 (Roman)
- Set 2: KS C 5601
- Set 3: unused
- Set 4: unused
Languages: Chinese
See also:EUC
This is the EUC encoding for traditional (Taiwanese) Chinese. The code sets are:
- Set 1: ASCII
- Set 2: CNS 11643 Plane 1
- Set 3: CNS 11643 Planes 1-16
- Set 4: unused
Code set 2 takes less space to encode in EUC, so the duplication of CNS 11643 Plane 1 allows common characters to be represented more concisely.
Also called: DoD 8-bit Code
Languages: Latin
See also:ASCII
FIELDATA is a character set used in the Cold War-era US military. It was in some ways the ancestor of ASCII. Over 128 code points, it distributes upper and lowercase Roman letters, a rather miserly allocation of punctuation, the numerals, and a large number of control codes (although in fact FIELDATA predates the concept of a code point).
FIELDATA may still be in use in some 1960s-era computers.
Languages: Chinese, Uighur
This character set contains 70 primary and 72 supplementary characters for writing the Uighur script, an Arabic-derived script.
Languages: Chinese, Korean
The official PRC standard for the Korean script.
This set is identical to the Chinese basic set (GB 2312) up until row 9 -- in other words, latin, greek, kana, bopomofo and pinyin characters are all the same. The sole exception is that the currency sign is not a yuan sign (nor even a won sign) but a dollar sign.
In subsequent rows, about 5000 pre-combined hangul are defined, although the ordering is unlike Korean standards. There are also 94 hanja, which are 'idu', ancient han-character-based phonetic characters, rather than the kind used in Korea today.
Also called: GB 13000-1.93
Languages: Chinese
This is the Chinese version of ISO 10646 (Unicode). It is identical to the ISO specification and is kept in sync with it.
Languages: Chinese, Yi
This character set is a double-byte 94x94 representation of the Yi script (an ideographic script used in Sichuan, Yunnan, Guizhou and Guangxi).
Languages: Chinese, Tibetan
This character set includes 169 Tibetan letters, digits, symbols, and control codes. The symbols include astronomical and mathematical symbols. Both Tibetan characters and the characters used to indicate Sanskrit transliteration are included, so the total character repertoire is likely as large as the number of surviving Tibetans.
Languages: Chinese
Until recently, han characters added to Unicode were added to the GB 13000 standard (i.e. the Chinese reflection of the Unicode standard) and to GBK, the character set for normal use. However, GBK ran out of code points and was unable to represent the 6,582 han characters of CJK Unified Ideographs Extension A when those characters arrived in Unicode 3.0.
GB 18030 was therefore created to represent all the Unicode 3.0 hanzi. It is compatible with GBK and the now-aged GB 2312 set, yet covers all Unicode code points. It is not yet as widely used as the older sets, however.
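A small sketch of what that compatibility looks like in practice, using Python's gb18030, gb2312 and gbk codecs. The helper name is hypothetical, and U+3400 is used simply because it is the first code point of Extension A.

```python
def encoded_length(ch: str, codec: str):
    """Return the encoded length in bytes, or None if the codec cannot encode ch."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None

ext_a = "\u3400"  # first code point of CJK Unified Ideographs Extension A

lengths = {
    "ascii": encoded_length("A", "gb18030"),
    "gb2312_hanzi": encoded_length("中", "gb18030"),
    "ext_a_hanzi": encoded_length(ext_a, "gb18030"),
}

# Characters inherited from GB 2312 keep their original two bytes.
same_as_gb2312 = "中".encode("gb18030") == "中".encode("gb2312")

# GBK has no room for this character, so this should come back as None.
gbk_length = encoded_length(ext_a, "gbk")
```

GB 18030 stays byte-compatible with the older sets by keeping their one-byte and two-byte codes and adding four-byte codes for everything else.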
Also called: GB-Roman
Languages: Chinese
The ASCII variant of mainland China, identical to ASCII but for the dollar sign, which is replaced with a yuan sign.
GB stands for Guo Biao (National Standard), and indicates an official People's Republic of China standard.
Also called: GB Internal Code,
GB 2312-80
Languages: Chinese
GB 2312 is the basic Simplified Chinese character set. It has a strong resemblance to JIS X 0208, the basic Japanese character set. In particular, it includes
kana, greek, and cyrillic characters in the same area, and divides han characters into two levels, with level 1 arranged by reading and level 2 ordered by radical and stroke count.
GB 2312 may be represented in either 7-bit form or 8-bit form, depending on whether compatibility with 7-bit systems is more important than distinguishing Chinese characters from ASCII characters.
If GB 2312 is being represented in 8-bit form, the high bit of each byte is set to 1. This effectively creates a new character set. The combination of this set with ASCII is known as 'GB Internal Code'.
GB 2312 is usually encoded in either the HZ or EUC-CN systems.
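The relationship between the 7-bit and 8-bit forms can be sketched as follows, assuming the usual 94x94 row/cell numbering (1 to 94) with each coordinate offset by 0x20. The function names are invented for this example; row 16, cell 1 is used because it is the first level 1 hanzi.

```python
def gb2312_7bit(row: int, cell: int) -> bytes:
    """7-bit form: row and cell (each 1-94) shifted into the range 0x21-0x7E."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])

def gb2312_8bit(row: int, cell: int) -> bytes:
    """8-bit form: the 7-bit form with the high bit of each byte set."""
    return bytes(b | 0x80 for b in gb2312_7bit(row, cell))

seven = gb2312_7bit(16, 1)  # indistinguishable from the ASCII text "0!"
eight = gb2312_8bit(16, 1)  # cannot be mistaken for ASCII
```

The 8-bit result matches what Python's gb2312 codec (which implements the EUC-CN form) produces for the same character.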
Also called: GB2
Languages: Chinese
A set of 7237 supplementary hanzi for GB 2312. Also known as GB2.
Languages: Chinese
A set of 7039 supplementary hanzi for GB 2312. Also known as GB4.
Languages: Chinese, Mongolian
This character set contains 94 characters representing the post-Revolution Mongolian alphabet. This is the last vertical-only writing system left, and is distantly descended from Sanskrit via Uighur. It has been extensively normalized in recent times.
Also called: GB1,
GB/T 12345-90
Languages: Chinese
GB 12345 is the traditional equivalent of the simplified character set GB 2312. It is used for representing traditional mainland Chinese as opposed to traditional Taiwanese Chinese. It is also called 'GB1'.
Also called: GB3,
GB/T 13131-9X
Languages: Chinese
The traditional Chinese version of GB 7589. Also known as GB3.
Also called: GB5,
GB/T 13132-9X
Languages: Chinese
The traditional Chinese version of GB 7590. Also known as GB5.
Also called: CP936
Languages: Chinese
GBK is both a character set and an encoding. As a character set, it is a superset of GB 2312, and includes traditional as well as simplified hanzi. The encoding
is variable length, with one byte for ASCII and two for GBK characters.
GBK was created because of a need to include the extra Unicode characters from GB 13000 in a GB 2312 compatible coded character set. Therefore, in GBK the characters of GB 2312 occupy their original code points and the GB 13000 characters are fitted in around them.
Microsoft's CP936 is actually another name for GBK.
Also called: Big5-GCCS
Languages: Chinese
This extension to Big5 was developed by the Hong Kong government (it stands for Government Chinese Character Set). It introduced Japanese kana, some simplified hanzi and variant glyphs, and most importantly
Hong Kong placenames to Big5. It is now superseded by Big5-HKSCS.
Languages: Cyrillic
GOST-13052 was an old Russian Cyrillic 7-bit character set. Being 7-bit, it had to store characters on top of the ASCII range. Ingeniously, Cyrillic letters were assigned code points in such a way as to correspond to ASCII letters of the opposite case. Thus, when GOST text was viewed as ASCII, it was just barely understandable, and could be distinguished from ASCII by the fact that words tended to start with a lowercase letter and continue with uppercase ones.
The property of a Cyrillic character set being readable when viewed as ASCII, or easily transformed into ASCII, persisted in the KOI-* family of Cyrillic coded character sets.
Also called: KOI-7,
KOI-8,
Original GOST-19768
Languages: Cyrillic
See also:GOST-13052 New GOST-19768
The GOST-19768 standard defined two character sets, KOI-7 and KOI-8. KOI-7 is a 7-bit character set that included only capital Roman letters and has not had much impact on history.
KOI-8 became the basis of more than 20 years of Cyrillic character sets. It was an 8-bit set, with ASCII characters in the low half and cyrillic in the high half. It had the property inherited from the earlier
GOST-13052 character set, that stripping the high bit from the Cyrillic characters would make them somewhat readable as ASCII.
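The high-bit property is easy to see with Python's koi8_r codec. This is a sketch under the assumption that KOI-8R keeps the Cyrillic letter layout of KOI-8 (Python has no codec for the original KOI-8); the function name and sample word are illustrative only.

```python
def strip_high_bit(data: bytes) -> str:
    """Clear bit 8 of every byte and read the result as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")

original = "Привет"  # Russian for "hello"
readable = strip_high_bit(original.encode("koi8_r"))
```

The result is 'pRIWET': a rough Latin transliteration with the case inverted, so a capitalized Russian word comes out starting with a lowercase letter and continuing in uppercase.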
KOI-8 was often used in a slightly extended form, with the 'dotted e' character added at points 0xa3 and 0xb3. This character had been left out in GOST-19768.
The 1987 version of GOST-19768 changed the character ordering completely and has a separate entry in this list.
Also called: GT Code,
GT Font
Languages: Japanese
See also:Mojikyo
GT Code is a coded character set for Japanese kanji, which also contains a large amount of meta information to assist in kanji searching and categorizing. Like the similar but more widespread Mojikyo, GT Code is more of a glyph set than a character set in many ways. GT Code contains about 70,000 entries, far more than Unicode but fewer than Mojikyo. Unlike Mojikyo, however, GT Code contains only kanji, so it may be the largest set of kanji electronically collected.
GT Code is a product of the Tokyo University Multilingual Research Society. It is intended more as a database of information about characters than as a way of representing text in bulk.
Also called: HP-Roman
Languages: Latin
This 8-bit ASCII-compatible character set was used by Hewlett-Packard on their HPUX OS and HPTerm terminals. It contains various
Western European accented characters.
Also called: HZ-GB-2312
Languages: Chinese
HZ is a system usually used to encode GB 2312-80, or one of its many variants. It is exactly like ISO 2022 7-bit encoding, but the 'escape sequences' that are characteristic of that kind of encoding are strings of ASCII characters instead. Specifically, the tilde is used as an escape character.
Also called: DBCS,
DBCS PC,
DBCS Host
Languages: All
IBM DBCS is the double-byte system used on many IBM systems (those that aren't restricted to EBCDIC). There are two very different flavors:
- DBCS-PC: In practice, this system represents Japanese as Shift-JIS and Korean as EUC-KR.
- DBCS-Host: This uses markers to shift between 1 and 2 byte modes, and can represent any set of characters with 16 bit code points.
DBCS-PC actually specifies only the double-byte part of a multibyte (i.e. variable length characters) encoding system. The user has to pick a single-byte
character set to use with IBM DBCS.
DBCS-Host uses EBCDIC as the character set for single-byte characters.
Also called: DBCS-EUC
Languages: All
IBM developed DBCS-EUC for representing CJK characters on AIX. It is closely related to EUC encoding.
| IBM EBCDIC |
charset | family |
Also called: EBCDIC
Languages: Latin
See also:Localized EBCDIC Original EBCDIC Augmented EBCDIC Japanese EBCDIC Cyrillic EBCDIC
EBCDIC is an encoding, or rather a large family of related encodings, used by IBM. EBCDIC is 8-bit, but unlike most 8-bit encodings it does not have a lower half similar to ASCII and an upper half customized for local needs. Rather, characters
are placed according to the historical needs of punched card machines. This results in the Roman alphabet being stored in several non-contiguous regions.
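The non-contiguous alphabet can be seen by grouping the uppercase letters into runs of consecutive code points. The sketch uses Python's cp037 codec as a representative EBCDIC dialect (an assumption, since dialects differ); the function name is made up for this example.

```python
import string

# Code point of each uppercase Roman letter in one EBCDIC dialect (CP037).
codes = {ch: ch.encode("cp037")[0] for ch in string.ascii_uppercase}

def contiguous_runs(letters: str) -> list:
    """Group letters into runs whose EBCDIC code points are consecutive."""
    runs = [letters[0]]
    for prev, cur in zip(letters, letters[1:]):
        if codes[cur] == codes[prev] + 1:
            runs[-1] += cur   # still inside the same block
        else:
            runs.append(cur)  # a gap: start a new block
    return runs

runs = contiguous_runs(string.ascii_uppercase)
```

The alphabet falls into three blocks (A-I, J-R and S-Z), so a test such as 'A' <= c <= 'Z' on raw code points also accepts non-letters in this dialect.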
EBCDIC is legendary for its complexity, its multitude of incompatible dialects, and the way almost every implementation cheerfully ignores most relevant rules.
The original non-contiguous character layout of EBCDIC was rooted in the idea that the two halves of the character set, when superimposed, could form a smaller yet still useful character set. Many later EBCDIC versions break this requirement.
EBCDIC stands for 'Extended Binary Coded Decimal Interchange Code', a name that seems to make sense until you read it again more slowly.
Also called: ISCII
Languages: Pan-Indian
Indian Script Code For Information Interchange (ISCII) emerged in 1993 as the standard 8-bit character set for India. India's wealth of languages has always posed
unique challenges, and the effect on ISCII has been that unlike other national 8-bit standards, this character set has a very strong distinction between characters and glyphs. The high (non-ASCII) half of ISCII specifies about 80 characters which can be
used with Devanagari or other glyphs to write various languages. Many code points in ISCII do not map directly to a displayed glyph but are interpreted according to the glyph set being used. For example, there are meta-characters
that indicate a bare vowel or an 'alternative' glyph, the interpretation of these terms being left up to the software displaying the text.
In many ways ISCII is more like a code that requires an interpreter to render it into readable glyphs than like a conventional character set.
ISCII can be used with the following glyph sets:
- Devanagari
- Gujarati
- Gurmukhi
- Oriya
- Bengali
- Assamese
- Telugu
- Kannada
- Malayalam
- Tamil
...which is also the range of Indian scripts available in Unicode. Because the Indian area of Unicode is based on ISCII, some of the dummy characters and metacharacters that were needed in ISCII are now enshrined in Unicode, even though they do not correspond to any actual language entity. This is one of the problems with the way Unicode was first compiled...
ISCII is perhaps the most interesting and ingenious of the 8-bit ASCII-based character sets. It is also the hardest to use because the renderer must resolve many ligatures, diacritical marks and other things that are only hinted at in the ISCII byte stream, *and* do so for more than one glyph set!
Also called: C-DAC
Languages: Pan-Indian
See also:ISCII
Although the original ISCII standard was able to represent most Indian languages intelligibly, its limited number of code points could not express, even with metacharacters, the amount of information
needed for Indian language processing, leaving most decisions at the mercy of the text rendering agent -- usually the font. C-DAC, a company, therefore developed ISFOC, which standardises the rendering of the text
and also serves as an encoding scheme and character set, thus eliminating the role of ISCII.
ISFOC (Intelligence Based Script Font Code, an acronym that doesn't seem to fit very well) is a coded character set containing all the basic elements required for rendering an Indian script. ISFOC 'characters' are not characters or even linguistically recognizable entities of any kind; they
are elements which are combined jigsaw-style to build up a glyph. Separate ISFOC sets exist for the different Indian scripts, and sets exist for scripts like Tibetan that are not covered by ISCII. However, they are unified by the fact that algorithms (ISFA) are defined to convert each one to and from ISCII.
ISFOC allows 188 entities per script, which is not enough to display some scripts optimally. It also only allows the display of one script at once. Because of this, and because the full repertoire of ISFOC and ISCII is in Unicode, Unicode will probably eventually become the
most popular way to represent Indian language text.
Also called: Unicode
Languages: Any
See also:UTF-7 UTF-8 UTF-16 UTF-32 UCS-2 UCS-4
The development of the mighty ISO-10646 or 'Unicode' character set is perhaps the most significant development ever in internationalization. The aim of Unicode is nothing less than to contain every character in the world, and while there are many well-discussed flaws in Unicode it is already an invaluable character
set for many languages. With Unicode the dream of being able to process text without thinking about the particular keyboard it was typed in from took a step closer to reality.
Unicode is managed by the Unicode Consortium, a large and diverse group that is something like the opposite of the World Wide Web Consortium, in that it puts out standards at an annoyingly slow rate.
Although Unicode is primarily a character set, the Unicode standard actually contains a wealth of other information, including character types and widths and normalization data. This latter is very important because of the high level of duplicate characters, variant characters, and combining characters in Unicode.
Unicode has various problems, which can be briefly summarized thus:
- The original strategy was to include existing character sets in Unicode wholesale. This results in many duplicate characters, or characters that are non-linguistic but were included in earlier character sets for convenience.
- For the same reason, many characters that had obscure technical uses in their original character sets are present in Unicode even though they have no meaning outside their original set.
- Some groups of characters, notably hanzi/hanja/kanji, were 'unified' meaning that variants deemed to be the same root character were given the same code point. This caused various problems, especially with Japanese names, and as a result more blocks of characters containing variants had to be added later. This has resulted in a very vague notion of what constitutes a 'character' in CJK ideographs, and a certain amount of bad feeling.
- Unicode includes both combining characters (accents and base characters separately) and pre-combined characters. This means that some text can be represented in many, many ways in Unicode and makes normalization a huge and difficult enterprise.
- Although it was originally stated that Unicode would store characters, and only characters, not glyphs or other entities, in practice there is no strong distinction between characters, variant characters, glyphs, and variant glyphs. This is especially true of CJK ideographs.
- The process of adding new groups of characters to Unicode is very, very slow, and mistakes (as with Runic unification) are difficult to ever correct.
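The combining versus pre-combined problem from the list above can be shown in a few lines. This is a minimal sketch using Python's standard `unicodedata` module; the sample character is my own choice.

```python
import unicodedata

precombined = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but compare unequal.
print(precombined == combining)

# Normalizing both to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", combining)     # composes to one code point
nfd = unicodedata.normalize("NFD", precombined)   # decomposes to two code points
print(nfc == precombined, nfd == combining)
```

Any program that sorts, searches, or compares Unicode text has to pick a normalization form and apply it consistently, which is where the 'huge and difficult enterprise' comes in.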
Despite these issues, most would agree that Unicode is a tremendously useful tool. Furthermore, the Unicode standard specifies a number of encodings, at least one of which is suitable for practically any environment, be it 7-bit mainframes, 8-bit Unix, or modern environments such as Java or .NET.
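To make the range of encodings concrete, here is a small sketch that encodes one character (the euro sign, chosen arbitrarily) with several of the Unicode encodings available in Python's standard library.

```python
text = "\u20ac"  # EURO SIGN

# Encode the same character with a 7-bit, an 8-bit, a 16-bit and a 32-bit scheme.
encoded = {
    name: text.encode(name)
    for name in ("utf-7", "utf-8", "utf-16-be", "utf-32-be")
}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes: {data.hex()}")
```

UTF-7 keeps every byte below 0x80 for 7-bit channels, while the others trade size against simplicity of indexing.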
Those scripts that did not develop any character set or encoding standards before the advent of Unicode will almost certainly wind up using Unicode. This includes Khmer and the African scripts (Tifinagh, Ethiopic etc) as well as many historical scripts. Support for cuneiform and Linear B is, however, sadly still far away.
Because the process for adding new ranges of characters to ISO-10646 is extremely slow, there are large numbers of scripts whose most formal computer representation is as an 'Annex' to the ISO-10646 standard. In some cases these Annexes resemble independent character sets (which are waiting to have Unicode code points allocated to them and thus to become coded character sets).
Languages: Chinese
See also:7-bit ISO 2022 ISO-2022-CN-EXT
This specialization of 7-bit ISO-2022 encoding is used for Chinese. Like ISO-2022-KR, regular ISO escape sequences are eschewed in favor of shift sequences. These shift sequences do not toggle the character stream from one set to another,
but are used before every single character. The following character sets are supported:
- ASCII
- GB 2312
- CNS 11643 Plane 1
- CNS 11643 Plane 2
Thus both simplified and traditional characters can be represented.
Any line on which a character from a given set (other than ASCII) appears must be marked with a 'designator sequence' indicating that set. This is vaguely similar to ISO-2022-KR, except that in KR the designator need only appear once per file.
In sum, ISO-2022-CN bears no resemblance to the theoretical generic ISO-2022 encoding or to anything that a sane waking human could be expected to imagine. This sort of encoding is the reason that, with all its faults, we should be very very grateful for Unicode.
Languages: Chinese
See also:7-bit ISO 2022 ISO-2022-CN
This ISO-2022 encoding extends ISO-2022-CN by adding support for about a dozen more character sets, including GB/T 12345, planes 3 to 7 of CNS 11643, and GB 7590. Each of these character sets has an ISO-registered designation
sequence, as demanded by the rules of ISO-2022-CN.
Languages: Japanese
See also:7-bit ISO 2022
This specialization of 7-bit ISO-2022 encoding is used for Japanese. The permitted character sets are
- ASCII
- JIS X 0201
- JIS C 6226
- JIS X 0208
This is a very limited set indeed, which is why ISO-2022-JP-2 is used instead.
Languages: Japanese
See also:7-bit ISO 2022 ISO-2022-JP
This specialization of 7-bit ISO-2022 encoding is used for Japanese. It includes more character sets than the earlier
ISO-2022-JP standard, to wit:
- JIS X 0212
- GB 2312
- KS C 5601
In other words, it includes common Chinese and Korean characters as well as Japanese ones. This standard was introduced before
Chinese and Korean had ISO-2022 encodings of their own.
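The difference from plain ISO-2022-JP can be sketched with Python's `iso2022_jp` and `iso2022_jp_2` codecs. The helper name `can_encode` and the sample text are assumptions made for this example.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if the codec can represent every character in text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

text = "日本語 한국어"  # Japanese followed by Korean

print(can_encode(text, "iso2022_jp"))    # hangul is outside the JP sets
print(can_encode(text, "iso2022_jp_2"))  # JP-2 also permits KS C 5601

data = text.encode("iso2022_jp_2")
print(data)  # still 7-bit, with an extra escape sequence for the Korean part
```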
Languages: Korean
See also:7-bit ISO 2022
This specialization of 7-bit ISO-2022 encoding is used for Korean. It permits only two character sets (ASCII and KS C 5601). Furthermore, it defines a 'designator sequence', an escape sequence that must
appear in any document in which non-ASCII characters occur, before the first non-ASCII character. Furthermore, the 'escape sequences' used to switch between ASCII and Korean characters are not actually escape sequences
(they do not start with an escape character).
These changes reflect the needs of email systems, and ISO-2022-KR has been in widespread use since 1991 in Korean emails.
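The designator and the non-escape 'escape sequences' can be seen in a short sketch using Python's `iso2022_kr` codec. The byte names below follow the usual ISO 2022 terminology (SO = Shift Out, SI = Shift In); the sample text is my own.

```python
DESIGNATOR = b"\x1b$)C"  # announces KS C 5601, appears once
SO = b"\x0e"             # shift to Korean
SI = b"\x0f"             # shift back to ASCII

data = "한글 ok".encode("iso2022_kr")
print(data)

# The designator precedes the first Korean character,
# and the shifts are single control bytes rather than ESC sequences.
print(data.startswith(DESIGNATOR + SO), SI in data)
```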
Also called: International ASCII
Languages: Latin
The ISO-646 standard specified several national versions of ASCII, i.e. ASCII-like 7 bit character sets. These generally replaced the less-used characters in
traditional ASCII with accent marks, local currency symbols, and what have you. All character sets specified in ISO-646 are made obsolete by those in ISO-8859 (which
in turn should really be considered obsoleted by ISO-10646, Unicode).
Languages: Any
See also:ISO-8859-1 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-5 ISO-8859-6 ISO-8859-7 ISO-8859-8 ISO-8859-9 ISO-8859-10 ISO-8859-11 ISO-8859-12 ISO-8859-13
ISO 8859 is a standard that specifies a large number of 8-bit character sets. The lower (7-bit) half of each set is ASCII. The upper half contains a set of characters suited for a particular range of languages;
for instance ISO 8859-2 handles central and eastern European languages that use Roman characters.
Because ISO 8859 character sets are small, they do not handle CJK (Chinese/Japanese/Korean) characters. There is an ISO 8859 standard for practically every other script, though, with more still under consideration.
ISO 8859 has been an important standard for Latin languages (i.e. those using Roman characters) in particular, but is probably losing ground to Unicode now, since Unicode makes it possible to represent all the characters in all
ISO 8859 sets at once.
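The shared lower half and the varying upper half can be sketched by decoding the same two bytes under three ISO 8859 parts, using codecs from Python's standard library (the bytes chosen are arbitrary).

```python
raw = bytes([0x41, 0xE9])  # one low-half byte, one high-half byte

decoded = {
    codec: raw.decode(codec)
    for codec in ("iso8859_1", "iso8859_5", "iso8859_7")
}

# The first character is always "A"; the second depends on the part.
for codec, text in decoded.items():
    print(codec, text)
```

This is also why a byte stream in an ISO 8859 set is meaningless without knowing which part it uses, a problem that disappears once everything is in Unicode.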
Also called: Latin1
Languages: Latin
See also:ISO-8859
This is the ubiquitous Latin-1 character set. It handles all western European languages, and as an added bonus it also handles all African languages except Bantu languages.
Also called: Latin6
Languages: Latin
See also:ISO-8859
This is derived from ISO-8859-4; it drops Latvian support and adds Lapp and Icelandic support, thus becoming the ISO-8859 charset for Scandinavia.
Also called: TIS-620
Languages: Thai
TIS-620 is the Thai character set used in Thailand (other Thai dialects may be represented in different character sets). It is in the process of being approved as ISO-8859-11.
TIS-620 is an 8-bit set of which the low half is of course ASCII. The baht (currency) sign is put in the high half, rather than replacing the ASCII dollar.
All Thai characters are also present in Unicode and a TIS-620 to Unicode mapping is not difficult.
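As a sketch of why the mapping is not difficult: the high half of TIS-620 lines up with the Unicode Thai block by a constant offset. The function name below is hypothetical and it does not reject the handful of unassigned bytes, so it is an illustration rather than a full converter; Python's own `tis_620` codec is used for comparison.

```python
def tis620_to_unicode(byte: int) -> str:
    """Map one TIS-620 byte to a character by offset (sketch, not a full codec)."""
    if byte < 0x80:
        return chr(byte)                # low half is ASCII
    return chr(0x0E00 + (byte - 0xA0))  # high half follows the Thai block order

# The baht sign lives in the high half at 0xDF.
baht = bytes([0xDF])
print(baht.decode("tis_620"), tis620_to_unicode(0xDF))
```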
Languages: Latin
See also:ISO-8859
This is the (provisional) ISO-8859 standard for the Baltic. It has the Latvian characters that were lost in Latin6.
Also called: Latin8
Languages: Latin
See also:ISO-8859
This is the ISO-8859 standard for Celtic languages. It includes a UK pound sign.
Also called: Latin9
Languages: Latin
This is the (provisional) replacement for ISO-8859-1. It removes some less-used symbols and adds French and Finnish letters. It also replaces the international currency sign with a Euro sign (Latin1 lacks a euro sign, hence the popularity of Microsoft's CP1252).
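The euro situation can be checked with a short sketch using Python's standard codecs; only the byte values shown are relied upon.

```python
# The same byte is the currency sign in Latin1 and the euro sign in Latin9.
print(bytes([0xA4]).decode("iso8859_1"))
print(bytes([0xA4]).decode("iso8859_15"))

# CP1252 keeps the currency sign at 0xA4 and puts the euro at 0x80 instead.
print("\u20ac".encode("cp1252"))

# Latin1 simply has no euro sign.
try:
    "\u20ac".encode("iso8859_1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
print(latin1_has_euro)
```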
Also called: Latin2
Languages: Latin
See also:ISO-8859
8-bit character set for central and eastern (non-Cyrillic) Europe.
Also called: Latin3
Languages: Latin
See also:ISO-8859
The ISO-8859 8-bit character set for Esperanto, Maltese, and Turkish.
Also called: Latin4
Languages: Latin
See also:ISO-8859
The ISO-8859 8-bit character set for Baltic languages.
Languages: Cyrillic
See also:ISO-8859 KOI-8
The ISO 8859 8-bit character set for Cyrillic. It consisted of a rearrangement of the characters in ISO-IR-111 into non-KOI-style positions (i.e. from ASCII-compatible to alphabetic order). However, due to non-Russian Cyrillic characters
being inserted in odd places, the ordering is not actually alphabetically correct, so this particular ISO-8859 standard seems not to be used much.
Languages: Arabic
See also:ISO-8859
The ISO-8859 8-bit Arabic character set.
Also called: ELOT-928, Latin/Greek
Languages: Greek
See also:ISO-8859
The ISO-8859 8-bit Greek character set.
Also called: CP1255
Languages: Hebrew
See also:ISO-8859
The ISO-8859 8-bit Hebrew character set. Microsoft's CP1255 keeps the Hebrew letters at the same code points but adds vowel points and other characters.
Also called: ECMA-Cyrillic
Languages: Cyrillic
See also:KOI-8
This set is a compromise between ISO-8859 (specifically, ISO-8859-5) and the KOI family of Cyrillic character sets. It kept the KOI character order for Russian letters, and added Ukrainian, Byelorussian, and other non-Russian characters
in the empty code points.
Also called: Baudot
Languages: Latin
The 'International Telegraph Alphabet 2' was used on some extremely early computer equipment.
Languages: Japanese
JEF (Japanese Enhanced Feature) is an encoding system for kanji used on Fujitsu systems.
Also called: JIS-Roman, JISCII
Languages: Japanese
The oldest Japanese character set standard, JIS X 0201 contains two groups of characters: JIS-Roman and half-width katakana.
The main difference between JIS-Roman and ASCII is that in JIS-Roman there is a yen symbol instead of a backslash. This oddity persists in modern-day fonts, to the point where a yen sign
is regarded by many as an acceptable alternative glyph for the backslash character.
Half-width katakana are the minimal set of katakana used in ATMs, with the consonant strength markers as separate characters. A small number of Japanese punctuation characters
is included in the katakana area.
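The separate consonant markers can be sketched as follows. Python has no standalone JIS X 0201 codec that I would rely on here, so the example uses the `shift_jis` codec, which carries the half-width katakana as single bytes; that substitution is an assumption of the sketch.

```python
import unicodedata

raw = b"\xb6\xde"  # two single bytes: half-width KA, then the voicing mark
halfwidth = raw.decode("shift_jis")

# Two separate characters come out: the kana and its marker.
print([unicodedata.name(ch) for ch in halfwidth])

# Compatibility normalization folds them into one full-width katakana GA.
fullwidth = unicodedata.normalize("NFKC", halfwidth)
print(fullwidth, len(fullwidth))
```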
Also called: JIS C 6226
Languages: Japanese
See also:Mojikyo JIS X 0213
Basic Japanese character set with kana, Greek, Roman and Cyrillic characters. Contains 6,355 kanji, divided into two levels. Kanji in the first level are arranged according to reading, while the second level kanji are arranged by radical and stroke count.
JIS X 0208 has been a troubled standard. Originally published in 1978 as the error-packed JIS C 6226, it was four years before a correct version (renamed to JIS X 0208) could be produced. Politics, and the desire to create a shiny new simplified Japanese rather than to reflect actual needs, played a large role in the
standard and left it unable to represent many common pre-war characters and variants. This problem was then propagated to Unicode (interestingly, most users seem to blame Unicode rather than JIS now) and is still being worked through today.
Apart from the problems caused by trying to suppress older characters and variants in the name of modernity, JIS standards also suffer from competition between the three Japanese government ministries that have a role in setting language standards (the ministries of justice, industry, and culture).
JIS X 0208 has now been supplemented by other standards that do contain necessary older variants (i.e. JIS X 0213), but there is still a notable lack of any government-sponsored attempt to create a Japanese character set that actually describes the language. Academic projects such as GT Code and Mojikyo
contain enough characters to represent the classics, but are not really designed for general information processing.
Languages: Japanese
See also:JIS X 0208
Contains 5,801 kanji and over 200 other characters which supplement the JIS X 0208 set.
Languages: Japanese
See also:JIS X 0208
The JIS X 0213 standard adds old forms, variant forms, and in particular many kanji used in personal and place names to its predecessor, JIS X 0208. This makes it the first JIS standard to
have a repertoire of han characters that can actually be used to write most Japanese names. Because of these added characters, JIS X 0213 is difficult to map to Unicode -- round trip
conversion is only possible with the addition of 61 new characters to Unicode.
The difficulty in making JIS X 0213 work with Unicode illustrates the problems that can be caused by 'han unification'. The following sequence of events happens all too often:
- The Unicode standard defines a character
- Due to unification, this Unicode character actually encompasses several distinct entities (glyphs, characters, or variants).
- A need arises to actually write something correctly, using a particular glyph or variant.
- A new character has to be added to Unicode to represent this particular thing.
- There is now one character whose set of possible representations/interpretations is a subset of that of another character.
- Round-trip conversion between encodings, sorting and matching become difficult and users become confused.
It is the opinion of this humble writer that a slightly more sensitive approach to han unification in the beginning would have made this situation much rarer.
Languages: Japanese
This standard is the same as ISO 10646-1 (Unicode) in terms of character repertoire. However, the JIS standard defines some subsets:
- Basic Japanese (JIS X 0208 plus JIS X 0212)
- Non-ideographs supplement
- Ideographs supplement 1
- Ideographs supplement 2
- Ideographs supplement 3
- fullwidth alphanumerics
- halfwidth alphanumerics
Languages: Korean
Johab is a way (specified in KS C 5601) of describing any possible combined hangul character with two bytes (actually, with 15 bits: five each for the initial, medial, and final components). It is not used directly, but the range of hangul Johab describes often forms part of other specifications.
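The 15 bits can be picked apart with a little bit-shifting. The sketch below relies on the `johab` codec in Python's standard library; the function name is hypothetical, and the field values are Johab's own 5-bit component codes, not positions in the hangul alphabet.

```python
def johab_fields(syllable: str):
    """Split the Johab code of one hangul syllable into its bit fields."""
    hi, lo = syllable.encode("johab")  # a syllable encodes to two bytes
    value = (hi << 8) | lo
    flag = value >> 15                 # top bit marks a hangul code
    initial = (value >> 10) & 0x1F     # 5 bits: leading consonant
    medial = (value >> 5) & 0x1F       # 5 bits: vowel
    final = value & 0x1F               # 5 bits: trailing consonant (or filler)
    return flag, initial, medial, final

for syllable in "한글":
    print(syllable, johab_fields(syllable))
```

Because each component has its own field, any combination of initial, medial and final can be written down, including syllables that have no precombined code point in KS C 5601.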
Also called: EBCDIC
Languages: Japanese
See also:IBM EBCDIC
The Japanese version of EBCDIC contains half-width katakana instead of lowercase Roman letters. Since there are far more katakana than Roman letters, the layout of characters is very odd and most versions probably do not have sensible EBCDIC 'folding' behaviour. It is extremely difficult
to imagine a purpose for which this character set is well suited, even by EBCDIC standards.
| Japanese EBCDIC (Revised) |
charset |
Also called: EBCDIC
Languages: Japanese
See also:IBM EBCDIC
This particular EBCDIC variant seems to completely abandon the 'folding' concept common to most EBCDIC variants. Instead, upper and lowercase Roman characters are arranged as in 'classic' EBCDIC and katakana are packed in around them to fill most of the available space, although there are one or two blanks.
Many IBM product lines stuck with the older Japanese EBCDIC version which has no compatibility with this one.
Languages: Japanese
Hitachi KEIS is used on Hitachi mainframe systems to represent Japanese kanji. Fullwidth alphanumeric characters apparently are compatible with a version of EBCDIC.
| KOI-8 Alternative |
charset |
Also called: KOI-8 Alternativny
Languages: Cyrillic
See also:KOI-8R
KOI-8 Alternative is an 8-bit Cyrillic character set in which Russian Cyrillic characters are encoded in alphabetical order starting at 128. It is the 'non-KOI-compatible KOI-8'.
Microsoft's CP866 is based on this set.
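Python does not ship a codec for KOI-8 Alternative itself, so the sketch below uses `cp866` as a stand-in to show the alphabetical layout starting at 128, with `koi8_r` for contrast. Treat it as an illustration of the layout idea only.

```python
letters = "АБВГД"  # the first five letters of the Russian alphabet

alphabetical = list(letters.encode("cp866"))
koi_style = list(letters.encode("koi8_r"))

# In CP866 the codes ascend from 0x80 in alphabet order.
print([hex(b) for b in alphabetical])

# In KOI8-R the same letters follow their Latin look-alikes instead.
print([hex(b) for b in koi_style])
```

With the alphabetical layout a plain byte-wise sort puts Russian words in roughly dictionary order, which is the main argument made for it.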
Languages: Cyrillic
See also:KOI-8R
This character set is an attempt to unify KOI-8R, KOI-8RU, and ISO-IR-111. It includes all Cyrillic letters, Russian or otherwise, and fills up the remaining 8-bit space with graphics characters. It is the only KOI-8 flavor to include all Cyrillic characters.
Also called: RELCOM KOI-8R
Languages: Cyrillic
See also:KOI-8R
This is a version of KOI-8R which supports five extra Ukrainian and Byelorussian characters, which replace some of the graphics characters of KOI-8R.
Also called: Ukrainian KOI-8U, RFC 1489
Languages: Cyrillic
See also:KOI-8R KOI-8RU
This Ukrainian version of KOI-8R adds the missing Ukrainian character 'ghe with upturn', which had been suppressed by Stalin due to his general dislike of Ukraine. Unlike KOI-8RU it does not include any Byelorussian or other non-Russian characters.
The KOI-8RU, KOI-8U, and KOI-8 Unified character sets often seem to get mixed up in people's minds (perhaps also in mine).
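One quick way to tell at least two of these apart is to test for 'ghe with upturn'. This sketch uses the `koi8_r` and `koi8_u` codecs from Python's standard library; the helper name `can_encode` is my own.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if the codec can represent every character in text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

ghe_upturn = "\u0491"  # CYRILLIC SMALL LETTER GHE WITH UPTURN

print(can_encode(ghe_upturn, "koi8_r"))  # plain KOI8-R lacks the letter
print(can_encode(ghe_upturn, "koi8_u"))  # KOI8-U has it, as a single byte
```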
Also called: KS X 1001
Languages: Korean
The basic Korean character set. Contains 4,888 hanja (han characters used in Korean) and 2,350 precombined hangul (Korean phonetic) characters. Also contains Greek and Cyrillic letters.
This character set has the peculiarity that any hanja with more than one reading appears once per reading. This makes it probably the only character set to intentionally multiply the number of han characters.
This standard defines (but does not actually add to the character set) the 'johab' system for specifying any possible combined hangul character.
Also called: KS-Roman
Languages: Korean
The Korean Standard (KS) version of ASCII. Identical to ASCII except that the dollar sign is replaced with a won sign.
Languages: Korean
A supplement to KS C 5601, this character set includes extra hanja, extra precombined hangul, and accented European (Latin and Greek) characters.
Languages: Korean
This set has the same repertoire of Korean characters as ISO 10646 (Unicode), and supersedes KS C 5601 and KS C 5657.
Also called: EBCDIC
Languages: Latin
See also:IBM EBCDIC
It appears that following the release of the ISO-8859 standards, IBM created a corresponding set of EBCDIC standards that represent different European regions, generally by taking a version of 'original' EBCDIC and filling in the extra characters from an ISO-8859 character set in order in the empty code points.
I confess that I have never actually seen any data presented like this, except possibly for the ISO-8859-1 variant.
Languages: Any
See also:MARC-8 MARC-21 UKMARC
MARC standards, including the Library of Congress' MARC-8 and MARC-21 and the British Library's UKMARC, primarily specify record formats used for bibliographic data. They often also specify character sets and encodings.
Languages: Any
See also:MARC-8 MARC UTF-8
MARC-21 is a collection of bibliographic record formats used by the US Library of Congress. It also specifies a character repertoire for use in these records. The characters may be encoded using
either UTF-8 or MARC-8. Since MARC-21 contains a subset of the Unicode characters, this is an example of UTF-8 being used to encode something other than Unicode.
It is important to bear in mind that the MARC-21 repertoire is a true repertoire, i.e. a list of possible characters, and not a coded character set. The code point used for a character varies according
to whether UTF-8 or MARC-8 encoding is used.
Languages: Any
See also:MARC-21 MARC
MARC-8 is a variable length character encoding used by the Library of Congress in the USA. Characters are either 8-bit or, for CJK, 24 bit. Escape sequences (consisting of control characters) are used to switch between character sets.
MARC-8 specifies several 8-bit coded character sets, e.g. for Greek, Cyrillic and graphics characters. These sets leave room not only for the character-set switching control characters, but for some control characters that have meaning in
MARC-21 records (0x88, 0x89, 0x8d, and 0x8e).
These sets form part of the MARC-21 character repertoire.
| MacCentralEuropean |
charset |
Languages: Latin
This is the 8-bit Central European character set used on Macs. It seems to represent all Roman characters used in Central European languages, without being exactly the same as ISO-8859-2.
Languages: Latin
This is the 8-bit character set traditionally used on Macs. Its repertoire of 223 characters matches neither the standard Roman set (ISO-8859-1) nor the de facto standard, Microsoft's CP1252. MacRoman omits superscript numbers, fractions, and the letters 'eth' and 'thorn' (no longer used in English but useful in Scandinavia), and includes
a number of mathematical symbols, plus the Apple logo.
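The omissions and additions can be spot-checked with Python's `mac_roman` codec. The helper name `can_encode` and the sample characters are assumptions of this sketch; note that Python maps the Apple logo byte to a private-use code point, since Unicode has no Apple logo.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if the codec can represent every character in text."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Eth, thorn, a superscript and a fraction: in Latin-1 but not in MacRoman.
for ch in "\u00f0\u00fe\u00b9\u00bd":
    print(repr(ch), can_encode(ch, "iso8859_1"), can_encode(ch, "mac_roman"))

# Mathematical symbols: in MacRoman but not in Latin-1.
for ch in "\u2211\u221a":
    print(repr(ch), can_encode(ch, "iso8859_1"), can_encode(ch, "mac_roman"))

# The Apple logo byte decodes to a private-use code point.
apple = bytes([0xF0]).decode("mac_roman")
print(hex(ord(apple)))
```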
No attempt has been made in this character set list to describe the various MacRoman variants as they rarely seem to have names.
| Microsoft Code Pages |
charset | encoding | family |
Also called: CP, Codepage
Languages: Any
See also:CP932 CP1252 CP437 CP850 CP1253 CP936 CP950
Microsoft uses the term 'code page' rather vaguely to mean a character set or a character set plus an encoding. Since the days of DOS, MS has defined code pages for many locales and languages. Because MS systems have often been first to provide a given level of internationalization,
Microsoft code pages have often become de facto standards. The character repertoire of a code page is frequently formed by taking an existing standard and adding whatever extra characters are most in demand.
Code pages are designated as 'CP[number]' where the number is pretty well random. Some important code pages are:
- CP932: The Japanese code page, which specifies Microsoft's version of the Shift-JIS encoding.
- CP1252: The Latin-1 code page, usually used instead of the equivalent ISO-8859 standard (ISO-8859-1) because it has a euro symbol.
- CP437: The original DOS character set.
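A short sketch of how much depends on the code page number, using the `cp1252`, `cp437` and `cp932` codecs from Python's standard library (the sample byte and character are arbitrary).

```python
raw = bytes([0x80])

# The same byte means different things under different code pages.
as_cp1252 = raw.decode("cp1252")
as_cp437 = raw.decode("cp437")
print(repr(as_cp1252), repr(as_cp437))

# CP932 is a multibyte code page: one kana takes two bytes.
kana = "あ".encode("cp932")
print(kana, len(kana))
```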
Languages: CJK, Chinese, Korean, Japanese, Vietnamese
Mojikyo is a product developed by the Mojikyo Institute in Japan for representing rare and scholarly Chinese characters. Technically more of a glyph set than a character set, it has an important role because there are many characters that can only be represented electronically using Mojikyo. In addition to 'regular' han characters, Mojikyo includes:
- Chu Nom (Vietnamese versions of han characters)
- The Shui and Tangut scripts
- Divination symbols
- Korean-made 'han' characters
The Mojikyo institute provides fonts and character-finding software as well as defining a repertoire of characters (or glyphs). In fact, without this character-finding software many of the rare variants in Mojikyo would be well nigh impossible to specify.
Unlike Unicode, Mojikyo does not attempt to 'unify' han characters, which means, for instance, that the traditional, simplified, Japanese, and Korean versions of a character are all separate entities in Mojikyo. This has both advantages and disadvantages relative to Unicode, depending on the task. Mojikyo also considers all historical, classical, regional and obsolete versions of a character
to be separate entities, which is tremendously useful for many literary purposes (and also a bit more predictable than Unicode). Mojikyo associates a certain amount of meta-information, e.g. stroke count and radical, with each character, which allows the software component of Mojikyo to easily find characters and variants.
Although not as suited to general-purpose computing as Unicode, Mojikyo seems likely to remain an important tool for academic, linguistic and decorative use far into the future.
| NeXT International Code |
charset |
Also called: NeXTSTEP, NIC
Languages: Latin
This character set, based on ISO-8859-1, was used in NeXTSTEP computers. It has a couple of extra characters in the high half and a couple of seemingly
random changes.
Also called: GOST-19768-87, GOST-19768
Languages: Cyrillic
See also:KOI-8
In 1987, GOST-19768 became the new Russian government standard character set and abandoned the 'KOI Property' (decipherability when reduced to ASCII). Fierce debate has raged ever since about whether KOI-8 derived character sets are better than
those, like GOST-19768, that have characters in dictionary order.
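The 'KOI Property' is easiest to appreciate by stripping the high bit yourself. The sketch below uses Python's `koi8_r` codec; since Python has no GOST-19768 codec, `iso8859_5` stands in as a convenient alphabetically ordered set, which is an assumption of the example. The function name is hypothetical.

```python
def strip_high_bit(data: bytes) -> str:
    """Simulate a 7-bit channel by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data).decode("ascii")

word = "Русский"

# KOI8-R degrades into a case-swapped Latin transliteration.
koi_result = strip_high_bit(word.encode("koi8_r"))
print(koi_result)  # rUSSKIJ

# An alphabetically ordered set degrades into unrelated ASCII characters.
print(strip_high_bit(word.encode("iso8859_5")))
```

The price of this trick is that the KOI letters cannot be in dictionary order, which is exactly the trade-off the debate is about.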
Also called: EBCDIC
Languages: Latin
See also:IBM EBCDIC
The original, canonical form of IBM's EBCDIC character set defined an uppercase area, a punctuation area, and a lowercase area, arranged so that the uppercase could be 'folded' over the lowercase, leaving a character set with uppercase and punctuation.
This design was spoilt by several factors such as the embedding of four punctuation characters (hidden when 'folded') among the upper case characters, and the inclusion of a couple of fairly random characters in some unused higher code points.
Typically of EBCDIC, even 'original' EBCDIC is available in various dialects that differ in the placement of the punctuation marks.
Also called: Commodore ASCII
Languages: Latin
PETSCII is the version of ASCII used on Commodore computers of the 1980s. It was based on the 1963 ASCII standard (e.g. it had a left arrow instead of an underscore) but added a large number of graphics characters. It appeared in various flavors on the Commodore PET and the C64. Since the C64 is the most widely-distributed general purpose computer in history (so far), it presumably follows that
PETSCII is a widely-used character set, despite the fact that it is rather obscure. Luckily, a well-defined PETSCII to Unicode mapping exists.
Languages: Japanese
See also:DEC Kanji
This encoding is an extension of DEC Kanji which can represent the characters of JIS X 0212 as well as the DEC Kanji range. It is (or was) used on systems made by DEC.