Updated: Fri, 2006-08-25 09:34

Introduction

The purpose of this list is this: given the name of a character set, find out a little bit about it.
For each character set, the following information is stored:
  • The 'main' name of the character set. Where possible, this is the name of the standard it is defined in.
  • Other names, nicknames, and aliases by which the set is known.
  • Whether the 'character set' is actually a character set or an encoding system, or both. In some cases the entry is for a family of character sets.
  • Names of related character sets
  • A description of the character set
There is a table of contents which contains 'real' names and aliases.

Issues

This document takes a very vague view of what constitutes a character set, and uses terms roughly according to normal average usage rather than according to correct usage. For instance, what is called a 'character set' throughout is actually almost always a 'coded character set', a distinction the ISO themselves manage to forget not infrequently.

For a (basic) discussion of character-set related issues and terms, my Unicode tutorial could be useful.

This document presumably contains errors and leaves out countless character sets and encoding systems. Please get in touch with me to correct my failings.


Contents:

All names and nicknames in alphabetical order:
7-bit ISO 2022
ACII
ANSI X 3.4
ANSI Z39.64
ASCII
ASCII-1963
ASCII-1968
Atari ASCII
ATASCII
Adobe Standard Encoding
Augmented EBCDIC
BLECS
BTRON
Baudot
Big5
Big5+
Big5-ETen
Big5-GCCS
Big5-HKSCS
British Library Exchange Character Set
C-DAC
CCCII
CNS 11643
CP
CP1250
CP1251
CP1252
CP1253
CP1254
CP1255
CP1256
CP1257
CP1258
CP437
CP737
CP775
CP850
CP852
CP855
CP857
CP860
CP861
CP862
CP863
CP864
CP865
CP866
CP869
CP874
CP932
CP936
CP949
CP950
Codepage
Commodore ASCII
Cyrillic EBCDIC
DBCS
DBCS Host
DBCS PC
DBCS-EUC
DEC Kanji
DEC-MCS
DG-International
DGI
DIS-8859-5
DOSArabic
DOSCanadaF
DOSGreek2
DOSHebrew
DOSIcelandic
DOSNordic
DOSPortuguese
DoD 8-bit Code
DosBaltRim
DosCyrillic
DosCyrillicRussian
DosGreek
DosLatin1
DosLatin2
DosLatinUS
DosTurkish
EACC
EBCDIC
ECMA-6
ECMA-Cyrillic
ELOT-927
ELOT-928
EUC
EUC-CN
EUC-JP
EUC-KR
EUC-TW
Early ASCII
FIELDATA
GB 12050
GB 12052
GB 13000-1
GB 13000-1.93
GB 13134
GB 16959
GB 18030
GB 1988
GB 2312
GB 2312-80
GB 7589
GB 7590
GB 8045
GB Internal Code
GB-Roman
GB/T 12345
GB/T 12345-90
GB/T 13131
GB/T 13131-9X
GB/T 13132
GB/T 13132-9X
GB1
GB2
GB3
GB5
GBK
GCCS
GOST-13052
GOST-19768
GOST-19768-87
GT Code
GT Font
GTCode
HK SCS-200
HP-Roman
HP-Roman8
HZ
HZ-GB-2312
IBM CP 437
IBM DBCS
IBM DBCS-EUC
IBM EBCDIC
IBM Modern Greek
IS 13194
ISCII
ISFOC
ISO-10646
ISO-2022
ISO-2022-CN
ISO-2022-CN-EXT
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
ISO-646
ISO-8859
ISO-8859-1
ISO-8859-10
ISO-8859-11
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-IR-111
ITA2
International ASCII
JEF
JIS C 6226
JIS X 0201
JIS X 0208
JIS X 0212
JIS X 0213
JIS X 0221
JIS-Roman
JISCII
JOHAB
Japanese EBCDIC
Japanese EBCDIC (Revised)
KEIS
KOI-7
KOI-8
KOI-8 Alternative
KOI-8 Alternativny
KOI-8 Unified
KOI-8R
KOI-8RU
KOI-8U
KS C 5601
KS C 5636
KS C 5657
KS C 5700
KS X 1001
KS-Roman
Latin/Greek
Latin1
Latin10
Latin2
Latin3
Latin4
Latin8
Latin9
Localized EBCDIC
MARC
MARC-21
MARC-8
MacCentralEuropean
MacRoman
Microsoft Code Pages
Mojikyo
NIC
NeXT International Code
NeXTSTEP
New GOST-19768
OEM437
Original EBCDIC
Original GOST-19768
PC-ISCII
PETSCII
REACC
RELCOM KOI-8R
RFC 1489
RFC 1642
SJIS
Shift-JIS
Super DEC Kanji
TCVN5712-1
TIS-1074
TIS-620
TRON
TRON Code
TRS-80 Character Set
UCS-2
UCS-4
UHC
UKMARC
UTF-16
UTF-2
UTF-32
UTF-7
UTF-8
Ukrainian KOI-8U
Unicode
Unified Hangul Code
VISCII
VN1
VN2
VN5712-1
VNCII
VPS
VSCII
WinArabic
WinGreek
zW

7-bit ISO 2022  encoding  family 

Also called: ISO-2022
Languages: Any
See also:ISO-2022-JP ISO-2022-JP-2 ISO-2022-KR ISO-2022-CN ISO-2022-CN-EXT

As well as defining the EUC family of encodings for Unix, the ISO 2022 standard defines a set of 7-bit encodings presumably intended for mainframes. These include Japanese, Korean, and Chinese encodings.

7-bit ISO 2022 is an extensible type of encoding, of which certain specializations for particular languages are actually used (e.g. ISO-2022-JP for Japanese). ISO 2022 encoding has the following properties:

  • The encoding can represent many character sets, which may be 1-byte or 2-byte
  • Escape sequences mark the transition from one character set to another. If you need to use a given character set in a 7-bit ISO 2022 encoding, there must be an ISO escape sequence registered for that set.
  • The encoding respects the space traditionally reserved for control characters, so there are 94 possible 1-byte characters and 94*94 possible 2-byte characters. Character sets larger than this cannot be used.
A large number of escape sequences is registered, but still not so many that every useful character set can be ISO 2022 encoded. Escape sequences should begin 'ESC (' for single-byte sets and 'ESC $ (' for double-byte sets but this rule does not seem to be followed particularly closely.
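
To make the escape-sequence mechanism concrete, here is a minimal sketch using Python's built-in `iso2022_jp` codec, one specialization of the family. The sample text and the helper name `show_iso2022_jp` are illustrative choices of mine, not anything taken from the standard. Note that the sequence this codec emits for JIS X 0208 is 'ESC $ B', which does not follow the 'ESC $ (' pattern.

```python
ESC = b"\x1b"

def show_iso2022_jp(text: str) -> bytes:
    """Encode text as ISO-2022-JP and return the raw bytes."""
    return text.encode("iso2022_jp")

encoded = show_iso2022_jp("abc \u65e5\u672c\u8a9e xyz")  # "abc 日本語 xyz"

# Every byte stays within 7 bits, as the encoding requires.
seven_bit = all(b < 0x80 for b in encoded)

# Splitting on ESC exposes the points where the character set changes:
# ASCII text, then a switch to a 2-byte set, then a switch back to ASCII.
segments = encoded.split(ESC)

print(encoded)
print(seven_bit, len(segments))
```

The bytes between the two escape sequences look like ordinary ASCII punctuation and letters, which is exactly why a decoder has to track the current state.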

Designed for use in emails, the ISO-2022 family of encodings would be a good starting point for anyone wondering why the world needed Unicode.


ACII  charset 

Also called: PC-ISCII
Languages: Pan-Indian
See also:ISCII

ACII is a variant of ISCII designed for increased compatibility with PC 8-bit character sets that contain graphics characters. ACII includes some box-drawing characters at the expense of some of the less popular ISCII characters (e.g. digits).
ANSI X 3.4  charset 

Also called: ASCII, ASCII-1968, ECMA-6
Languages: Latin
See also:ASCII-1963

This is the good old ASCII we know and love. The ANSI X 3.4 standard specified not only the ASCII character set but a whole series of rules relating to representation on punched tape and so on, now mercifully forgotten.

ASCII made many improvements on ITA2 and FIELDATA, its predecessors. In particular, it included a large number of control codes, it tried to include a superset of all the characters available in the telegraphy character sets of the time, and it made at least some attempt to lay out punctuation in sensible blocks.

The original 1963 version of ASCII had no lowercase letters and a different array of control characters than the ASCII we are used to. The 1967 version of the standard created modern ASCII. Since most character sets since then have been designed with ASCII somewhere in mind, the various quirks of 1967 ASCII have become quirks not only of most character sets, but of the way computer engineers think of characters. Some of these quirks are:

  • ASCII contains not just characters in the linguistic sense, but characters which represent formatting information (e.g. vertical tab, linefeed)
  • Even more strikingly, ASCII contains not just 'content' characters such as text and formatting, but 'connection' characters intended for error checking and teleprinter control. This is a result of the way ASCII was used at the time and of the teletype legacy.
  • It was decided that accent marks would be represented in ASCII by designating certain characters as 'diacritical marks' which, when printed over a regular letter, would create a new letter, much as you might type a letter, go back over it and type an accent on a mechanical typewriter.
  • Certain other characters were designated 'national use' characters which could be replaced when desired by accented characters. Thus, despite its small size, ASCII managed to have two totally separate systems for producing accented characters.
ASCII is a 7-bit coded character set. When used on 8-bit computers, there is always the question of what to do with the extra 128 code points that become available if the 8th bit is used. Most attempts to represent European languages on computers have focused on assigning various characters to the 128 upper code points of ASCII, creating various ASCII-compatible 8-bit character sets. These sets are mostly standardized in the ISO-8859 standard.

Note on etiquette: 8-bit character sets based on ASCII should, theoretically, avoid assigning 'printing characters' (i.e. characters that are actual language characters as opposed to control codes) to code points that are the 8-bit equivalent of ASCII control characters. This is for the benefit of 7-bit machines and systems that may strip the 8th bit from a character. Some manufacturers have paid more attention to this rule than others, depending on how much they needed the extra code points and how much they cared about legacy 7-bit computer systems. Notably, most Microsoft encoding systems produce bytes that violate this rule, and many 8-bit character sets for languages with many accents (e.g. Vietnamese) assign code points in violation of this rule. Many would say that this is a good example of the point at which backwards compatibility becomes not worth maintaining.
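
A small sketch of why the rule matters, assuming a channel that simply clears the 8th bit of every byte. The helper names `strip_high_bit` and `is_ascii_control` are made up for this illustration; the codecs are Python's standard `cp1252` and `latin-1`.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the 8th bit of every byte."""
    return bytes(b & 0x7F for b in data)

def is_ascii_control(byte: int) -> bool:
    """True for the ASCII control range 0x00-0x1F and for DEL (0x7F)."""
    return byte < 0x20 or byte == 0x7F

# The euro sign in CP1252 sits at 0x80, whose 7-bit equivalent is NUL.
euro = "\u20ac".encode("cp1252")
stripped_euro = strip_high_bit(euro)
print(euro, stripped_euro, is_ascii_control(stripped_euro[0]))

# 'e acute' in ISO-8859-1 sits at 0xE9; stripped, it becomes a printable 'i'.
e_acute = "\u00e9".encode("latin-1")
stripped_e = strip_high_bit(e_acute)
print(e_acute, stripped_e, is_ascii_control(stripped_e[0]))
```

In the first case a printing character silently turns into a control code; in the second the text is merely wrong rather than dangerous.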


ANSI Z39.64  charset 

Also called: REACC, EACC
Languages: CJK, Chinese, Korean, Japanese
See also:CCCII

This standard was created by the Research Libraries Group (a non profit academic organization) based on CCCII. Structure is the same as CCCII but there are some corrections and tweaks:
  • Some rare characters and variants were dropped
  • Some characters are now considered mere variants
  • The simplified layer is now reserved for officially 'simplified' forms rather than any variant that just happens to be simpler than the main form
  • More kana, kokuji (Japan-only kanji) and hangul were added

ASCII-1963  charset 

Also called: Early ASCII
Languages: Latin


The original, 1963 version of ASCII specified only uppercase letters but was otherwise similar to modern ASCII. The range now occupied by lowercase letters was undefined. There were various other differences, especially in control characters and in the inclusion of left and up (but not right and down!) arrow symbols.
ATASCII  charset 

Also called: Atari ASCII
Languages: Latin


Atari ASCII was used on Atari's line of 8-bit home computers and was thus reasonably widespread in the 80s. It is a fairly unconventional and difficult ASCII variant. In particular:
  • With two exceptions, the upper 128 characters are simply the graphical inverse of the lower 128, and thus not really characters so much as alternate display forms. The exceptions are the end-of-line character and bell character, which both do have their high bit set.
  • ATASCII uses the control character area of ASCII (below 0x20) to hold graphics characters, and stores control characters elsewhere.
As with PETSCII, the character set incorporated terminal control codes including cursor movement, and it was thus possible to create animations consisting of an ATASCII string.
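
The inverse-video layout described above can be sketched as a one-line bit test. The two exception values used here (0x9B for end-of-line, 0xFD for the bell) are my assumption and are not stated in this entry; the function name `describe` is likewise hypothetical.

```python
ATASCII_EOL = 0x9B   # end-of-line (assumed value, not given in the text above)
ATASCII_BELL = 0xFD  # bell (assumed value, not given in the text above)

def describe(code: int) -> tuple[int, bool]:
    """Return (base character code, displayed in inverse video?)."""
    if code in (ATASCII_EOL, ATASCII_BELL):
        # The two exceptions are real characters with the high bit set.
        return code, False
    return code & 0x7F, bool(code & 0x80)

print(describe(0x41))  # plain 'A'
print(describe(0xC1))  # the same glyph, inverted
print(describe(0x9B))  # end-of-line, not an inverted character
```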

Adobe Standard Encoding  charset 

Languages: Latin


This is an 8-bit character set used by Adobe in PostScript. The lower half is as ever the same as ASCII; the upper half contains a scattering of typographic and accented characters.
Augmented EBCDIC  charset 

Also called: EBCDIC
Languages: Latin
See also:IBM EBCDIC

This is the name given to EBCDIC versions formed by taking the original EBCDIC dialect and filling in the other characters from ISO-8859-1. For some strange reason, all EBCDIC dialects formed in this way seem to leave code point 0x155 empty.

It appears that there are several Augmented EBCDIC dialects, one for each dialect of original EBCDIC. Presumably they differ in punctuation.


Big5  charset  encoding 

Languages: Chinese


Big5 is a character set originating in Taiwan, used to write traditional Chinese. The name comes from the five companies that collaborated to create it. It specifies just over 13,000 hanzi.

The term 'Big5' has been abused quite a lot over the years. The original Big5 character repertoire is no longer used and the name 'Big5' usually means one of the many extensions. The main extensions are Microsoft's CP950, Big5-ETen, and Big5-HKSCS.

Big5 is both a character set and an encoding. As an encoding, it is a DBCS encoding with lead bytes in the range 0xa1-0xf9 and trail bytes in the range 0x40-0xfe.
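
As a rough sketch of what 'DBCS encoding' means in practice, the function below walks a byte string and pairs each lead byte with the byte that follows it. It assumes lead bytes run from 0xA1 to 0xF9 and does no validation of trail bytes; `split_big5` is a name invented for this example, and the comparison uses Python's standard `big5` codec.

```python
def split_big5(data: bytes) -> list[bytes]:
    """Split Big5 bytes into single-byte and double-byte units."""
    units = []
    i = 0
    while i < len(data):
        if 0xA1 <= data[i] <= 0xF9 and i + 1 < len(data):
            units.append(data[i:i + 2])  # lead byte plus its trail byte
            i += 2
        else:
            units.append(data[i:i + 1])  # ASCII, or a stray byte
            i += 1
    return units

text = "A\u4e2d\u6587b"  # 'A', 中, 文, 'b'
units = split_big5(text.encode("big5"))
print(units)
```

Because trail bytes may fall in the ASCII range, the only safe way to find character boundaries is to scan from the start, as this function does.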


Big5+  charset  encoding 

Languages: Chinese


This is the largest extension of Big5. It is not currently well supported (if indeed it is supported at all), but it is theoretically the largest Big5 variant. Big5+ uses code points that clash with the GCCS and HKSCS Hong Kong oriented Big5 variants.
Big5-ETen  charset  encoding 

Also called: Big5
Languages: Chinese


An extension to Big5 by ETen Information Systems, Big5-ETen has kana, cyrillic characters and circled digits, although not at the same code points as other Big5 extensions. Big5-ETen is a superset of CP950.
Big5-HKSCS  charset  encoding 

Also called: HK SCS-200
Languages: Chinese


This extension to Big5 is a superset of CP950. It adds characters used in Hong Kong. In 1999 it replaced the HK government's older GCCS extension to Big5.
CCCII  charset 

Languages: CJK, Chinese, Korean, Japanese
See also:EACC

Developed in Taiwan, CCCII (stands for Chinese Character Code for Information Interchange) is a very comprehensive system for representing all characters found in all forms of Chinese, Japanese and Korean.

CCCII is composed of 94 planes, which are 94x94 code points. A 'layer' is made up of six planes. The layers are occupied as follows:

  • 1: Symbols and traditional han characters
  • 2: Simplified han characters
  • 3-12: Variant han character forms
  • 13: Japanese kana and kokuji (Japanese-only kanji)
  • 14: Korean hangul
  • 15: Reserved
  • 16: Misc. Korean and Japanese characters
The first twelve layers have a special relationship. A given code point corresponds to a given han character, no matter which layer is being used. The code point on layer 1 is occupied by the traditional form, on layer 2 the same code point is occupied by the simplified form, and on higher layers the variants are stored.
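
The layer arithmetic can be sketched as follows. This assumes planes and layers are both numbered from 1 and that layers are contiguous runs of six planes, which is my reading of the description above rather than something checked against the standard; the function names are hypothetical.

```python
PLANES_PER_LAYER = 6

def layer_of(plane: int) -> int:
    """Layer (1-based) containing a given plane (1-based, 1..94)."""
    if not 1 <= plane <= 94:
        raise ValueError("CCCII has planes 1..94")
    return (plane - 1) // PLANES_PER_LAYER + 1

def same_character_on_layer(plane: int, row: int, col: int, layer: int):
    """Move a code point to the matching position on another layer."""
    offset = (plane - 1) % PLANES_PER_LAYER  # position within its layer
    return ((layer - 1) * PLANES_PER_LAYER + offset + 1, row, col)

# A traditional form on plane 2 and its simplified counterpart on layer 2.
print(layer_of(2), same_character_on_layer(2, 10, 20, layer=2))
print(layer_of(94))
```

Under these assumptions 94 planes give fifteen full layers plus a short sixteenth, which matches the sixteen layers listed.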

CCCII has some structure within the layers, as well. Within layer one, hanzi are divided into three groups based on rarity. Within layer two, there is a distinction between PRC-specific simplifications and generic simplifications.

CCCII is ideal for bibliographic and scholarly purposes but not much used elsewhere. It also has some problems, including many repeated characters and the fact that no commodity software can read it. EACC is a subset of CCCII that removes many problematic characters.


CNS 11643  charset  encoding 

Languages: Chinese


The CNS (Chinese National Standard) character set and encoding system is an extremely comprehensive system for representing Traditional Chinese, encoded with 3 bytes per character.

The original 1986 standard defined 16 planes of 94x94 characters each. Planes 1 and 2 contained Big5 characters (but not in Big5 order), plane 14 contained user characters.

The 1992 standard filled in many other planes. 1 and 2 are still Big5, but the others are as follows:

  • 3: About 6000 hanzi from the original plane 14
  • 4: About 7000 rare hanzi, and some hanzi from Unicode that were not included in Big5
  • 5: 8600 rare hanzi
  • 6: 6388 variant forms with up to 14 strokes
  • 7: 6539 variant forms with over 14 strokes, perhaps the most nightmarishly difficult set of han characters ever encoded
An eighth plane of about 7000 even more abstruse characters is thought to be under development.

Although CNS 11643 is the national standard of Taiwan, Big5 is much more common in practice.


CP1250  charset 

Languages: Latin


Microsoft's extension of ISO-8859-2. Mysteriously, its number is lower than that of Microsoft's ISO-8859-1 variant.
CP1251  charset 

Languages: Cyrillic
See also:KOI-8R

This is the Microsoft code page for Cyrillic. It is available in two flavors, Standard and Russian. Both focus on including the largest possible number of Cyrillic characters, even more than KOI-8 Unified. Both character ordering and graphics characters are sacrificed but the result is the richest repertoire of Cyrillic characters in any 8-bit character set. The 'Russian' flavor includes accented characters.

CP1251 seems to have replaced KOI-8R as the most common Cyrillic character set.


CP1252  charset 

Languages: Latin


Microsoft extension of ISO-8859-1 (Latin1). Has a euro symbol.
CP1253  charset 

Also called: WinGreek
Languages: Greek
See also:ISO-8859-7

The Microsoft extension to ISO-8859-7 has some troubling incompatibilities; notably, the capital-alpha-with-tonos is on a different code point. As with many Microsoft code pages, code points in the 0x80 to 0x9f control character range have been assigned to printable characters. Tut.
CP1254  charset 

Languages: Latin, Turkish


The Microsoft code page for Turkish. Based on ISO-8859-9.
CP1256  charset 

Also called: WinArabic
Languages: Arabic


This is Microsoft's modified version of ISO-8859-6. There are considerable differences, in that Microsoft try to preserve ISO-8859-1 compatibility by putting accented letters and symbols in their Latin1 positions and then filling in Arabic characters around them.
CP1257  charset 

Languages: Latin


The Microsoft code page for Baltic languages. Based on ISO-8859-4.
CP1258  charset 

Languages: Vietnamese
See also:VISCII

This is the Microsoft character set/encoding for Vietnamese. It is based on the TCVN5712 standard but with some minor changes, perhaps so as to be more compatible with Latin1.
CP437  charset 

Also called: DosLatinUS, OEM437, IBM CP 437
Languages: Latin


The character set used by American DOS versions, specified by IBM. It included a few accent marks and many, many graphics characters. In particular, it included graphics characters below 0x20 (traditionally the control code area).
CP737  charset 

Also called: DosGreek
Languages: Greek


The DOS Greek 8-bit character set.
CP775  charset 

Also called: DosBaltRim
Languages: Latin


The DOS Baltic character set.
CP850  charset 

Also called: DosLatin1
Languages: Latin


Later DOS versions used CP850 instead of CP437. CP850 had the Latin1 (ISO-8859-1) repertoire, but positioned so as to be compatible with CP437.
CP852  charset 

Also called: DosLatin2
Languages: Latin


The Latin2 repertoire, ordered so as to be compatible with the original DOS character set, CP437.
CP855  charset 

Also called: DosCyrillic
Languages: Cyrillic
See also:KOI-8

The original cyrillic character set for DOS, ordered so as to be compatible with the original DOS character set, CP437. It was therefore not KOI-8 compatible.
CP857  charset 

Also called: DosTurkish
Languages: Latin


The DOS Turkish character set.
CP860  charset 

Also called: DOSPortuguese
Languages: Latin


The DOS Portuguese character set.
CP861  charset 

Also called: DOSIcelandic
Languages: Latin


The DOS Icelandic character set.
CP862  charset 

Also called: DOSHebrew
Languages: Hebrew


The DOS Hebrew character set.
CP863  charset 

Also called: DOSCanadaF
Languages: Latin


The DOS French Canadian character set.
CP864  charset 

Also called: DOSArabic
Languages: Arabic


The DOS Arabic character set.
CP865  charset 

Also called: DOSNordic
Languages: Latin


The DOS Scandinavian character set.
CP866  charset 

Also called: DosCyrillicRussian
Languages: Cyrillic
See also:KOI-8 Alternative CP855

A KOI-8 Alternativny based set of cyrillic characters used on DOS. It replaced the KOI-incompatible CP855.
CP869  charset 

Also called: DOSGreek2, IBM Modern Greek
Languages: Greek
See also:CP737

This alternative DOS Greek standard replaced the earlier CP737 with a repertoire and ordering based on IBM usage.
CP874  charset 

Languages: Thai


Microsoft's CP874 code page is based on TIS-620, the usual 8-bit Thai set, but adds some extra characters in unused code points.
CP932  charset  encoding 

Also called: Shift-JIS, SJIS
Languages: Japanese


CP932 is Microsoft's favored way of representing Japanese (at least up until the rise of XML and UTF-*). It is a combination of the JIS X 0201 and JIS X 0208 character sets together with an encoding system whereby all the 8-bit code points that do not represent half-width katakana are used as lead bytes for kanji.

Unlike EUC-JP, Shift-JIS is not ASCII compatible (code points that should be control codes are used as lead bytes), nor is it particularly simple to process. It is also impossible to represent the JIS X 0212 kanji in this encoding scheme.
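
The following sketch shows the byte layout using Python's standard `cp932` codec. The classifier is deliberately simplified: it treats every byte at or above 0x80 that is outside 0xA1-0xDF as a lead byte, which glosses over a few unassigned values. The name `classify_cp932` is invented for this example.

```python
def classify_cp932(data: bytes) -> list[str]:
    """Label each unit of CP932 data by the kind of byte that starts it."""
    labels = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            labels.append("ascii")
            i += 1
        elif 0xA1 <= b <= 0xDF:
            labels.append("half-width katakana")  # one byte each
            i += 1
        else:
            labels.append("double-byte")  # lead byte, then a trail byte
            i += 2
    return labels

# 'A', half-width katakana A (U+FF71), and the kanji U+8868.
sample = "A\uff71\u8868".encode("cp932")
labels = classify_cp932(sample)
print(sample, labels)
```

The kanji chosen here has 0x5C, the ASCII backslash, as its trail byte, which is a good illustration of why naive byte-oriented processing of Shift-JIS text goes wrong.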


CP936  charset  encoding 

Languages: Chinese


This is Microsoft's favored Chinese encoding/character set combination. It is an extension of EUC-CN that covers all Unicode han characters.
CP949  charset  encoding 

Also called: UHC, Unified Hangul Code
Languages: Korean


This is Microsoft's favored way of representing Korean. It is a derivative of EUC-KR, extended to include all johab precomposed hangul. Like other East Asian Microsoft encodings, it allows ASCII trail bytes, and lead bytes in the control code range, thus losing a form of ASCII compatibility.
CP950  charset  encoding 

Also called: Big5
Languages: Chinese


CP950 is Microsoft's version of Big5, usually referred to as 'Big5' in Microsoft environments. It incorporates various extensions to the original Big5 character set.

CP950 defines a block of characters in the range 0xF9D6-0xF9FE. All the other Big5 extensions keep this range of characters and are therefore supersets of CP950. This makes CP950 a common choice when deciding what variant of Big5 to support, although the official Taiwanese standard would be CNS 11643 encoded as EUC-TW.


Cyrillic EBCDIC  charset 

Also called: EBCDIC
Languages: Cyrillic
See also:IBM EBCDIC

Cyrillic EBCDIC abandons the lowercase Roman letters to make way for a rather abbreviated list of Cyrillic characters. Other characters are then added in the punctuation area, and the whole thing is structured so that when 'folded' in IBM punched card style, the resulting character set contains upper case Cyrillic and some (but not all) punctuation.
DEC Kanji  encoding 

Languages: Japanese
See also:Super DEC Kanji

This encoding system was developed by Digital Equipment Corporation to represent Japanese. It can encode the JIS X 0201 and JIS X 0208 character sets. DEC Kanji also allowed about 2000 user-defined characters. It is obsolete compared to Super DEC Kanji.
DEC-MCS  charset 

See also:ISO-8859-1

DEC-MCS was the 'Multinational Character Set' used in DEC's VT220 terminals. It formed the basis of, and is a subset of, the more famous ISO-8859-1 set. The Latin letters eth and thorn, the international currency symbol, and a couple of other punctuation marks were added to make ISO-8859-1.
DG-International  charset 

Also called: DGI


The DG-International character set was formerly used with the DG Interactive Cobol environment. It included ASCII and 69 extra characters.
DIS-8859-5  charset 

Also called: KOI-8R, RFC 1489
Languages: Cyrillic
See also:KOI-8

KOI-8R was a character set proposed in the 1980s by the Demos company. It was based on KOI-8 but replaced non-Russian characters with graphics characters, and added the 'dotted e' character which was missing from KOI-8. The former change was not terribly popular but the latter was necessary, so many KOI-8 variants and hacks were made that included a 'dotted e'. The term KOI-8R often seems to be used to mean 'KOI-8 with a dotted e'. KOI-8R was, and may still be, the most widely used Cyrillic character set.
ELOT-927  charset 

Languages: Greek
See also:ISO-8859-7

This Greek character set was uppercase-only. It was superseded by ELOT-928, which in turn was standardized as ISO-8859-7.
EUC  encoding  family 

Languages: CJK, Japanese, Chinese, Korean
See also:EUC-JP EUC-TW EUC-CN EUC-KR

The EUC encoding systems are a group of encodings for CJK character sets. They were defined in ISO-2022 for use in 8-bit systems (i.e. Unix as opposed to mainframes).

EUC stands for Extended Unix Code and the system has been primarily used on Unix. EUC encodings allow the use of four 'code sets', of which the first is always the local equivalent of ASCII (e.g. JIS X 0201 for Japanese encoding). The other three may be unused or may correspond to a particular character set that is being EUC encoded.

The four flavors of EUC encoding in use are EUC-JP, EUC-CN, EUC-KR, and EUC-TW.


EUC-CN  encoding 

Languages: Chinese
See also:EUC

This is the EUC encoding for simplified Chinese. The code sets are:
  • Set 1: GB 1988
  • Set 2: GB 2312
  • Set 3: unused
  • Set 4: unused

EUC-JP  encoding 

Languages: Japanese
See also:EUC

This is the EUC encoding for Japanese. The code sets are assigned as follows:
  • Set 1: JIS X 0201 (i.e. Roman)
  • Set 2: JIS X 0208
  • Set 3: Half-width katakana
  • Set 4: JIS X 0212
The presence of half-width katakana in this encoding (although not as part of any common character set) extends its repertoire to be equivalent to that of Microsoft's Shift-JIS. EUC, however, is better behaved in that it does not use control-character codes illegally.
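
To see how the four code sets listed above differ on the wire, this sketch encodes one character from each using Python's standard `euc_jp` codec. The sample characters are my own picks, and the single-byte prefixes that show up for sets 3 and 4 (0x8E and 0x8F) are a detail of EUC that this entry does not spell out.

```python
samples = {
    "ASCII letter": "A",
    "JIS X 0208 kanji": "\u6f22",      # a common kanji
    "half-width katakana": "\uff71",   # half-width katakana A
    "JIS X 0212 kanji": "\u4e02",      # a supplementary kanji
}

encoded = {name: ch.encode("euc_jp") for name, ch in samples.items()}

for name, data in encoded.items():
    # Show each encoding as space-separated hex bytes.
    print(f"{name}: {data.hex(' ')}")
```

Only the first set uses bytes below 0x80, so ASCII text embedded in EUC-JP can always be picked out byte by byte.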

EUC-KR  encoding 

Languages: Korean
See also:EUC

This is the Korean EUC encoding. The code sets are:
  • Set 1: KS C 5636 (Roman)
  • Set 2: KS C 5601
  • Set 3: unused
  • Set 4: unused

EUC-TW  encoding 

Languages: Chinese
See also:EUC

This is the EUC encoding for traditional (Taiwanese) Chinese. The code sets are:
  • Set 1: ASCII
  • Set 2: CNS 11643 Plane 1
  • Set 3: CNS 11643 Planes 1-16
  • Set 4: unused
Code set 2 takes less space to encode in EUC, so the duplication of CNS 11643 Plane 1 allows common characters to be represented more concisely.

FIELDATA  charset 

Also called: DoD 8-bit Code
Languages: Latin
See also:ASCII

FIELDATA is a character set used in the Cold War-era US military. It was in some ways the ancestor of ASCII. Over 128 code points, it distributes upper and lowercase Roman letters, a rather miserly allocation of punctuation, the numerals, and a large number of control codes (although in fact FIELDATA predates the concept of a code point).

FIELDATA may still be in use in some 1960s-era computers.


GB 12050  charset 

Languages: Chinese, Uighur


This character set contains 70 primary and 72 supplementary characters for writing the Uighur script, an Arabic-derived script.
GB 12052  charset 

Languages: Chinese, Korean


The official PRC standard for the Korean script.

This set is identical to the Chinese basic set (GB 2312) up until row 9 -- in other words, latin, greek, kana, bopomofo and pinyin characters are all the same. The sole exception is that the currency sign is not a yuan sign (nor even a won sign) but a dollar sign.

In subsequent rows, about 5000 pre-combined hangul are defined, although the ordering is unlike Korean standards. There are also 94 hanja, which are 'idu', ancient han-character-based phonetic characters, rather than the kind used in Korea today.


GB 13000-1  charset 

Also called: GB 13000-1.93
Languages: Chinese


This is the Chinese version of ISO 10646 (Unicode). It is identical to the ISO specification and is kept in sync with it.
GB 13134  charset 

Languages: Chinese, Yi


This character set is a double-byte 94x94 representation of the Yi script (an ideographic script used in Sichuan, Yunnan, Guizhou and Guangxi).
GB 16959  charset 

Languages: Chinese, Tibetan


This character set includes 169 Tibetan letters, digits, symbols, and control codes. The symbols include astronomical and mathematical symbols. Both Tibetan characters and the characters used to indicate Sanskrit transliteration are included, so the total character repertoire is likely as large as the number of surviving Tibetans.
GB 18030  charset 

Languages: Chinese


Until recently, han characters added to Unicode were added to the GB 13000 standard (i.e. the Chinese reflection of the Unicode standard) and to GBK, the character set for normal use. However, GBK ran out of code points and was unable to represent the 6,582 han characters of CJK Unified Ideographs Extension A when those characters arrived in Unicode 3.0.

GB 18030 was therefore created to represent all the Unicode 3.0 hanzi. It is compatible with GBK and the now-aged GB 2312 set, yet covers all Unicode code points. It is not yet as widely used as the older sets, however.
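
A quick way to see the gap that GB 18030 fills, using Python's standard `gbk` and `gb18030` codecs. The test character, U+3400, is simply the first code point of Extension A and was chosen by me for illustration.

```python
ext_a = "\u3400"  # first character of CJK Unified Ideographs Extension A

# GBK has no code point for this character, so encoding should fail.
try:
    ext_a.encode("gbk")
    in_gbk = True
except UnicodeEncodeError:
    in_gbk = False

# GB 18030 falls back to a longer byte sequence for such characters.
as_gb18030 = ext_a.encode("gb18030")

print(in_gbk, len(as_gb18030), as_gb18030.hex(" "))
```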


GB 1988  charset 

Also called: GB-Roman
Languages: Chinese


The ASCII variant of mainland China, identical to ASCII but for the dollar sign, which is replaced with a yuan sign.

GB stands for Guo Biao (National Standard), and indicates an official People's Republic of China standard.


GB 2312  charset 

Also called: GB Internal Code, GB 2312-80
Languages: Chinese


GB 2312 is the basic Simplified Chinese character set. It has a strong resemblance to JIS X 0208, the basic Japanese character set. In particular, it includes kana, greek, and cyrillic characters in the same area, and divides han characters into two levels, with level 1 arranged by reading and level 2 ordered by radical and stroke count.

GB 2312 may be represented in either 7-bit form or 8-bit form, depending on whether compatibility with 7-bit systems is more important than distinguishing Chinese characters from ASCII characters. If GB 2312 is being represented in 8-bit form, the high bit of each byte is set to 1. This effectively creates a new character set. The combination of this set with ASCII is known as 'GB Internal Code'.
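
The relationship between the two forms is a single bit per byte, as this sketch shows. Python's `gb2312` codec produces the 8-bit form; the helper names `to_7bit` and `to_8bit` are invented here.

```python
def to_8bit(seven_bit: bytes) -> bytes:
    """Set the high bit of each byte (7-bit form to 8-bit form)."""
    return bytes(b | 0x80 for b in seven_bit)

def to_7bit(eight_bit: bytes) -> bytes:
    """Clear the high bit of each byte (8-bit form to 7-bit form)."""
    return bytes(b & 0x7F for b in eight_bit)

eight = "\u4e2d".encode("gb2312")  # the hanzi U+4E2D in the 8-bit form
seven = to_7bit(eight)

print(eight, seven)
```

In the 7-bit form the character is indistinguishable from two ASCII letters, which is why that form needs some out-of-band signal such as escape sequences.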

GB 2312 is usually encoded in either the HZ or EUC-CN systems.


GB 7589  charset 

Also called: GB2
Languages: Chinese


A set of 7237 supplementary hanzi for GB 2312. Also known as GB2.
GB 7590  charset 

Languages: Chinese


A set of 7039 supplementary hanzi for GB 2312. Also known as GB4.
GB 8045  charset 

Languages: Chinese, Mongolian


This character set contains 94 characters representing the post-Revolution Mongolian alphabet. This is the last vertical-only writing system left, and is distantly descended from Sanskrit via Uighur. It has been extensively normalized in recent times.
GB/T 12345  charset 

Also called: GB1, GB/T 12345-90
Languages: Chinese


GB 12345 is the traditional equivalent of the simplified character set GB 2312. It is used for representing traditional mainland Chinese as opposed to traditional Taiwanese Chinese. It is also called 'GB1'.
GB/T 13131  charset 

Also called: GB3, GB/T 13131-9X
Languages: Chinese


The traditional Chinese version of GB 7589. Also known as GB3.
GB/T 13132  charset 

Also called: GB5, GB/T 13132-9X
Languages: Chinese


The traditional Chinese version of GB 7590. Also known as GB5.
GBK  charset  encoding 

Also called: CP936
Languages: Chinese


GBK is both a character set and an encoding. As a character set, it is a superset of GB 2312, and includes traditional as well as simplified hanzi. The encoding is variable length, with one byte for ASCII and two for GBK characters.

GBK was created because of a need to include the extra Unicode characters from GB 13000 in a GB 2312 compatible coded character set. Therefore, in GBK the characters of GB 2312 occupy their original code points and the GB 13000 characters are fitted in around them.
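
Both properties, the variable length and the GB 2312 compatibility, can be checked with Python's standard `gbk` and `gb2312` codecs. The sample characters below are my own choice: one ASCII letter, one hanzi that GB 2312 contains, and one traditional hanzi that it lacks.

```python
text = "A\u4e2d\u9f8d"  # 'A', U+4E2D (in GB 2312), U+9F8D (traditional)

# One byte for ASCII, two bytes for each GBK character.
lengths = [len(ch.encode("gbk")) for ch in text]

# A GB 2312 character keeps its original code point in GBK.
same_code = "\u4e2d".encode("gbk") == "\u4e2d".encode("gb2312")

# The traditional character exists in GBK but not in GB 2312.
try:
    "\u9f8d".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(lengths, same_code, in_gb2312)
```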

Microsoft's CP936 is actually another name for GBK.


GCCS  charset  encoding 

Also called: Big5-GCCS
Languages: Chinese


This extension to Big5 was developed by the Hong Kong government (it stands for Government Chinese Character Set). It introduced Japanese kana, some simplified hanzi and variant glyphs, and most importantly Hong Kong placenames to Big5. It is now superseded by Big5-HKSCS.
GOST-13052  charset 

Languages: Cyrillic


GOST-13052 was an old Russian Cyrillic 7-bit character set. Being 7-bit it had to store characters on top of the ASCII range. Ingeniously, Cyrillic letters were assigned code points in such a way as to correspond to ASCII letters of the opposite case. Thus, when GOST text was viewed as ASCII, it was just barely understandable, and could be distinguished from ASCII by the fact that words tended to start with a lowercase letter and continue with uppercase ones.

The property of a Cyrillic character set being readable when viewed as ASCII, or easily transformed into ASCII, persisted in the KOI-* family of Cyrillic coded character sets.


GOST-19768  charset 

Also called: KOI-7, KOI-8, Original GOST-19768
Languages: Cyrillic
See also:GOST-13052 New GOST-19768

The GOST-19768 standard defined two character sets, KOI-7 and KOI-8. KOI-7 is a 7-bit character set that included only capital Roman letters and has not had much impact on history.

KOI-8 became the basis of more than 20 years of Cyrillic character sets. It was an 8-bit set, with ASCII characters in the low half and Cyrillic in the high half. It had the property, inherited from the earlier GOST-13052 character set, that stripping the high bit from the Cyrillic characters would make them somewhat readable as ASCII.

KOI-8 was often used in a slightly extended form, with the 'dotted e' character added at points 0xa3 and 0xb3. This character had been left out in GOST-19768.
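
The high-bit trick can be tried out with a minimal Python sketch. I am assuming that Python's `koi8_r` codec (a later descendant, not the original GOST-19768 table) keeps the same letter layout; the function name `strip_high_bit` is made up for the example.

```python
# Sketch: Python's "koi8_r" codec stands in for the original KOI-8 (an assumption).
def strip_high_bit(data: bytes) -> str:
    """Clear bit 7 of every byte and read the result as ASCII."""
    return bytes(b & 0x7F for b in data).decode("ascii")

word = "Привет"                  # Russian for "hello"
koi = word.encode("koi8_r")
folded = strip_high_bit(koi)
print(folded)                    # opposite-case Latin lookalikes

# The 'dotted e' in its two cases, as placed by the extended form.
yo_lower = "ё".encode("koi8_r")
yo_upper = "Ё".encode("koi8_r")
print(yo_lower.hex(), yo_upper.hex())
```

Note how the folded word starts with a lowercase letter and continues in uppercase, exactly the telltale sign described for GOST-13052.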

The 1987 version of GOST-19768 changed the character ordering completely and has a separate entry in this list.


GTCode  charset 

Also called: GT Code, GT Font
Languages: Japanese
See also:Mojikyo

GT Code is a coded character set for Japanese kanji, which also contains a large amount of meta information to assist in kanji searching and categorizing. Like the similar but more widespread Mojikyo, GT Code is more of a glyph set than a character set in many ways. GT Code contains about 70,000 entries, far more than Unicode but less than Mojikyo. Unlike Mojikyo, however, GT Code contains only kanji, so it may be the largest set of kanji electronically collected.

GT Code is a product of the Tokyo University Multilingual Research Society. It is intended more as a database of information about characters than as a way of representing text in bulk.


HP-Roman8  charset 

Also called: HP-Roman
Languages: Latin


This 8-bit ASCII-compatible character set was used by Hewlett-Packard on their HPUX OS and HPTerm terminals. It contains various Western European accented characters.
HZ

Also called: HZ-GB-2312
Languages: Chinese


HZ is a system usually used to encode GB 2312-80, or one of its many variants. It is exactly like ISO 2022 7-bit encoding, but the 'escape sequences' that are characteristic of that kind of encoding are strings of ASCII characters instead. Specifically, the tilde is used as an escape character.
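
A small Python sketch of the idea, using the `hz` and `gb2312` codecs from the standard library. The relationship shown (GB 2312 bytes with the high bit cleared, wrapped in tilde sequences) is what I would expect from the description above; treat it as an illustration rather than a specification.

```python
# Illustration only: Python's built-in "hz" and "gb2312" codecs.
text = "中文"
hz = text.encode("hz")
gb = text.encode("gb2312")

# Inside the tilde-delimited section, each byte is the GB 2312 byte
# with its high bit cleared, so the whole stream is 7-bit clean.
inner = bytes(b & 0x7F for b in gb)

print(hz)
print(all(b < 0x80 for b in hz))
```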
IBM DBCS  encoding 

Also called: DBCS, DBCS PC, DBCS Host
Languages: All


IBM DBCS is the double-byte system used on many IBM systems (those that aren't restricted to EBCDIC). There are two very different flavors:
  • DBCS-PC: In practice, this system represents Japanese as Shift-JIS and Korean as EUC-KR.
  • DBCS-Host: This uses markers to shift between 1 and 2 byte modes, and can represent any set of characters with 16 bit code points.
DBCS-PC actually specifies only the double-byte part of a multibyte (i.e. variable length characters) encoding system. The user has to pick a single-byte character set to use with IBM DBCS.

DBCS-Host uses EBCDIC as the character set for single-byte characters.


IBM DBCS-EUC  encoding 

Also called: DBCS-EUC
Languages: All


IBM developed DBCS-EUC for representing CJK characters on AIX. It is closely related to EUC encoding.
IBM EBCDIC  charset  family 

Also called: EBCDIC
Languages: Latin
See also:Localized EBCDIC Original EBCDIC Augmented EBCDIC Japanese EBCDIC Cyrillic EBCDIC

EBCDIC is an encoding, or rather a large family of related encodings, used by IBM. EBCDIC is 8-bit, but unlike most 8-bit encodings it does not have a lower half similar to ASCII and an upper half customized for local needs. Rather, characters are placed according to the historical needs of punched card machines. This results in the Roman alphabet being stored in several non-contiguous regions.
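
To see the non-contiguous alphabet for yourself, here is a minimal Python sketch. It uses `cp037`, one EBCDIC dialect that Python happens to ship; other dialects may differ in detail, and the helper `contiguous_runs` is my own invention.

```python
import string

def contiguous_runs(codes):
    """Group a sequence of code points into (first, last) contiguous runs."""
    runs = []
    for c in codes:
        if runs and c == runs[-1][1] + 1:
            runs[-1][1] = c
        else:
            runs.append([c, c])
    return [tuple(r) for r in runs]

# cp037 is just one EBCDIC dialect, used here as an example.
ebcdic_runs = contiguous_runs(string.ascii_uppercase.encode("cp037"))
ascii_runs = contiguous_runs(string.ascii_uppercase.encode("ascii"))

print([(hex(a), hex(b)) for a, b in ebcdic_runs])  # three separate runs
print([(hex(a), hex(b)) for a, b in ascii_runs])   # one run
```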

EBCDIC is legendary for its complexity, its multitude of incompatible dialects, and the way almost every implementation cheerfully ignores most relevant rules.

The original non-contiguous character layout of EBCDIC was rooted in the idea that the two halves of the character set, when superimposed, could form a smaller yet still useful character set. Many later EBCDIC versions break this requirement.

EBCDIC stands for 'Extended Binary Coded Decimal Interchange Code', a name that seems to make sense until you read it again more slowly.


IS 13194  charset 

Also called: ISCII
Languages: Pan-Indian


Indian Script Code For Information Interchange (ISCII) emerged in 1991 as the standard 8-bit character set for India. India's wealth of languages has always posed unique challenges, and the effect on ISCII has been that unlike other national 8-bit standards, this character set has a very strong distinction between characters and glyphs. The high (non-ASCII) half of ISCII specifies about 80 characters which can be used with Devanagari or other glyphs to write various languages. Many code points in ISCII do not map directly to a displayed glyph but are interpreted according to the glyph set being used. For example, there are meta-characters that indicate a bare vowel or an 'alternative' glyph, the interpretation of these terms being left up to the software displaying the text.

In many ways ISCII is more like a code that requires an interpreter to render it into readable glyphs than like a conventional character set.

ISCII can be used with the following glyph sets:

  • Devanagari
  • Gujarati
  • Gurmukhi
  • Oriya
  • Bengali
  • Assamese
  • Telugu
  • Kannada
  • Malayalam
  • Tamil
...which is also the range of Indian scripts available in Unicode. Because the Indian area of Unicode is based on ISCII, some of the dummy characters and metacharacters that were needed in ISCII are now enshrined in Unicode, even though they do not correspond to any actual language entity. This is one of the problems with the way Unicode was first compiled...

ISCII is perhaps the most interesting and ingenious of the 8-bit ASCII-based character sets. It is also the hardest to use because the renderer must resolve many ligatures, diacritical marks and other things that are only hinted at in the ISCII byte stream, *and* do so for more than one glyph set!


ISFOC  charset  encoding 

Also called: C-DAC
Languages: Pan-Indian
See also:ISCII

Although the original ISCII standard was able to represent most Indian languages intelligibly, its limited number of code points could not express, even with metacharacters, the amount of information needed for Indian language processing, leaving most decisions at the mercy of the text rendering agent -- usually the font. C-DAC (the Centre for Development of Advanced Computing, an Indian government R&D organisation) therefore developed ISFOC, which standardises the rendering of the text and also serves as an encoding scheme and character set, thus eliminating the role of ISCII.

ISFOC (Intelligence Based Script Font Code, an acronym that doesn't seem to fit very well) is a coded character set containing all the basic elements required for rendering an Indian script. ISFOC 'characters' are not characters or even linguistically recognizable entities of any kind; they are elements which are combined jigsaw-style to build up a glyph. Separate ISFOC sets exist for the different Indian scripts, and sets exist for scripts like Tibetan that are not covered by ISCII. However, they are unified by the fact that algorithms (ISFA) are defined to convert each one to and from ISCII.

ISFOC allows 188 entities per script, which is not enough to display some scripts optimally. It also only allows the display of one script at once. Because of this, and because the full repertoire of ISFOC and ISCII is in Unicode, Unicode will probably eventually become the most popular way to represent Indian language text.


ISO-10646  charset 

Also called: Unicode
Languages: Any
See also:UTF-7 UTF-8 UTF-16 UTF-32 UCS-2 UCS-4

The development of the mighty ISO-10646 or 'Unicode' character set is perhaps the most significant development ever in internationalization. The aim of Unicode is nothing less than to contain every character in the world, and while there are many well-discussed flaws in Unicode it is already an invaluable character set for many languages. With Unicode the dream of being able to process text without thinking about the particular keyboard it was typed in from took a step closer to reality.

Unicode is managed by the Unicode Consortium, a large and diverse group that is something like the opposite of the World Wide Web Consortium, in that it puts out standards at an annoyingly slow rate.

Although Unicode is primarily a character set, the Unicode standard actually contains a wealth of other information, including character types and widths and normalization data. This latter is very important because of the high level of duplicate characters, variant characters, and combining characters in Unicode.
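
Why normalization matters can be shown in a few lines of Python using the standard `unicodedata` module. The choice of 'é' as the sample character is mine; any letter with a combining accent would do.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# The two strings look the same on screen but compare unequal...
print(precomposed == combining)

# ...until both are brought to the same normalization form.
nfc = unicodedata.normalize("NFC", combining)     # compose
nfd = unicodedata.normalize("NFD", precomposed)   # decompose
print(nfc == precomposed, nfd == combining)
```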

Unicode has various problems, which can be briefly summarized thus:

  • The original strategy was to include existing character sets in Unicode wholesale. This results in many duplicate characters, or characters that are non-linguistic but were included in earlier character sets for convenience.
  • For the same reason, many characters that had obscure technical uses in their original character sets are present in Unicode even though they have no meaning outside their original set.
  • Some groups of characters, notably hanzi/hanja/kanji, were 'unified' meaning that variants deemed to be the same root character were given the same code point. This caused various problems, especially with Japanese names, and as a result more blocks of characters containing variants had to be added later. This has resulted in a very vague notion of what constitutes a 'character' in CJK ideographs, and a certain amount of bad feeling.
  • Unicode includes both combining characters (accents and base characters separately) and pre-combined characters. This means that some text can be represented in many, many ways in Unicode and makes normalization a huge and difficult enterprise.
  • Although it was originally stated that Unicode would store characters, and only characters, not glyphs or other entities, in practice there is no strong distinction between characters, variant characters, glyphs, and variant glyphs. This is especially true of CJK ideographs.
  • The process of adding new groups of characters to Unicode is very, very slow, and mistakes (as with Runic unification) are difficult to ever correct.
Despite these issues, most would agree that Unicode is a tremendously useful tool. Furthermore, the Unicode standard specifies a number of encodings, at least one of which is suitable for practically any environment, be it 7-bit mainframes, 8-bit Unix, or modern environments such as Java or .NET.
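
As a quick illustration of the 'one encoding per environment' point, this Python sketch encodes a single character (the euro sign, my arbitrary choice) with several of the Unicode encodings available in Python's standard library.

```python
# Illustration only: one code point, several Unicode encodings.
text = "€"  # U+20AC
names = ("utf-7", "utf-8", "utf-16-be", "utf-32-be")
encoded = {name: text.encode(name) for name in names}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes  {data!r}")

# UTF-7 is the one that survives a 7-bit channel.
utf7_is_7bit = all(b < 0x80 for b in encoded["utf-7"])
```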

Those scripts that did not develop any character set or encoding standards before the advent of Unicode will almost certainly wind up using Unicode. This includes Khmer and the African scripts (Tifinagh, Ethiopic etc) as well as many historical scripts. Support for cuneiform and Linear B is, however, sadly still far away.

Because the process for adding new ranges of characters to ISO-10646 is extremely slow, there are large numbers of scripts whose most formal computer representation is as an 'Annex' to the ISO-10646 standard. In some cases these Annexes resemble independent character sets (which are waiting to have Unicode code points allocated to them and thus to become coded character sets).


ISO-2022-CN  encoding 

Languages: Chinese
See also:7-bit ISO 2022 ISO-2022-CN-EXT

This specialization of 7-bit ISO-2022 encoding is used for Chinese. Like ISO-2022-KR, regular ISO escape sequences are eschewed in favor of shift sequences. These shift sequences do not toggle the character stream from one set to another, but are used before every single character. The following character sets are supported:
  • ASCII
  • GB 2312
  • CNS 11643 Plane 1
  • CNS 11643 Plane 2
Thus both simplified and traditional characters can be represented.

Any line on which a character from a given set (other than ASCII) appears must be marked with a 'designator sequence' indicating that set. This is vaguely similar to ISO-2022-KR, except that in KR the designator need only appear once per file.

In sum, ISO-2022-CN bears no resemblance to the theoretical generic ISO-2022 encoding or to anything that a sane waking human could be expected to imagine. This sort of encoding is the reason that, with all its faults, we should be very very grateful for Unicode.


ISO-2022-CN-EXT  encoding 

Languages: Chinese
See also:7-bit ISO 2022 ISO-2022-CN

This ISO-2022 encoding extends ISO-2022-CN by adding support for about a dozen more character sets, including GB 12345, planes 3 to 7 of CNS 11643, and GB 7590. Each of these character sets has an ISO-registered designation sequence, as demanded by the rules of ISO-2022-CN.
ISO-2022-JP  encoding 

Languages: Japanese
See also:7-bit ISO 2022

This specialization of 7-bit ISO-2022 encoding is used for Japanese. The permitted character sets are
  • ASCII
  • JIS X 0201
  • JIS C 6226
  • JIS X 0208
This is a very limited set indeed, which is why ISO-2022-JP-2 is used instead.

ISO-2022-JP-2  encoding 

Languages: Japanese
See also:7-bit ISO 2022 ISO-2022-JP

This specialization of 7-bit ISO-2022 encoding is used for Japanese. It includes more character sets than the earlier ISO-2022-JP standard, to wit:
  • JIS X 0212
  • GB 2312
  • KS C 5601
In other words, it includes common Chinese and Korean characters as well as Japanese ones. This standard was introduced before Chinese and Korean had ISO-2022 encodings of their own.

ISO-2022-KR  encoding 

Languages: Korean
See also:7-bit ISO 2022

This specialization of 7-bit ISO-2022 encoding is used for Korean. It permits only two character sets (ASCII and KS C 5601). Furthermore, it defines a 'designator sequence', an escape sequence that must appear in any document in which non-ASCII characters occur, before the first non-ASCII character. Furthermore, the 'escape sequences' used to switch between ASCII and Korean characters are not actually escape sequences (they do not start with an escape character).
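
The designator and the non-escape 'shift' characters can be seen with a short Python sketch using the standard `iso2022_kr` codec. The constant names are mine; the shift characters are the Shift Out and Shift In control codes.

```python
# Illustration only: Python's built-in "iso2022_kr" codec.
DESIGNATOR = b"\x1b$)C"  # announces KS C 5601
SHIFT_OUT = b"\x0e"      # switch to Korean (not an escape sequence)
SHIFT_IN = b"\x0f"       # switch back to ASCII

text = "한글"
data = text.encode("iso2022_kr")

print(data)
print(data.index(DESIGNATOR), data.index(SHIFT_OUT))
```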

These changes reflect the needs of email systems, and ISO-2022-KR has been in widespread use since 1991 in Korean emails.


ISO-646  charset 

Also called: International ASCII
Languages: Latin


The ISO-646 standard specified several national versions of ASCII, i.e. ASCII-like 7 bit character sets. These generally replaced the less-used characters in traditional ASCII with accent marks, local currency symbols, and what have you. All character sets specified in ISO-646 are made obsolete by those in ISO-8859 (which in turn should really be considered obsoleted by ISO-10646, Unicode).
ISO-8859  charset  family 

Languages: Any
See also:ISO-8859-1 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-5 ISO-8859-6 ISO-8859-7 ISO-8859-8 ISO-8859-9 ISO-8859-10 ISO-8859-11 ISO-8859-12 ISO-8859-13

ISO 8859 is a standard that specifies a large number of 8-bit character sets. The lower (7-bit) half of each set is ASCII. The upper half contains a set of characters suited for a particular range of languages; for instance ISO 8859-2 handles central and eastern European languages that use Roman characters.
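
A minimal Python sketch of the 'shared lower half, differing upper half' design, using four of the ISO 8859 codecs bundled with Python. The byte value 0xE6 is an arbitrary pick of mine.

```python
# Illustration only: the same byte read under four ISO 8859 parts.
HIGH_BYTE = b"\xe6"
PARTS = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7")

high_half = {name: HIGH_BYTE.decode(name) for name in PARTS}
low_half = {name: b"Hello".decode(name) for name in PARTS}

for name in PARTS:
    print(name, repr(high_half[name]), repr(low_half[name]))
```

The ASCII text comes out identical everywhere, while the high byte is a different letter in each part.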

Because ISO 8859 character sets are small, they do not handle CJK (Chinese/Japanese/Korean) characters. There is an ISO 8859 standard for practically every other script, though, with more still under consideration. ISO 8859 has been an important standard for Latin languages (i.e. those using Roman characters) in particular, but is probably losing ground to Unicode now, since Unicode makes it possible to represent all the characters in all ISO 8859 sets at once.


ISO-8859-1  charset 

Also called: Latin1
Languages: Latin
See also:ISO-8859

This is the ubiquitous Latin-1 character set. It handles all western European languages, and as an added bonus it also handles all African languages except Bantu languages.
ISO-8859-10  charset 

Also called: Latin6
Languages: Latin
See also:ISO-8859

This is derived from ISO-8859-4; it drops Latvian support and adds Lapp and Icelandic support, thus becoming the ISO-8859 charset for Scandinavia.
ISO-8859-11  charset 

Also called: TIS-620
Languages: Thai


TIS-620 is the Thai character set used in Thailand (other Thai dialects may be represented in different character sets). It is in the process of being approved as ISO-8859-11.

TIS-620 is an 8-bit set of which the low half is of course ASCII. The baht (currency) sign is put in the high half, rather than replacing the ASCII dollar.

All Thai characters are also present in Unicode and a TIS-620 to Unicode mapping is not difficult.
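
Just how easy the mapping is can be sketched in Python. The function below is my own simplification: it assumes every high byte is an assigned Thai character and applies a fixed offset into the Unicode Thai block, ignoring the handful of unassigned code points. The result is checked against Python's own `tis_620` codec.

```python
def tis620_to_unicode(data: bytes) -> str:
    """Simplified sketch: ASCII passes through, Thai bytes get a fixed offset.

    Assumes the input contains only assigned TIS-620 code points.
    """
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append(chr(b - 0xA0 + 0x0E00))
    return "".join(out)

text = "ภาษาไทย ฿100"
encoded = text.encode("tis_620")
print(tis620_to_unicode(encoded) == text)
print("฿".encode("tis_620").hex())  # the baht sign lives in the high half
```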


ISO-8859-13  charset 

Languages: Latin
See also:ISO-8859

This is the (provisional) ISO-8859 standard for the Baltic. It has the Latvian characters that were lost in Latin6.
ISO-8859-14  charset 

Also called: Latin8
Languages: Latin
See also:ISO-8859

This is the ISO-8859 standard for Celtic languages. It includes a UK pound sign.
ISO-8859-15  charset 

Also called: Latin9
Languages: Latin


This is the (provisional) replacement for ISO-8859-1. It removes some less-used symbols and adds French and Finnish letters. It also replaces the international currency sign with a Euro sign (Latin1 lacks a euro sign, hence the popularity of Microsoft's CP1252).
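
The currency-sign swap is easy to check in Python with the standard `iso8859_1` and `iso8859_15` codecs. The helper `can_encode` and the sample letter 'œ' are my own choices.

```python
# Illustration only: the same byte in Latin1 and Latin9.
CURRENCY_SLOT = b"\xa4"
latin1 = CURRENCY_SLOT.decode("iso8859_1")    # international currency sign
latin9 = CURRENCY_SLOT.decode("iso8859_15")   # euro sign

def can_encode(ch, codec):
    """Hypothetical helper: True if ch exists in the given character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

print(repr(latin1), repr(latin9))
print(can_encode("œ", "iso8859_1"), can_encode("œ", "iso8859_15"))
```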
ISO-8859-2  charset 

Also called: Latin2
Languages: Latin
See also:ISO-8859

The ISO-8859 8-bit character set for central and eastern (non-Cyrillic) Europe.
ISO-8859-3  charset 

Also called: Latin3
Languages: Latin
See also:ISO-8859

The ISO-8859 8-bit character set for Esperanto, Maltese, and Turkish.
ISO-8859-4  charset 

Also called: Latin4
Languages: Latin
See also:ISO-8859

The ISO-8859 8-bit character set for Baltic languages.
ISO-8859-5  charset 

Languages: Cyrillic
See also:ISO-8859 KOI-8

The ISO 8859 8-bit character set for Cyrillic. It consisted of a rearrangement of the characters in ISO-IR-111 into non-KOI-style positions (i.e. from ASCII-compatible to alphabetic order). However, due to non-Russian Cyrillic characters being inserted in odd places, the ordering is not actually alphabetically correct, so this particular ISO-8859 standard seems not to be used much.
ISO-8859-6  charset 

Languages: Arabic
See also:ISO-8859

The ISO-8859 8-bit Arabic character set.
ISO-8859-7  charset 

Also called: ELOT-928, Latin/Greek
Languages: Greek
See also:ISO-8859

The ISO-8859 8-bit Greek character set.
ISO-8859-8  charset 

Also called: CP1255
Languages: Hebrew
See also:ISO-8859

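The ISO-8859 8-bit Hebrew character set. Microsoft's CP1255 is closely related: it keeps the Hebrew letters at the same code points but adds vowel points and other characters, so the two are not strictly identical.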
ISO-IR-111  charset 

Also called: ECMA-Cyrillic
Languages: Cyrillic
See also:KOI-8

This set is a compromise between ISO-8859 (specifically, ISO-8859-5) and the KOI family of Cyrillic character sets. It kept the KOI character order for Russian letters, and added Ukrainian, Byelorussian, and other non-Russian characters in the empty code points.
ITA2  charset 

Also called: Baudot
Languages: Latin


The 'International Telegraph Alphabet 2' was used on some extremely early computer equipment.
JEF  encoding 

Languages: Japanese


JEF (Japanese Enhanced Feature) is an encoding system for kanji used on Fujitsu systems.
JIS X 0201  charset 

Also called: JIS-Roman, JISCII
Languages: Japanese


The oldest Japanese character set standard, JIS X 0201 contains two groups of characters: JIS-Roman and half-width katakana.

The main difference between JIS-Roman and ASCII is that in JIS-Roman there is a yen symbol instead of a backslash. This oddity persists in modern-day fonts, to the point where a yen sign is regarded by many as an acceptable alternative glyph for the backslash character.

Half-width katakana are the minimal set of katakana used in ATMs, with the consonant strength markers as separate characters. A small number of Japanese punctuation characters is included in the katakana area.
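
The separate strength markers can be seen in a short Python sketch. As far as I know Python has no standalone JIS X 0201 codec, so I am using `shift_jis` as a carrier, since it stores half-width katakana as single bytes; the NFKC step is Unicode's way of folding the pair into one full-width character.

```python
import unicodedata

# Half-width KA followed by the half-width voiced sound mark (two characters).
halfwidth = "\uff76\uff9e"

# shift_jis is used here only as a carrier for the single-byte katakana.
encoded = halfwidth.encode("shift_jis")

# Compatibility normalization folds the pair into full-width GA.
composed = unicodedata.normalize("NFKC", halfwidth)

print(encoded.hex(), len(halfwidth), len(composed))
```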


JIS X 0208  charset 

Also called: JIS C 6226
Languages: Japanese
See also:Mojikyo JIS X 0213

Basic Japanese character set with kana, Greek, Roman and Cyrillic characters. Contains 6,355 kanji, divided into two levels. Kanji in the first level are arranged according to reading, while the second level kanji are arranged by radical and stroke count.

JIS X 0208 has been a troubled standard. Originally published in 1978 as the error-packed JIS C 6226, it was four years before a correct version (renamed to JIS X 0208) could be produced. Politics, and the desire to create a shiny new simplified Japanese rather than to reflect actual needs, played a large role in the standard and left it unable to represent many common pre-war characters and variants. This problem was then propagated to Unicode (interestingly, most users seem to blame Unicode rather than JIS now) and is still being worked through today.

Apart from the problems caused by trying to suppress older characters and variants in the name of modernity, JIS standards also suffer from competition between the three Japanese government ministries that have a role in setting language standards (the ministries of justice, industry, and culture).

JIS X 0208 has now been supplemented by other standards that do contain necessary older variants (i.e. JIS X 0213), but there is still a notable lack of any government-sponsored attempt to create a Japanese character set that actually describes the language. Academic projects such as GT Code and Mojikyo contain enough characters to represent the classics, but are not really designed for general information processing.


JIS X 0212  charset 

Languages: Japanese
See also:JIS X 0208

Contains 5,801 kanji and over 200 other characters which supplement the JIS X 0208 set.


JIS X 0213  charset 

Languages: Japanese
See also:JIS X 0208

The JIS X 0213 standard adds old forms, variant forms, and in particular many kanji used in personal and place names to its predecessor, JIS X 0212. This makes it the first JIS standard to have a repertoire of han characters that can actually be used to write most Japanese names. Because of these added characters, JIS X 0213 is difficult to map to Unicode -- round trip conversion is only possible with the addition of 61 new characters to Unicode. The difficulty in making JIS X 0213 work with Unicode illustrates the problems that can be caused by 'han unification'. The following sequence of events happens all too often:
  • The Unicode standard defines a character
  • Due to unification, this Unicode character actually encompasses several distinct entities (glyphs, characters, or variants).
  • A need arises to actually write something correctly, using a particular glyph or variant.
  • A new character has to be added to Unicode to represent this particular thing.
  • There is now one character whose set of possible representations/interpretations is a subset of that of another character.
  • Round-trip conversion between encodings, sorting and matching become difficult and users become confused.
It is the opinion of this humble writer that a slightly more sensitive approach to han unification in the beginning would have made this situation much rarer.

JIS X 0221  charset 

Languages: Japanese


This standard is the same as ISO 10646-1 (Unicode) in terms of character repertoire. However, the JIS standard defines some subsets:
  • Basic Japanese (JIS X 0208 plus JIS X 0212)
  • Non-ideographs supplement
  • ideographs supplement 1
  • ideographs supplement 2
  • ideographs supplement 3
  • fullwidth alphanumerics
  • halfwidth alphanumerics

JOHAB  charset 

Languages: Korean


Johab is a way (specified in KS C 5601) of describing any possible combined hangul character with two bytes: a high bit that is always set, followed by three 5-bit fields (15 bits in all) for the initial consonant, the vowel, and the final consonant. It is not used directly, but the range of hangul Johab describes often forms part of other specifications.
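
The bit layout can be pulled apart with a minimal Python sketch that uses the standard `johab` codec. The function name `johab_fields` is mine, and the interpretation of the fields (marker bit, initial, vowel, final) is my reading of the scheme rather than a quotation from the standard.

```python
def johab_fields(syllable: str):
    """Split a Johab-encoded hangul syllable into its bit fields.

    Returns (marker bit, initial field, vowel field, final field).
    """
    raw = syllable.encode("johab")
    value = int.from_bytes(raw, "big")
    return (value >> 15, (value >> 10) & 0x1F, (value >> 5) & 0x1F, value & 0x1F)

ga = "가"
print(ga.encode("johab").hex(), johab_fields(ga))
print(johab_fields("한"))
```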
Japanese EBCDIC  charset 

Also called: EBCDIC
Languages: Japanese
See also:IBM EBCDIC

The Japanese version of EBCDIC contains half-width katakana instead of lowercase Roman letters. Since there are far more katakana than Roman letters, the layout of characters is very odd and most versions probably do not have sensible EBCDIC 'folding' behaviour. It is extremely difficult to imagine a purpose for which this character set is well suited, even by EBCDIC standards.
Japanese EBCDIC (Revised)  charset 

Also called: EBCDIC
Languages: Japanese
See also:IBM EBCDIC

This particular EBCDIC variant seems to completely abandon the 'folding' concept common to most EBCDIC variants. Instead, upper and lowercase Roman characters are arranged as in 'classic' EBCDIC and katakana are packed in around them to fill most of the available space, although there are one or two blanks. Many IBM product lines stuck with the older Japanese EBCDIC version which has no compatibility with this one.
KEIS  encoding 

Languages: Japanese


Hitachi KEIS is used on Hitachi mainframe systems to represent Japanese kanji. Fullwidth alphanumeric characters apparently are compatible with a version of EBCDIC.
KOI-8 Alternative  charset 

Also called: KOI-8 Alternativny
Languages: Cyrillic
See also:KOI-8R

KOI-8 Alternative is an 8-bit Cyrillic character set in which Russian Cyrillic characters are encoded in alphabetical order starting at 128. It is the 'non-KOI-compatible KOI-8'. Microsoft's CP866 is based on this set.
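
Since Python ships `cp866` but not KOI-8 Alternative itself, the alphabetical layout can at least be checked on the descendant. This sketch compares it with `koi8_r`, whose order follows Latin lookalikes instead; the sample letters are the first eight of the Russian alphabet.

```python
# Illustration only: cp866 stands in for KOI-8 Alternative (an assumption).
alphabet = "АБВГДЕЖЗ"

cp866_codes = list(alphabet.encode("cp866"))
koi8_codes = list(alphabet.encode("koi8_r"))

print([hex(c) for c in cp866_codes])  # ascending from 0x80
print([hex(c) for c in koi8_codes])   # not in alphabetical order
```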
KOI-8 Unified  charset 

Languages: Cyrillic
See also:KOI-8R

This character set is an attempt to unify KOI-8R, KOI-8RU, and ISO-IR-111. It includes all Cyrillic letters, Russian or otherwise, and fills up the remaining 8-bit space with graphics characters. It is the only KOI-8 flavor to include all Cyrillic characters.
KOI-8RU  charset 

Also called: RELCOM KOI-8R
Languages: Cyrillic
See also:KOI-8R

This is a version of KOI-8R which supports five extra Ukrainian and Byelorussian characters, which replace some of the graphics characters of KOI-8R.
KOI-8U  charset 

Also called: Ukrainian KOI-8U, RFC 2319
Languages: Cyrillic
See also:KOI-8R KOI-8RU

This Ukrainian version of KOI-8R adds the missing Ukrainian character 'ghe with upturn', which had been suppressed by Stalin due to his general dislike of Ukraine. Unlike KOI-8RU it does not include any Byelorussian or other non-Russian characters. The KOI-8RU, KOI-8U, and KOI-8 Unified character sets often seem to get mixed up in people's minds (perhaps also in mine).
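
A minimal Python sketch of the difference, using the `koi8_u` and `koi8_r` codecs from the standard library. The helper `encodable` is my own; the letters tested are 'ghe with upturn' and the Byelorussian 'short u'.

```python
def encodable(ch: str, codec: str) -> bool:
    """Hypothetical helper: True if ch exists in the given character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

GHE_UPTURN = "\u0491"  # ґ
SHORT_U = "\u045e"     # ў (Byelorussian)

print(encodable(GHE_UPTURN, "koi8_r"), encodable(GHE_UPTURN, "koi8_u"))
print(encodable(SHORT_U, "koi8_u"))

# Ordinary Russian text is encoded identically in both sets.
same_russian = "Привет".encode("koi8_u") == "Привет".encode("koi8_r")
```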
KS C 5601  charset 

Also called: KS X 1001
Languages: Korean


The basic Korean character set. Contains 4,888 hanja (han characters used in Korean) and 2,350 precombined hangul (Korean phonetic) characters. Also contains Greek and Cyrillic letters.

This character set has the peculiarity that any hanja with more than one reading appears once per reading. This makes it probably the only character set to intentionally multiply the number of han characters.
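
The duplicated hanja leave a visible trace in Unicode, which had to add 'compatibility ideographs' so that text could make the round trip. The Python sketch below uses one character that I believe has several such duplicates; the specific code points, and the assumption that Python's `euc_kr` codec maps all of them, are mine.

```python
import unicodedata

BASE = "\u6a02"                            # the ordinary unified ideograph
DUPLICATES = ["\uf914", "\uf95c", "\uf9bf"]  # assumed per-reading duplicates

# Each copy gets its own two-byte code in the Korean set...
codes = {ch: ch.encode("euc_kr") for ch in [BASE] + DUPLICATES}
for ch, code in codes.items():
    print(f"U+{ord(ch):04X} -> {code.hex()}")

# ...but Unicode normalization folds every duplicate back onto the base.
folded = [unicodedata.normalize("NFC", ch) for ch in DUPLICATES]
```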

This standard defines (but does not actually add to the character set) the 'johab' system for specifying any possible combined hangul character.


KS C 5636  charset 

Also called: KS-Roman
Languages: Korean


The Korean Standard (KS) version of ASCII. Identical to ASCII except that the backslash is replaced with a won sign.
KS C 5657  charset 

Languages: Korean


A supplement to KS C 5601, this character set includes extra hanja, extra precombined hangul, and accented European (Latin and Greek) characters.
KS C 5700  charset 

Languages: Korean


This set has the same repertoire of Korean characters as ISO 10646 (Unicode), and supersedes KS C 5601 and KS C 5657.
Localized EBCDIC  charset 

Also called: EBCDIC
Languages: Latin
See also:IBM EBCDIC

It appears that following the release of the ISO-8859 standards, IBM created a corresponding set of EBCDIC standards that represent different European regions, generally by taking a version of 'original' EBCDIC and filling in the extra characters from an ISO-8859 character set in order in the empty code points.

I confess that I have never actually seen any data presented like this, except possibly for the ISO-8859-1 variant.


MARC  charset  family 

Languages: Any
See also:MARC-8 MARC-21 UKMARC

MARC standards, including the Library of Congress' MARC-8 and MARC-21 and the British Library's UKMARC, primarily specify record formats used for bibliographic data. They often also specify character sets and encodings.
MARC-21  charset 

Languages: Any
See also:MARC-8 MARC UTF-8

MARC-21 is a collection of bibliographic record formats used by the US Library of Congress. It also specifies a character repertoire for use in these records. The characters may be encoded using either UTF-8 or MARC-8. Since MARC-21 contains a subset of the Unicode characters, this is an example of UTF-8 being used to encode something other than Unicode.

It is important to bear in mind that the MARC-21 repertoire is a true repertoire, i.e. a list of possible characters, and not a coded character set. The code point used for a character varies according to whether UTF-8 or MARC-8 encoding is used.


MARC-8  charset 

Languages: Any
See also:MARC-21 MARC

MARC-8 is a variable length character encoding used by the Library of Congress in the USA. Characters are either 8-bit or, for CJK, 24 bit. Escape sequences (consisting of control characters) are used to switch between character sets.

MARC-8 specifies several 8-bit coded character sets, e.g. for Greek, Cyrillic and graphics characters. These sets leave room not only for the character-set switching control characters, but for some control characters that have meaning in MARC-21 records (0x88, 0x89, 0x8d, and 0x8e).

These sets form part of the MARC-21 character repertoire.


MacCentralEuropean  charset 

Languages: Latin


This is the 8-bit Central European character set used on Macs. It seems to represent all Roman characters used in Central European languages, without being exactly the same as ISO-8859-2.
MacRoman  charset 

Languages: Latin


This is the 8-bit character set traditionally used on Macs. Its repertoire of 223 characters matches neither the standard Roman set (ISO-8859-1) nor the de facto standard, Microsoft's CP1252. MacRoman omits superscript numbers, fractions, and the letters 'eth' and 'thorn' (no longer used in English but useful in Scandinavia), and includes a number of mathematical symbols, plus the Apple logo.
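
These gaps and extras can be checked with a minimal Python sketch using the standard `mac_roman` codec. The helper `encodable` and the sample characters are my own picks; note that Python maps the Apple logo to a private-use code point, since Unicode has no official character for it.

```python
def encodable(ch: str, codec: str) -> bool:
    """Hypothetical helper: True if ch exists in the given character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Present in ISO-8859-1 but missing from MacRoman.
missing = [ch for ch in "ðþ½²" if not encodable(ch, "mac_roman")]

# Mathematical symbols MacRoman has and CP1252 lacks.
extras = [ch for ch in "π∫∑√" if encodable(ch, "mac_roman")]

apple_logo = b"\xf0".decode("mac_roman")
print(missing, extras, hex(ord(apple_logo)))
```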

No attempt has been made in this character set list to describe the various MacRoman variants as they rarely seem to have names.


Microsoft Code Pages  charset  encoding  family 

Also called: CP, Codepage
Languages: Any
See also:CP932 CP1252 CP437 CP850 CP1253 CP936 CP950

Microsoft uses the term 'code page' rather vaguely to mean a character set or a character set plus an encoding. Since the days of DOS, MS has defined code pages for many locales and languages. Because MS systems have often been first to provide a given level of internationalization, Microsoft code pages have often become de facto standards. The character repertoire of a code page is frequently formed by taking an existing standard and adding whatever extra characters are most in demand.

Code pages are designated as 'CP[number]' where the number is pretty well random. Some important code pages are:

  • CP932: The Japanese code page, which specifies Microsoft's version of the Shift-JIS encoding.
  • CP1252: The Latin-1 code page, usually used instead of the equivalent ISO-8859 standard (ISO-8859-1) because it has a euro symbol.
  • CP437: The original DOS character set.
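
To show how much the meaning of a byte depends on the code page, here is a small Python sketch that decodes one byte under several of the code pages bundled with Python. The byte value and the selection of pages are arbitrary choices of mine.

```python
# Illustration only: one byte, several Microsoft code pages.
BYTE = b"\x80"
PAGES = ("cp437", "cp850", "cp1251", "cp1252")

meanings = {cp: BYTE.decode(cp) for cp in PAGES}
for cp, ch in meanings.items():
    print(cp, repr(ch))

# The euro sign that made CP1252 popular is absent from ISO-8859-1.
try:
    "€".encode("iso8859_1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```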

Mojikyo  charset 

Languages: CJK, Chinese, Korean, Japanese, Vietnamese


Mojikyo is a product developed by the Mojikyo Institute in Japan for representing rare and scholarly Chinese characters. Technically more of a glyph set than a character set, it has an important role because there are many characters that can only be represented electronically using Mojikyo. In addition to 'regular' han characters, Mojikyo includes:
  • Chu Nom (Vietnamese versions of han characters)
  • The Shui and Tangut scripts
  • Divination symbols
  • Korean-made 'han' characters
The Mojikyo institute provides fonts and character-finding software as well as defining a repertoire of characters (or glyphs). In fact, without this character-finding software many of the rare variants in Mojikyo would be well nigh impossible to specify.

Unlike Unicode, Mojikyo does not attempt to 'unify' han characters, which means, for instance, that the traditional, simplified, Japanese, and Korean versions of a character are all separate entities in Mojikyo. This has both advantages and disadvantages relative to Unicode, depending on the task. Mojikyo also considers all historical, classical, regional and obsolete versions of a character to be separate entities, which is tremendously useful for many literary purposes (and also a bit more predictable than Unicode). Mojikyo associates a certain amount of meta-information, e.g. stroke count and radical, with each character, which allows the software component of Mojikyo to easily find characters and variants.

Although not as suited to general-purpose computing as Unicode, Mojikyo seems likely to remain an important tool for academic, linguistic and decorative use far into the future.


NeXT International Code  charset 

Also called: NeXTSTEP, NIC
Languages: Latin


This character set, based on ISO-8859-1, was used on computers running NeXTSTEP. It has a couple of extra characters in the high half and a couple of seemingly random changes.


New GOST-19768  charset 

Also called: GOST-19768-87, GOST-19768
Languages: Cyrillic
See also:KOI-8

In 1987, GOST-19768 became the new Russian government standard character set, abandoning the 'KOI Property' (decipherability when reduced to ASCII). Fierce debate has raged ever since about whether KOI-8-derived character sets are better than those, like GOST-19768, that have characters in dictionary order.
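
The trade-off described above can be illustrated with a small Python sketch. Python ships no GOST-19768 codec, so two stand-ins are assumed here: `koi8_r` represents the KOI-8 family, and `iso8859_5` represents a character set whose Cyrillic letters are in dictionary order. It is not claimed to be identical to GOST-19768. The function names are invented for the example.

```python
# Illustrate the 'KOI Property' versus dictionary ordering.
# Assumptions: koi8_r stands in for the KOI-8 family, and iso8859_5
# stands in for an alphabetically ordered Cyrillic character set.


def strip_high_bit(text: str, codec: str = "koi8_r") -> str:
    """Clear the top bit of every byte, as a 7-bit channel would."""
    return bytes(b & 0x7F for b in text.encode(codec)).decode("ascii")


def sort_by_code(letters: str, codec: str) -> list:
    """Sort letters by their code values in the given character set."""
    return sorted(letters, key=lambda ch: ch.encode(codec))


# KOI Property: the text stays readable (with case swapped) in ASCII.
print(strip_high_bit("москва"))

# Sorting by code value is alphabetical only in the ordered set.
print(sort_by_code("бвг", "koi8_r"))
print(sort_by_code("бвг", "iso8859_5"))
```

The first call should produce a Latin transliteration of the Russian word, while the two sorts show that only the dictionary-ordered set can be collated by plain byte comparison.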

Original EBCDIC  charset 

Also called: EBCDIC
Languages: Latin
See also:IBM EBCDIC

The original, canonical form of IBM's EBCDIC character set defined an uppercase area, a punctuation area, and a lowercase area, arranged so that the uppercase could be 'folded' over the lowercase, leaving a character set with uppercase and punctuation. This design was spoilt by several factors such as the embedding of four punctuation characters (hidden when 'folded') among the upper case characters, and the inclusion of a couple of fairly random characters in some unused higher code points.

As is typical of EBCDIC, even 'original' EBCDIC is available in various dialects that differ in the placement of the punctuation marks.
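
One way to picture the 'folding' of upper case over lower case is as a single-bit operation. The sketch below is an assumption-laden illustration, not a description of the original design: it uses Python's `cp037` codec (a later IBM EBCDIC code page, not 'original' EBCDIC), in which each lowercase letter differs from its uppercase counterpart only in the 0x40 bit. The function name `fold_to_upper` is invented here.

```python
# Sketch of case 'folding' in an EBCDIC code page.
# Assumption: cp037 is used as a convenient, available EBCDIC dialect.


def fold_to_upper(text: str, codec: str = "cp037") -> str:
    """Set the 0x40 bit of every byte, mapping lowercase onto uppercase."""
    raw = text.encode(codec)
    return bytes(b | 0x40 for b in raw).decode(codec)


# Lowercase and uppercase letters sit in parallel areas 0x40 apart.
print("a".encode("cp037").hex(), "A".encode("cp037").hex())

# Setting the bit leaves spaces and uppercase letters unchanged.
print(fold_to_upper("abc xyz"))
```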


PETSCII  charset 

Also called: Commodore ASCII
Languages: Latin


PETSCII is the version of ASCII used on Commodore computers of the 1980s. It was based on the 1963 ASCII standard (e.g. it had a left arrow instead of an underscore) but added a large number of graphics characters. It appeared in various flavors on the Commodore PET and the C64. Since the C64 is the most widely distributed general-purpose computer in history (so far), it presumably follows that PETSCII is a widely used character set, despite the fact that it is rather obscure. Luckily, a well-defined PETSCII to Unicode mapping exists.


Super DEC Kanji  encoding 

Languages: Japanese
See also:DEC Kanji

This encoding is an extension of DEC Kanji which can represent the characters of JIS X 0212 as well as the DEC Kanji range. It is (or was) used on systems made by DEC.

TCVN5712-1  charset