UNICODE (7)

the unified 16-bit super character set

DESCRIPTION

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS guarantees round-trip compatibility with all other existing character set standards.

UCS contains the characters required to represent almost all known languages. Apart from the many languages that use extensions of the Latin script, this includes the following scripts and languages: Greek, Cyrillic, Hebrew, Arabic, Armenian, Georgian, Japanese, Chinese, Hiragana, Katakana, Korean, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Bopomofo, and a number of others. Work is going on to include further scripts like Tibetan, Khmer, Runic, Ethiopian, Hieroglyphics, various Indo-European languages, and many others. For most of these latter scripts, it was not yet clear how they could best be encoded when the standard was published in 1993.

In addition to the characters required by these scripts, a large number of graphical, typographical, mathematical and scientific symbols like those provided by TeX, PostScript, MS-DOS, Macintosh, Videotext, OCR, and many word processing systems have been included, as well as special codes that guarantee round-trip compatibility to all other existing character set standards.

The UCS standard (ISO 10646) describes a 31-bit character set architecture. However, today only the first 65534 code positions (0x0000 to 0xfffd), which are called the Basic Multilingual Plane (BMP), have been assigned characters. It is expected that only very exotic characters (e.g. Hieroglyphics) for special scientific purposes will ever get a place outside this 16-bit BMP.
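The two ranges described above can be expressed as simple bounds checks. The following is a minimal sketch, not part of any library: the names UCS_BMP_LAST, ucs_in_bmp and ucs_is_valid_31bit are hypothetical, and the BMP upper bound is taken directly from the 0x0000 to 0xfffd range quoted in this page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Last code position of the BMP as quoted in this page (assumption:
 * we follow the 0x0000..0xfffd range given in the text). */
#define UCS_BMP_LAST 0xfffdu

/* Hypothetical helper: true if the code fits the 31-bit UCS architecture. */
static bool ucs_is_valid_31bit(uint32_t code)
{
    return code <= 0x7fffffffu;
}

/* Hypothetical helper: true if the code lies in the 16-bit BMP range. */
static bool ucs_in_bmp(uint32_t code)
{
    return code <= UCS_BMP_LAST;
}
```

A caller that stores characters in a 16-bit type could use ucs_in_bmp() to reject codes that would not fit before narrowing them.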

The UCS characters 0x0000 to 0x007f are identical to those of the classic US-ASCII character set and the characters in the range 0x0000 to 0x00ff are identical to those in the ISO 8859-1 Latin-1 character set.
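Because the first 256 UCS codes coincide with ISO 8859-1, converting Latin-1 text to 16-bit UCS codes needs no table: each byte is simply widened. The sketch below illustrates this under that assumption only; latin1_to_ucs and ucs_to_latin1 are hypothetical names, not existing library functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen n ISO 8859-1 bytes into 16-bit UCS codes.
 * The byte value and the UCS code are numerically identical.
 * Returns the number of characters written to dst. */
static size_t latin1_to_ucs(const unsigned char *src, size_t n, uint16_t *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return n;
}

/* Hypothetical helper: narrow one UCS code back to ISO 8859-1.
 * Returns 1 on success, 0 if the code lies outside 0x0000..0x00ff. */
static int ucs_to_latin1(uint16_t code, unsigned char *out)
{
    if (code > 0x00ff)
        return 0;
    *out = (unsigned char)code;
    return 1;
}
```

The reverse direction has to report failure, since most UCS codes have no Latin-1 equivalent.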

COMBINING CHARACTERS

Some code points in UCS have been assigned to combining characters.

IMPLEMENTATION LEVELS

ISO 10646 specifies three implementation levels of UCS: Level 1, Level 2, and Level 3.

The Unicode 1.1 standard published by the Unicode Consortium contains exactly the UCS Basic Multilingual Plane at implementation level 3, as described in ISO 10646. Unicode 1.1 also adds semantic definitions for some characters to the definitions of ISO 10646.

UNICODE UNDER LINUX

Under Linux, wchar_t values are interpreted as UCS codes in the BMP.

The locale setting specifies whether the system character encoding is, for example, UTF-8 or ISO 8859-1. Library functions like wctomb, mbtowc, or wprintf can be used to transform the internal wchar_t characters and strings into the system character encoding and back.
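The conversion between wchar_t and the system character encoding can be sketched with the standard C functions setlocale, wctomb and mbtowc. The helper names use_system_encoding and roundtrip_ok are hypothetical and serve only as illustration; which characters convert successfully depends on the locale in effect.

```c
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: select the character encoding named by the
 * user's environment. Returns 1 on success, 0 if the locale is unknown. */
static int use_system_encoding(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: convert wc to the current system encoding and
 * back again. Returns 1 if the character survives the round trip,
 * 0 if it cannot be represented in the current locale. */
static int roundtrip_ok(wchar_t wc)
{
    char buf[MB_LEN_MAX];   /* large enough for any single character */
    wchar_t back = 0;
    int len;

    wctomb(NULL, 0);                 /* reset the conversion state */
    len = wctomb(buf, wc);           /* wchar_t -> system encoding */
    if (len < 0)
        return 0;

    mbtowc(NULL, NULL, 0);           /* reset the conversion state */
    if (mbtowc(&back, buf, (size_t)len) < 0)  /* and back again */
        return 0;

    return back == wc;
}
```

A program would typically call use_system_encoding() once at startup. Without that call the default "C" locale stays in effect, in which only a limited set of characters (at least US-ASCII) is guaranteed to convert.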

PRIVATE AREA

In the BMP, a range of code positions is reserved for private use.

LITERATURE

This is the official specification of UCS. Pretty official, pretty thick, and pretty expensive. For ordering information, check www.iso.ch.

Unicode 1.1.4 is already available. The changes relative to the 1.0 book are available from ftp.unicode.org. Unicode 2.0 will be published again as a book in 1996.

A good reference book about the C programming language. The fourth edition now also covers the 1994 Amendment 1 to the ISO C standard (ISO/IEC 9899:1990), which adds a large number of new C library functions for handling wide character sets.

BUGS

UCS

AUTHOR

Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
