UNICODE (7)
the unified 16-bit super character set
DESCRIPTION
The international standard
ISO 10646
defines the
Universal Character Set (UCS) .
UCS
contains all characters of all other character set standards. It also
guarantees
round-trip compatibility
i.e., conversion tables can be built such that no information is lost
when a string is converted from any other encoding to
UCS
and back.
UCS
contains the characters required to represent almost all known
languages. This includes apart from the many languages which use
extensions of the Latin script also the following scripts and
languages: Greek, Cyrillic, Hebrew, Arabic, Armenian, Gregorian,
Japanese, Chinese, Hiragana, Katakana, Korean, Hangul, Devangari,
Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayam,
Thai, Lao, Bopomofo, and a number of others. Work is going on to
include further scripts like Tibetian, Khmer, Runic, Ethiopian,
Hieroglyphics, various Indo-European languages, and many others. For
most of these latter scripts, it was not yet clear how they can be
encoded best when the standard was published in 1993. In addition to
the characters required by these scripts, also a large number of
graphical, typographical, mathematical and scientific symbols like
those provided by TeX, PostScript, MS-DOS, Macintosh, Videotext, OCR,
and many word processing systems have been included, as well as
special codes that guarantee round-trip compatibility to all other
existing character set standards.
The
UCS
standard (ISO 10646) describes a 31-bit character set architecture,
however, today only the first 65534 code positions (0x0000 to 0xfffd),
which are called the
Basic Multilingual Plane (BMP)
have been assigned characters, and it is expected that only very
exotic characters (e.g. Hieroglyphics) for special scientific purposes
will ever get a place outside this 16-bit BMP.
The
UCS
characters 0x0000 to 0x007f are identical to those of the classic
US-ASCII
character set and the characters in the range 0x0000 to 0x00ff
are identical to those in the
ISO 8859-1 Latin-1
character set.
COMBINING CHARACTERS
Some code points in
UCS
have been assigned to
combining characters .
These are similar to the non-spacing accent keys on a typewriter. A
combining character just adds an accent to the previous character.
The most important accented characters have codes of their own in
UCS ,
however, the combining character mechanism allows to add accents and other
diacritical marks to any character. The combining characters always follow
the character which they modify. For example, the German character
Umlaut-A ("Latin capital letter A with diaeresis") can either be represented
by the precomposed
UCS
code 0x00c4, or alternatively as the combination of a normal "Latin
capital letter A" followed by a "combining diaeresis": 0x0041 0x0308.
IMPLEMENTATION LEVELS
As not all systems are expected to support advanced mechanisms like
combining characters, ISO 10646 specifies the following three
implementation levels of
UCS:
Level 1
Combining characters and Hangul Jamo characters (a special,
more complicated encoding of the Korean script, where Hangul syllables
are coded as two or three subcharacters) are not supported.
Level 2
Like level 1, however in some scripts, some combining characters are
now allowed (e.g. for Hebrew, Arabic, Devangari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugo, Kannada, Malayalam, Thai and Lao).
Level 3
All
UCS
characters are supported.
The Unicode 1.1 standard published by the Unicode Consortium contains
exactly the
UCS Basic Multilingual Plane
at implementation level 3, as described in ISO 10646. Unicode 1.1 also
adds some semantical definitions for some characters to the
definitions of ISO 10646.
UNICODE UNDER LINUX
Under Linux, only the
BMP
at implementation level 1 should be used at the moment, in order to
keep the implementation complexity of combining characters low. The
higher implementation levels are more suitable for special word
processing formats, but not as a generic system character set. The C
type
wchar_t
is on Linux an unsigned 16-bit integer type and its values are
interpreted as
UCS
level 1
BMP
codes.
The locale setting specifies, whether the system character encoding is
for example
UTF-8
or
ISO 8859-1 .
Library functions like
wctomb,
mbtowc,
or
wprintf
can be used to transform the internal
wchar_t
characters and strings into the system character encoding and back.
PRIVATE AREA
In the
BMP ,
the range 0xe000 to 0xf8ff will never be assigned any characters by
the standard and is reserved for private usage. For the Linux
community, this private area has been subdivided further into the
range 0xe000 to 0xefff which can be used individually by any end-user
and the Linux zone in the range 0xf000 to 0xf8ff where extensions are
coordinated among all Linux users. The registry of the characters
assigned to the Linux zone is currently maintained by H. Peter Anvin
<Peter.Anvin@linux.org>, Yggdrasil Computing, Inc. It contains some
DEC VT100 graphics characters missing in Unicode, gives direct access
to the characters in the console font buffer and contains the
characters used by a few advanced scripts like Klingon.
LITERATURE
*
Information technology - Universal Multiple-Octet Coded Character
Set (UCS) - Part 1: Architecture and Basic Multilingual Plane.
International Standard ISO 10646-1, International Organization
for Standardization, Geneva, 1993.
This is the official specification of
UCS .
Pretty official, pretty thick, and pretty expensive. For ordering
information, check www.iso.ch.
*
The Unicode Standard - Worldwide Character Encoding Version 1.0.
The Unicode Consortium, Addison-Wesley,
Reading, MA, 1991.
There is already Unicode 1.1.4 available. The changes to the 1.0
book are available from ftp.unicode.org. Unicode 2.0 will
be published again as a book in 1996.
*
S. Harbison, G. Steele. C - A Reference Manual. Fourth edition,
Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.
A good reference book about the C programming language. The fourth
edition now covers also the 1994 Amendment 1 to the ISO C standard
(ISO/IEC 9899:1990) which adds a large number of new C library
functions for handling wide character sets.
BUGS
At the time when this man page was written, the Linux libc support for
UCS
was far from complete.
AUTHOR
Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
SEE ALSO
|
|