UTF-8 - an ASCII compatible multibyte Unicode encoding
DESCRIPTION
The Unicode character set occupies a 16-bit code space. The most obvious
Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words.
In such strings, many 16-bit characters contain bytes such as '\0' or '/',
which have a special meaning in filenames and other C library function
parameters. In addition, the majority of UNIX tools expect ASCII files and
cannot read 16-bit words as characters without major modifications. For
these reasons, UCS-2 is not a suitable external encoding of Unicode in
filenames, text files, environment variables, etc. The ISO 10646 Universal
Character Set (UCS), a superset of Unicode, occupies an even larger 31-bit
code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit
words) has the same problems.

The UTF-8 encoding of Unicode and UCS does not have these problems and is
the way to go for using the Unicode character set under Unix-style
operating systems.
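
To make the problem concrete, the following Python sketch encodes two
characters as 16-bit words and as UTF-8, then looks for bytes that collide
with '\0' and '/'. It is an illustration under assumptions: Python's
"utf-16-be" codec stands in for big-endian UCS-2, the two sample characters
were picked for this example, and the helper name has_special_bytes is
hypothetical.

```python
# Illustration only. Python's "utf-16-be" codec stands in for big-endian
# UCS-2; the two agree for the 16-bit code numbers used here.

def has_special_bytes(data):
    """True if the byte string contains 0x00 ('\\0') or 0x2f ('/')."""
    return 0x00 in data or 0x2F in data

def hexdump(data):
    """Format a byte string as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

samples = ["A", "\u012f"]  # U+0041 and U+012F, chosen for this example

for ch in samples:
    ucs2 = ch.encode("utf-16-be")
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  UCS-2: {hexdump(ucs2)}  UTF-8: {hexdump(utf8)}")

# U+0041 as a 16-bit word contains a zero byte.
ucs2_a = "A".encode("utf-16-be")
# U+012F as a 16-bit word contains the byte 0x2f, which reads as '/'.
ucs2_i = "\u012f".encode("utf-16-be")
# The UTF-8 form of U+012F contains neither.
utf8_i = "\u012f".encode("utf-8")
```

A program that treats the 16-bit form as a C string would stop at the zero
byte in the first sample and see a path separator in the second, while the
UTF-8 form of the second sample contains only bytes above 0x7f.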
PROPERTIES
The UTF-8 encoding has the following nice properties:

*  UCS characters 0x00000000 to 0x0000007f (the classical US-ASCII
   characters) are encoded simply as bytes 0x00 to 0x7f (ASCII
   compatibility). This means that files and strings which contain only
   7-bit ASCII characters have the same encoding under both ASCII and
   UTF-8.

*  All UCS characters > 0x7f are encoded as a multibyte sequence
   consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte
   can appear as part of another character and there are no problems
   with e.g. '\0' or '/'.

*  The lexicographic sorting order of UCS-4 strings is preserved.

*  All possible 2^31 UCS codes can be encoded using UTF-8.

*  The bytes 0xfe and 0xff are never used in the UTF-8 encoding.

*  The first byte of a multibyte sequence which represents a single
   non-ASCII UCS character is always in the range 0xc0 to 0xfd and
   indicates how long this multibyte sequence is. All further bytes in a
   multibyte sequence are in the range 0x80 to 0xbf. This allows easy
   resynchronization and makes the encoding stateless and robust against
   missing bytes.

*  UTF-8 encoded UCS characters may be up to six bytes long; however,
   Unicode characters can only be up to three bytes long. As Linux uses
   only the 16-bit Unicode subset of UCS, under Linux, UTF-8 multibyte
   sequences can only be one, two or three bytes long.
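
Several of these properties can be checked directly. The Python sketch
below relies on Python's built-in "utf-8" codec; the sample strings and the
helper names is_continuation and resync are assumptions made for this
illustration, not part of any library.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx (0x80 to 0xbf)."""
    return 0x80 <= byte <= 0xBF

def resync(data, pos):
    """Return the index of the first character start at or after pos."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

# ASCII compatibility: pure ASCII text has the same bytes in both encodings.
assert "abc".encode("utf-8") == "abc".encode("ascii")

# No ASCII byte appears inside a multibyte sequence.
assert all(b >= 0x80 for b in "\u2260".encode("utf-8"))

# Sorting by code number and sorting by UTF-8 bytes give the same order.
words = ["zebra", "\u00e9t\u00e9", "\u2260", "abc", "\u00a9"]
by_code = sorted(words, key=lambda w: [ord(c) for c in w])
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
assert by_code == by_bytes

# Resynchronization: start in the middle of a character and skip forward.
data = "a\u2260b".encode("utf-8")     # bytes 61 e2 89 a0 62
start = resync(data, 2)               # index 2 is inside U+2260
print(data[start:].decode("utf-8"))   # prints "b"
```

Because continuation bytes occupy their own range, a reader that lands in
the middle of a character only has to skip bytes in 0x80 to 0xbf to reach
the next character boundary; no earlier context is needed.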
ENCODING
The following byte sequences are used to represent a character. The
sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F:
    0xxxxxxx

0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code
number in binary representation. Only the shortest possible multibyte
sequence which can represent the code number of the character can be used.
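
The rules above can be sketched as a small encoder. This is a minimal
Python illustration, not a reference implementation: the function name
utf8_encode is hypothetical, and it follows the ranges as listed in this
section, including the five-byte and six-byte forms, which Python's
built-in codec does not produce.

```python
def utf8_encode(code):
    """Encode one UCS code number (0 to 0x7FFFFFFF) as a byte string."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])          # 0xxxxxxx

    # (upper limit of range, number of 10xxxxxx bytes, first-byte marker)
    rows = [
        (0x7FF,      1, 0xC0),        # 110xxxxx
        (0xFFFF,     2, 0xE0),        # 1110xxxx
        (0x1FFFFF,   3, 0xF0),        # 11110xxx
        (0x3FFFFFF,  4, 0xF8),        # 111110xx
        (0x7FFFFFFF, 5, 0xFC),        # 1111110x
    ]
    # Pick the first range that fits, which yields the shortest sequence.
    for limit, tail, marker in rows:
        if code <= limit:
            break

    out = []
    for _ in range(tail):
        out.append(0x80 | (code & 0x3F))   # low six bits, last byte first
        code >>= 6
    out.append(marker | code)              # remaining high bits
    return bytes(reversed(out))

print(utf8_encode(0x2260).hex())           # prints "e289a0"
```

Choosing the range by the size of the code number is what enforces the
shortest-sequence rule: a code number never reaches a row that would give
it a longer sequence than it needs.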
EXAMPLES
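As an illustration, the Python sketch below prints the UTF-8 form of two
sample characters in binary and hexadecimal. The characters (0xa9, the
copyright sign, and 0x2260, the "not equal" symbol) are chosen here for the
example, the helper name show is hypothetical, and the bytes come from
Python's built-in "utf-8" codec.

```python
def show(code):
    """Return the UTF-8 bytes of one code number as binary and hex text."""
    data = chr(code).encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in data)
    hexes = " ".join(f"0x{b:02x}" for b in data)
    return bits, hexes

for code in (0xA9, 0x2260):
    bits, hexes = show(code)
    print(f"0x{code:04x} -> {bits} = {hexes}")

# Output:
# 0x00a9 -> 11000010 10101001 = 0xc2 0xa9
# 0x2260 -> 11100010 10001001 10100000 = 0xe2 0x89 0xa0
```

In the first line, the bits 000 1010 1001 of 0xa9 fill the pattern
110xxxxx 10xxxxxx; in the second, the sixteen bits of 0x2260 fill
1110xxxx 10xxxxxx 10xxxxxx.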
STANDARDS
ISO 10646, Unicode 1.1, XPG4, Plan 9.
AUTHOR
Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>