UTF-8 - an ASCII compatible multibyte Unicode encoding

DESCRIPTION

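The Unicode character set occupies a 16-bit code space. The most obvious Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings can contain, as parts of many 16-bit characters, bytes like '\0' or '/' which have a special meaning in filenames and other C library function parameters. In addition, the majority of Unix tools expect ASCII files and cannot read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc. The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies a 31-bit code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.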

The UTF-8 encoding of Unicode and UCS does not have these problems and is the way to go for using the Unicode character set under Unix-style operating systems.

PROPERTIES

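The UTF-8 encoding has the following properties:

* UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). Files and strings that contain only 7-bit ASCII characters therefore have the same encoding under both ASCII and UTF-8.

* All UCS characters greater than 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with, for example, '\0' or '/'.

* The lexicographic sorting order of UCS-4 strings is preserved.

* All possible 2^31 UCS codes can be encoded using UTF-8.

* The bytes 0xfe and 0xff are never used in the UTF-8 encoding.

* The first byte of a multibyte sequence which represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long this multibyte sequence is. All further bytes in a multibyte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

* UTF-8 encoded UCS characters may be up to six bytes long; Unicode characters, which fit in 16 bits, need at most three bytes.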
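The byte ranges used by UTF-8 (0x00 to 0x7f for ASCII, 0x80 to 0xbf for further bytes of a sequence, 0xc0 to 0xfd for a first byte, 0xfe and 0xff unused) mean that any single byte can be classified without context. The following Python sketch illustrates resynchronization and the preserved sorting order; the names byte_kind and resync are hypothetical helpers chosen for this illustration, not part of any library.

```python
def byte_kind(b: int) -> str:
    """Classify one byte value (0..255) by its role in UTF-8."""
    if b <= 0x7F:
        return "ascii"         # 0xxxxxxx: a complete ASCII character
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    if b <= 0xFD:
        return "lead"          # first byte of a multibyte sequence
    return "unused"            # 0xfe and 0xff never occur


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1
    return pos


# Sorting by code number and sorting by encoded bytes give the same order.
words = ["z", "\u00e9", "\u2260", "a"]
by_code = sorted(words)
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
```

Starting in the middle of a multibyte character, resync skips at most five bytes before it reaches the start of the next character, which is why a missing or corrupted byte damages only one character.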

ENCODING

0x00000000 - 0x0000007F:
    0xxxxxxx

0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. Only the shortest possible multibyte sequence which can represent the code number of the character can be used.
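The rule above can be sketched in Python as follows. This is a minimal illustration of the six-range scheme described in this section (the full 31-bit UCS range), not the codec shipped with any particular library; the function name utf8_encode is an assumption made for this example.

```python
def utf8_encode(code: int) -> bytes:
    """Encode one UCS code number (0 .. 0x7FFFFFFF) as a UTF-8 byte sequence."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number outside the 31-bit UCS range")
    if code <= 0x7F:
        return bytes([code])  # 0xxxxxxx: ASCII is encoded as itself

    # (upper limit of the range, marker bits of the first byte,
    #  number of 10xxxxxx bytes that follow)
    ranges = [
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    ]
    # Taking the first range that fits yields the shortest possible sequence.
    for limit, marker, n in ranges:
        if code <= limit:
            break

    # The first byte carries the marker and the topmost bits of the code.
    out = [marker | (code >> (6 * n))]
    # Each following byte carries the next six bits, most significant first.
    for i in range(n - 1, -1, -1):
        out.append(0x80 | ((code >> (6 * i)) & 0x3F))
    return bytes(out)
```

For example, utf8_encode(0x41) returns the single byte 0x41, and utf8_encode(0x7FFFFFFF) returns the six bytes 0xfd 0xbf 0xbf 0xbf 0xbf 0xbf.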

EXAMPLES

The Unicode character 0xa9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as:

11000010 10101001 = 0xc2 0xa9

and character 0x2260 = 0010 0010 0110 0000 (the "not equal" symbol) is encoded as:

11100010 10001001 10100000 = 0xe2 0x89 0xa0
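The same examples can be checked in the other direction. The sketch below decodes one character from the start of a byte string and rejects sequences that are longer than necessary, in line with the shortest-sequence rule; the name utf8_decode_one and its return convention are assumptions made for this illustration.

```python
def utf8_decode_one(data: bytes) -> tuple:
    """Decode the first character in data; return (code number, bytes used)."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        return first, 1  # 0xxxxxxx: plain ASCII

    # (mask selecting the marker bits, marker value,
    #  number of following bytes, smallest code allowed for this length)
    forms = [
        (0xE0, 0xC0, 1, 0x80),
        (0xF0, 0xE0, 2, 0x800),
        (0xF8, 0xF0, 3, 0x10000),
        (0xFC, 0xF8, 4, 0x200000),
        (0xFE, 0xFC, 5, 0x4000000),
    ]
    for mask, marker, n, smallest in forms:
        if (first & mask) == marker:
            break
    else:
        # 10xxxxxx cannot start a character; 0xfe and 0xff are never used.
        raise ValueError("invalid first byte")

    if len(data) < n + 1:
        raise ValueError("truncated sequence")

    code = first & ~mask & 0xFF  # payload bits of the first byte
    for b in data[1:n + 1]:
        if (b & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte 10xxxxxx")
        code = (code << 6) | (b & 0x3F)

    if code < smallest:
        raise ValueError("not the shortest possible sequence")
    return code, n + 1
```

With this sketch, the bytes 0xc2 0xa9 decode to (0xa9, 2) and the bytes 0xe2 0x89 0xa0 decode to (0x2260, 3), while an overlong pair such as 0xc0 0xaf raises ValueError.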

STANDARDS

ISO 10646, Unicode 1.1, XPG4, Plan 9.

AUTHOR

Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>
