|Date||Sun, 18 Feb 2007 09:40:21 +0430|
|Subject||Re: [greenstone-users] UTF-8?|
|As I understand it, UTF-8 is not a superset of ISO-8859-1. UTF-8 maps directly onto 7-bit ASCII, but not onto the byte values with the 8th bit set, which is where ISO-8859-1 keeps the accented characters of the Latin alphabet. This is because of the design of UTF-8: the leading bit(s) of the first byte of a UTF-8 sequence always describe how many bytes are needed to encode the Unicode character. For a one-byte character the first bit is always 0. For a two-byte character the first three bits are 110, and the first two bits of the following byte are 10. That means 5 of the 16 available bits are used to identify the byte type (which also gives some simple error checking), leaving 11 bits for the character. Excluding the 128 values already covered by the one-byte encoding, that allows 1920 characters to be encoded with the two-byte scheme. UTF-8 continues into three- and four-byte encodings.
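To make the bit counting above concrete, here is a rough Python sketch. The names `two_byte_count` and `sequence_length` are just mine for illustration, and the function only looks at the leading bits of the first byte; it does not check that the rest of the sequence is valid UTF-8.

```python
# Payload bits in a two-byte sequence: 5 in the lead byte (110yyyyy)
# plus 6 in the continuation byte (10zzzzzz).
payload_bits = 5 + 6
total_patterns = 2 ** payload_bits      # 2048 possible 11-bit values
one_byte_range = 2 ** 7                 # 128 values already covered by one byte
two_byte_count = total_patterns - one_byte_range


def sequence_length(lead_byte: int) -> int:
    """Return the sequence length implied by the top bits of the first byte.

    Illustrative only: it inspects the prefix bits and nothing else.
    """
    if lead_byte & 0x80 == 0x00:   # 0xxxxxxx
        return 1
    if lead_byte & 0xE0 == 0xC0:   # 110xxxxx
        return 2
    if lead_byte & 0xF0 == 0xE0:   # 1110xxxx
        return 3
    if lead_byte & 0xF8 == 0xF0:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, not the start of a character.
    raise ValueError("not a valid lead byte")


print(two_byte_count)
print(sequence_length("à".encode("utf-8")[0]))
```

The first print gives the 1920 figure, and the second shows that the first byte of à in UTF-8 announces a two-byte sequence.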
In ISO-8859-1, à is encoded as the single byte E0,
whereas in UTF-8 it is the two bytes C3 A0.
E0 = 11100000
E0 = 00011 100000 (as 11 bits)
E0 = yyyyy zzzzzz
The two-byte UTF-8 form is given by 110yyyyy 10zzzzzz
Thus it equals:
11000011 10100000 = C3 A0
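
If it helps, the same bit shuffling can be sketched in Python. `encode_two_byte` is a name I made up for this example, not a library function, and it only handles the two-byte range (values 0x80 to 0x7FF); the result is compared against Python's built-in codecs.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Pack an 11-bit value into the pattern 110yyyyy 10zzzzzz."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("value is outside the two-byte UTF-8 range")
    y = code_point >> 6       # top 5 bits (yyyyy)
    z = code_point & 0x3F     # low 6 bits (zzzzzz)
    return bytes([0xC0 | y, 0x80 | z])


# The ISO-8859-1 byte E0 decodes to the character à.
ch = bytes([0xE0]).decode("iso-8859-1")

by_hand = encode_two_byte(ord(ch))
built_in = ch.encode("utf-8")

print(by_hand.hex(), built_in.hex())
```

Both values come out as c3a0, matching the working above.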
Sorry if my explanation is too much, but in short, UTF-8 and ISO-8859-1 are different animals once the high bit is set, which is where the accented characters reside.
On 2/17/07, Julian Fox <firstname.lastname@example.org> wrote: