Latin-1 characters in the range 0x80 - 0xFF are encoded as two bytes in UTF-8.
| UTF-8 HEX | UTF-8 BIN | UTF-8 DEC | Latin-1 DEC |
| C2 A0 | 1100-0010 1010-0000 | 194 160 | 160 |
| C3 80 | 1100-0011 1000-0000 | 195 128 | 192 |
| C3 81 | 1100-0011 1000-0001 | 195 129 | 193 |
In some cases, we may need to convert UTF-8-encoded Latin-1 characters back to single bytes.
First, we need to know how UTF-8 expands a Latin-1 character from one byte to two. All two-byte UTF-8 characters share the following fixed format, where the value of each 'x' depends on the character being encoded.
110x,xxxx 10xx,xxxx
In the first byte, the leading '110' marks the start of a two-byte character: the two '1' bits give the length of the sequence, and the '0' that follows separates these flag bits from the remaining bits.
In the second byte, the leading '10' is a fixed flag that distinguishes a continuation byte from a leading byte such as 110x,xxxx. Every ASCII character is a single byte of the form 0xxx,xxxx, so it can never be mistaken for either. If an application reads a byte of the form 10xx,xxxx and cannot read the byte before it, it can discard this byte and move on to the next one.
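The flag bits described above can be tested with simple masks. The following is a minimal sketch in C, not part of the original implementation; the names `byte_kind`, `classify`, and `resync` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Kinds of byte in a stream of one- and two-byte UTF-8 characters.
   All names here are illustrative assumptions. */
enum byte_kind { BYTE_ASCII, BYTE_LEAD2, BYTE_CONT, BYTE_OTHER };

static enum byte_kind classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return BYTE_ASCII;  /* 0xxx,xxxx */
    if ((b & 0xC0) == 0x80) return BYTE_CONT;   /* 10xx,xxxx */
    if ((b & 0xE0) == 0xC0) return BYTE_LEAD2;  /* 110x,xxxx */
    return BYTE_OTHER;  /* anything else, e.g. the lead byte of a longer sequence */
}

/* Return the index of the first byte at or after pos that is not a
   continuation byte, i.e. where reading can safely start again. */
static size_t resync(const unsigned char *buf, size_t len, size_t pos)
{
    assert(pos <= len);
    while (pos < len && classify(buf[pos]) == BYTE_CONT)
        pos++;
    return pos;
}
```

A reader that starts in the middle of a buffer could call `resync` first to skip any orphaned continuation byte before decoding.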
For the single-byte Latin-1 character 0xC1, the binary encoding is:
11000001
The following shows how it maps to two bytes in UTF-8:
11000011 10000001
110 and 10 are the fixed flags.
000 is filler; the remaining bits, 11 and 000001, are the eight bits of 0xC1.
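The forward mapping can be sketched in C as follows. This is only an illustration of the bit layout described above; the function name `latin1_to_utf8` is an assumption, not something taken from the original code.

```c
#include <assert.h>

/* Encode one Latin-1 byte in the range 0x80 - 0xFF as two UTF-8 bytes.
   latin1_to_utf8 is a hypothetical helper name. */
static void latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    assert(c >= 0x80);           /* bytes below 0x80 stay one byte in UTF-8 */
    out[0] = 0xC0 | (c >> 6);    /* 110 flag, 000 filler, top two bits of c */
    out[1] = 0x80 | (c & 0x3F);  /* 10 flag, low six bits of c */
}
```

For 0xC1 this yields the bytes 0xC3 0x81, matching the binary layout shown above.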
Way one
The following is an implementation in DB2 SQL PL.
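-- h_s is the leading byte, l_s is the continuation byte
SET h_s = SUBSTR( str, i, 1 ) ;
SET l_s = SUBSTR( str, i + 1, 1 ) ;
-- * 64 then MOD 256 acts as << 6 on one byte: the two payload bits become the top bits
SET h_s = CHR( MOD( ASCII( h_s ) * 64, 256 ) ) ;
-- * 4, MOD 256, then / 4 clears the '10' flag and keeps the low six bits
SET l_s = CHR( MOD( ASCII( l_s ) * 4, 256 ) / 4 ) ;
-- the two parts share no bits, so + gives the same result as a bitwise OR
RETURN CHR( ASCII( h_s ) + ASCII( l_s ) ) ;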
The code above uses the arithmetic operators * (multiply), / (divide), + (plus), and MOD in place of the C bitwise operators <<, >>, and |. It is equivalent to the following C code, assuming h_s and l_s are of type unsigned char so that each shift result is truncated to one byte.
h_s = h_s << 6 ;
l_s = l_s << 2 ;
l_s = l_s >> 2 ;
return h_s | l_s ;
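As a self-contained variant of the same idea, the sketch below wraps the bit operations in a function. It masks the continuation byte with 0x3F instead of shifting left and right, which has the same effect. The name `utf8_pair_to_latin1` and the assertions on the input are assumptions added for illustration.

```c
#include <assert.h>

/* Decode a two-byte UTF-8 sequence (h_s, l_s) back to one Latin-1 byte.
   utf8_pair_to_latin1 is a hypothetical helper name. */
static unsigned char utf8_pair_to_latin1(unsigned char h_s, unsigned char l_s)
{
    assert((h_s & 0xFE) == 0xC2);  /* leading byte must be 0xC2 or 0xC3 */
    assert((l_s & 0xC0) == 0x80);  /* continuation byte must be 10xx,xxxx */
    /* The cast drops everything above the low eight bits, like MOD 256. */
    return (unsigned char)((h_s << 6) | (l_s & 0x3F));
}
```

Calling it with 0xC3 and 0x81 gives 0xC1, the character used in the worked example.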
Way two
SET h_s = SUBSTR( str, i, 1 ) ;
SET l_s = SUBSTR( str, i + 1, 1 ) ;
IF ( ASCII( h_s ) = 194 ) THEN -- Latin-1
  -- 0xC2 0x80 - 0xC2 0xBF ==> 0x80 - 0xBF
  RETURN l_s ;
ELSEIF ( ASCII( h_s ) = 195 ) THEN -- Latin-1
  -- 0xC3 0x80 - 0xC3 0xBF ==> 0xC0 - 0xFF, 0xC0 - 0x80 = 0x40 (64)
  RETURN CHR( ASCII( l_s ) + 64 ) ;
END IF ;
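The same branch-on-the-leading-byte approach can be applied to a whole buffer. The following C sketch is one possible way to do that, not the original routine: the function name `utf8_to_latin1`, its signature, and the choice to return (size_t)-1 on input outside Latin-1 are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a UTF-8 buffer that holds only ASCII and Latin-1 characters
   to single bytes. out must have room for at least len bytes.
   Returns the number of bytes written, or (size_t)-1 if the input
   contains anything else. All names here are hypothetical. */
static size_t utf8_to_latin1(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);
    while (i < len) {
        unsigned char h_s = in[i];
        if (h_s < 0x80) {
            out[n++] = h_s;  /* ASCII is copied unchanged */
            i += 1;
        } else if ((h_s == 0xC2 || h_s == 0xC3) && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            unsigned char l_s = in[i + 1];
            /* 0xC2: 0x80 - 0xBF stays as is; 0xC3: add 0x40 (64) */
            out[n++] = (h_s == 0xC2) ? l_s : (unsigned char)(l_s + 64);
            i += 2;
        } else {
            return (size_t)-1;  /* truncated or not a Latin-1 character */
        }
    }
    return n;
}
```

Because every two-byte sequence shrinks to one byte, the output is never longer than the input, which is why a buffer of len bytes is enough.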


