Friday, November 20, 2009

Character sets UTF-8 and ISO-8859-1 in HTML and PHP

Character sets can bring much agony to programming for the web. Here are some useful functions and ideas about handling charsets.

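The two charsets you will meet most often are UTF-8 (an encoding of Unicode) and ISO-8859-1 (also called Latin-1). ISO-8859-1 has room for only 256 characters, mostly Western European ones, while UTF-8 can represent every Unicode character, including Arabic and Chinese.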

HTML files that contain Unicode characters should declare the charset in the <head> section, and the file itself must be saved as UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the page is saved as ISO-8859-1, use 'iso-8859-1' here instead.
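
When the page is generated by PHP, the same charset can also be sent as a real HTTP header, which is what the meta tag stands in for. The following is only a sketch: contentTypeFor() is a hypothetical helper name made up for this example, not something built into PHP.

```php
<?php
// Hypothetical helper: builds the Content-Type value for a given charset.
function contentTypeFor(string $charset): string
{
    return 'text/html; charset=' . strtolower($charset);
}

$charset = 'UTF-8';

// header() has to run before any output is sent to the browser.
if (!headers_sent()) {
    header('Content-Type: ' . contentTypeFor($charset));
}

// The meta tag repeats the same charset inside the HTML itself.
echo '<meta http-equiv="Content-Type" content="' . contentTypeFor($charset) . '">', "\n";
```

Keeping the value in one place makes it harder for the header and the meta tag to disagree.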

If you need to convert between ISO-8859-1 and UTF-8 in PHP, the functions utf8_encode and utf8_decode are useful. utf8_encode converts from ISO-8859-1 to UTF-8, and utf8_decode converts from UTF-8 to ISO-8859-1.
They are used like this (PHP variable names cannot contain hyphens, so the ISO-8859-1 string is called $latin1string here):
$utf8string = utf8_encode( $latin1string );
$latin1string = utf8_decode( $utf8string );
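
To see what the conversion does at the byte level, here is a small sketch. Note that it swaps in mb_convert_encoding() for utf8_encode() and utf8_decode(), because those two are deprecated as of PHP 8.2; it assumes the mbstring extension is available. The sample string is made up for the example.

```php
<?php
// "café" in ISO-8859-1: the "é" is the single byte 0xE9.
$latin1string = "caf\xE9";

// ISO-8859-1 to UTF-8, the same direction as utf8_encode().
$utf8string = mb_convert_encoding($latin1string, 'UTF-8', 'ISO-8859-1');

echo strlen($latin1string), "\n"; // 4 bytes
echo strlen($utf8string), "\n";   // 5 bytes: "é" became 0xC3 0xA9
echo bin2hex($utf8string), "\n";  // 636166c3a9

// UTF-8 back to ISO-8859-1, the same direction as utf8_decode().
$roundTrip = mb_convert_encoding($utf8string, 'ISO-8859-1', 'UTF-8');
var_dump($roundTrip === $latin1string); // bool(true)
```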

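In addition to UTF-8 there are two other Unicode encodings, UTF-16 and UTF-32. All three can represent every Unicode character; the difference is how many bytes each character takes. UTF-8 uses 1 to 4 bytes per character, UTF-16 uses 2 or 4, and UTF-32 always uses 4. Neither UTF-16 nor UTF-32 is common on the web, but UTF-16 is used in more places than UTF-32 (for example inside Windows, Java and JavaScript).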
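
The size difference is easy to check. This is a minimal sketch that assumes PHP 7 or later (for the "\u{...}" escapes) and the iconv extension; encodedSizes() is a hypothetical helper written just for this example.

```php
<?php
// Hypothetical helper: byte length of one UTF-8 character in each encoding.
function encodedSizes(string $utf8char): array
{
    return [
        'utf8'  => strlen($utf8char),
        // An explicit byte order (BE) keeps iconv from adding a byte order mark.
        'utf16' => strlen(iconv('UTF-8', 'UTF-16BE', $utf8char)),
        'utf32' => strlen(iconv('UTF-8', 'UTF-32BE', $utf8char)),
    ];
}

// A Latin letter, the euro sign (U+20AC) and an emoji (U+1F600).
foreach (["A", "\u{20AC}", "\u{1F600}"] as $char) {
    $sizes = encodedSizes($char);
    printf("UTF-8: %d  UTF-16: %d  UTF-32: %d bytes\n",
        $sizes['utf8'], $sizes['utf16'], $sizes['utf32']);
}
```

The letter takes 1, 2 and 4 bytes, the euro sign 3, 2 and 4, and the emoji 4 bytes in all three.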

To convert between these Unicode charsets use iconv. It is used like this:
$utf16string = iconv("utf-8", "utf-16", $utf8string);
$utf8string = iconv("utf-16", "utf-8", $utf16string);
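
One thing to watch for: with plain "utf-16" the byte order is left to the iconv implementation, and many of them put a byte order mark (BOM) in front of the result. The sketch below assumes the iconv extension and uses "UTF-16LE" to pin the byte order down; the sample string is made up.

```php
<?php
$utf8string = "abc";

// Plain "UTF-16": byte order and BOM depend on the iconv implementation.
$utf16default = iconv('UTF-8', 'UTF-16', $utf8string);

// "UTF-16LE": fixed byte order, no BOM, 2 bytes per character here.
$utf16le = iconv('UTF-8', 'UTF-16LE', $utf8string);

echo bin2hex($utf16default), "\n"; // varies between systems
echo bin2hex($utf16le), "\n";      // 610062006300

// Converting back gives the original UTF-8 string.
$roundTrip = iconv('UTF-16LE', 'UTF-8', $utf16le);
var_dump($roundTrip === $utf8string); // bool(true)
```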

iconv can also be used to convert from other character sets, such as ISO-8859-1, like this:
$utf16string = iconv("iso-8859-1", "utf-16", $latin1string);
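
iconv returns false when a conversion fails, so it is worth checking the result before using it. Here is a small sketch of that check, assuming the iconv extension; latin1ToUtf16le() is a hypothetical helper name invented for this example.

```php
<?php
// Hypothetical helper: ISO-8859-1 to UTF-16LE, failing loudly on error.
function latin1ToUtf16le(string $latin1string): string
{
    $utf16string = iconv('ISO-8859-1', 'UTF-16LE', $latin1string);
    if ($utf16string === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    return $utf16string;
}

// "é" is byte 0xE9 in ISO-8859-1 and code point U+00E9 in Unicode.
echo bin2hex(latin1ToUtf16le("\xE9")), "\n"; // e900
```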

