Speak in Unicode with PHP and MySQL

The English language has a very simple alphabet that contains only the basic Latin letters. In today's information technology world this can be considered an advantage, because we can be pretty sure that every computer will be able to display and process it correctly. However, many more people use other languages with more complex alphabets and writing systems. For a long time they have had to deal with character encodings and incompatibilities between different formats. The solution for them is Unicode, the character encoding standard that aims to cover all languages and symbols. In this post, I’m going to show how to use Unicode in PHP and MySQL applications.

Character encoding history

Currently, the base of nearly all character encoding systems is the ASCII code, which assigns 7-bit binary codes to 128 basic symbols: Latin letters, digits, punctuation marks and nonprintable control characters such as “new line”. The codes are very important. For computers and machines, text is only a sequence of bits. ASCII assures us that if they encounter the binary sequence ‘0100 0111’ (decimal 71), they display the ‘G’ symbol, and so on. ASCII is perfect for writing computer programs and English texts, because it is universal. Unfortunately, at the time it was developed, programmers did not bother with such marginal issues as other languages. It was impossible to use French or German-specific characters, and we could forget about writing in Russian altogether.

This led to the concept of “character encodings”. The ASCII standard used 7-bit codes, but the bytes and words on most platforms were at least 8 bits long. Different standards bodies simply used the extra bit to extend the code from 128 to 256 symbols, adding national characters and some extra symbols. Now French speakers could use their é, Germans their ä, and so on. However, this produced more than one ‘standard’ for the same language. For example, Polish uses nine extra letters: ą, ę, ś, ć, ź, ń, ł, ó and ż. In the early 1990s they could be saved in at least four ways:

  • DOS Codepage 852
  • Windows Codepage 1250
  • ISO-8859-2
  • Mazovia

They were incompatible with one another. Below, we can see the code of the letter ś in each of them:

  • DOS Codepage 852: 152
  • Windows Codepage 1250: 156
  • ISO-8859-2: 182
  • Mazovia: 158

Suppose we saved the word świat (Polish for “world”) in the Windows code page, where ś is stored as code 156. Software written for ISO-8859-2 expects ś at code 182 and has no printable character at 156, and a program that assumes a Western European code page would show œwiat instead. We encounter even more serious problems if we try to use French and Polish characters in the same document. French used the ISO-8859-1 encoding, where some letters share their codes with Polish letters in ISO-8859-2, which makes it impossible to distinguish them. In the past, nearly all software suffered from encoding problems, and computer users had to be masters of character conversion.
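The świat problem can be reproduced in a few lines. This is only a sketch: it assumes the iconv extension is installed and knows the CP1250 and CP1252 code pages, and it assumes the reading program interprets the bytes as Windows-1252, the Western European code page in which byte 156 is œ.

```php
<?php
// "świat", written with an escape so the example does not depend on
// how this file itself is saved (requires PHP 7+).
$original = "\u{015B}wiat";

// Save the word the way a Windows-1250 editor would: ś becomes one byte.
$cp1250Bytes = iconv('UTF-8', 'CP1250', $original);

// Read the very same bytes as if they were Windows-1252 (an assumption
// about the reading program) and convert the result back to UTF-8.
$misread = iconv('CP1252', 'UTF-8', $cp1250Bytes);

echo ord($cp1250Bytes[0]), "\n"; // numeric code of the first stored byte
echo $misread, "\n";             // what the confused program displays
```

No bytes were changed between the two steps. Only the table used to interpret them is different, which is exactly why these encodings could not be mixed safely.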

What is more, the extra 128 symbols were still not enough for languages such as Chinese or Japanese.

Unicode – a universal encoding

All the encodings we have mentioned so far are so-called one-byte encodings, because every symbol is encoded with one byte. They have many drawbacks, so a new approach was proposed: Unicode, developed by a consortium of leading software and hardware companies and various institutions. Unicode is a standard that provides comprehensive support for all the issues related to symbols and characters. It consists of several components:

  1. Symbol tables – Unicode aims to cover all the possible symbols on the planet. Each symbol must have its own, unique number even if it requires reaching very high values. The first 256 numbers are compatible with ISO-8859-1 standard.
  2. Collation and comparison algorithms – define the universal lexicographic order between every two Unicode symbols, including the language-specific extensions.
  3. Binary representations – in Unicode, the symbol number is not associated with a particular binary sequence. There are several binary representations developed for different purposes: UTF-8, UTF-16, UTF-32 etc.

As we can see, in Unicode the binary representation is separated from the symbol numbers. For example, in this standard the letter ś has the number 347. In the computer text it can be encoded in several different ways:

  • UTF-8: 1100 0101 1001 1011
  • UTF-16 (little-endian): 0101 1011 0000 0001
  • UTF-16 (big-endian): 0000 0001 0101 1011
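The three bit patterns in the list can be checked with a short script. It is a sketch that assumes PHP 7.2 or newer with the mbstring extension; `bits()` is a helper made up for this example. The second and third patterns are the little-endian and big-endian byte orders of UTF-16.

```php
<?php
// Hypothetical helper: render a byte string as groups of eight bits.
function bits(string $bytes): string
{
    $out = [];
    foreach (unpack('C*', $bytes) as $byte) {
        $out[] = sprintf('%08b', $byte);
    }
    return implode(' ', $out);
}

$char = "\u{015B}"; // the letter ś, stored here as UTF-8

echo 'number:   ', mb_ord($char, 'UTF-8'), "\n";
echo 'UTF-8:    ', bits($char), "\n";
echo 'UTF-16LE: ', bits(mb_convert_encoding($char, 'UTF-16LE', 'UTF-8')), "\n";
echo 'UTF-16BE: ', bits(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8')), "\n";
```

All three byte sequences decode to the same number, which is the point of separating symbol numbers from their binary representations.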

Despite the different physical representations, these sequences still mean the same number, 347, so every computer that understands Unicode can display it properly. Before we start using Unicode, we must choose the representation that best suits our needs. In UTF-8, the first 128 symbols are encoded in one byte, making it fully compatible with the original 7-bit ASCII. The symbols with numbers from 128 to 2047 use two bytes, those from 2048 to 65 535 use three bytes, and so on. UTF-8 is perfect for European languages thanks to its economical use of space. However, in Chinese or Japanese it leads to significantly longer texts, because it needs more space to encode the symbols with higher numbers. Of course we can still put them in our texts, but if we are developing a website that will contain primarily Chinese articles, we should consider choosing another Unicode representation, such as UTF-16.
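To get a feeling for these size differences, we can count the bytes each representation needs for a few sample symbols. This is a minimal sketch assuming PHP 7+ with mbstring. The sample characters and the `encodedSizes()` helper are chosen only for illustration.

```php
<?php
// Hypothetical helper: byte counts of one UTF-8 character in two encodings.
function encodedSizes(string $utf8Char): array
{
    return [
        'utf8'  => strlen($utf8Char),
        'utf16' => strlen(mb_convert_encoding($utf8Char, 'UTF-16BE', 'UTF-8')),
    ];
}

$samples = [
    'Latin letter A (number 65)'      => "A",
    'Polish letter s-acute (347)'     => "\u{015B}",
    'Euro sign (8364)'                => "\u{20AC}",
    'Chinese character zhong (20013)' => "\u{4E2D}",
];

foreach ($samples as $label => $char) {
    $size = encodedSizes($char);
    printf("%s: %d byte(s) in UTF-8, %d in UTF-16\n", $label, $size['utf8'], $size['utf16']);
}
```

For the Chinese character UTF-8 needs three bytes where UTF-16 needs two, while for plain ASCII the relation is reversed.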

Unicode pros and cons

Pros:

  • Allows us to encode every symbol on the planet in one document.
  • The symbols have unique numbers, which simplifies text processing.
  • The symbol numbers are separated from the binary representations.
  • The standard provides support for comparing strings in lexicographic order.

Cons:

  • Unicode is a multibyte encoding. Traditional software written for one-byte encodings has serious problems processing it.
  • Multibyte encodings require more complex processing algorithms.
  • It is a bit more complex to use.

Why should we use Unicode on websites?

A home page used for publishing static articles probably does not need Unicode at this time, but every web application focusing on community activity and social events should seriously consider implementing it. Even if the main content is written in English, the users may use it for many different purposes, such as language teaching or help with translation. National symbols often appear in people’s first or last names, which they may use even though they write in English. Personally, I’m not very happy if I want to sign my comment with my real name and the application throws an error because it does not know how to process the letter ę (e.g. Technorati).

Anyway, in spite of some technical problems, Unicode is the future. If you are seriously considering writing a community application and supporting languages other than English, you should use it instead of ISO-8859-x etc.

Unicode support in PHP and MySQL

MySQL has provided comprehensive support for Unicode since version 4.1. Programmers may select character encodings and collations for nearly all database items: databases, tables, columns, connections, etc.

The PHP language parser is written for one-byte encodings. The ordinary string processing functions do not recognize Unicode sequences. Our multibyte letter ś occupies two bytes in UTF-8 and each of them is accessed separately. Applying strlen() to the word świat reports 6 instead of 5, because it counts bytes, not characters. There are some exceptions:

  • PCRE regular expressions understand and handle Unicode sequences properly when the /u modifier is used.
  • The mbstring (“multibyte string”) extension provides reimplementations of the most common string functions that handle multibyte symbols.
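The difference between counting bytes and counting characters is easy to see side by side. The sketch below assumes PHP 7+ with the mbstring extension. The word is written with an escape sequence so that the result does not depend on how the script file is saved.

```php
<?php
$word = "\u{015B}wiat"; // "świat" as UTF-8

// strlen() counts bytes.
echo 'strlen:         ', strlen($word), "\n";

// mb_strlen() counts characters once it is told the encoding.
echo 'mb_strlen:      ', mb_strlen($word, 'UTF-8'), "\n";

// With the /u modifier PCRE treats pattern and subject as UTF-8,
// so "." matches a whole character ...
echo 'PCRE with /u:   ', preg_match_all('/./u', $word), "\n";

// ... and without it "." matches single bytes.
echo 'PCRE without u: ', preg_match_all('/./', $word), "\n";
```

Passing the encoding explicitly to the mb_* functions avoids depending on the configured default encoding.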

Native Unicode support was planned for PHP 6, which was meant to implement Unicode processing in all extensions, all functions and the internal parser structures. That branch was eventually abandoned and never released, so PHP strings remain plain byte sequences. Since PHP 5.3, the bundled intl extension provides partial Unicode support and simplifies writing international software.

Configuring MySQL for Unicode

It’s time to show how to configure our applications to use Unicode. There are some tricky issues we must keep in mind. First of all, we will configure the database. We have to select the character encoding and the collation for each item in the database.

  • Character encoding – defines the internal representation of the string data. Example encodings are latin1 (Western European languages, ISO-8859-1), latin2 (Central/Eastern European languages, ISO-8859-2) and utf8.
  • Collation – defines the algorithm of comparing and determining the lexicographical order.
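As a sketch of how both settings are chosen, the script below builds the statements for a database, a table and a single column. The names `blog`, `words` and `word` are hypothetical, and the statements are only printed. To apply them we would pass each one to `$pdo->exec()` on an open connection. Note that newer MySQL versions also offer utf8mb4, a variant able to store 4-byte UTF-8 sequences, but the example keeps the utf8 name used in this article.

```php
<?php
// Hypothetical schema; only the CHARACTER SET / COLLATE clauses matter here.
$statements = [
    // Default encoding and collation for everything created in the database.
    "CREATE DATABASE blog CHARACTER SET utf8 COLLATE utf8_unicode_ci",

    // A table with its own default and one column that overrides the collation.
    "CREATE TABLE blog.words (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        word VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_polish_ci NOT NULL
    ) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci",
];

foreach ($statements as $sql) {
    echo $sql, ";\n\n";
}
```

Settings are inherited downwards: a column without its own clause uses the table default, and a table without one uses the database default.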

To understand the concept of a collation, consider the following example. We have a list of words stored in a MySQL table and want to get them sorted:

system
świat
artysta
grom
duży
las
ławka
zebra
natura

Naive sorting algorithms work for English, because the basic 26 Latin letters are assigned codes in lexicographical order: a is 97, b is 98 and so on. But the extra national letters use bigger codes, so for our list we would get:

artysta
duży
grom
las
natura
system
zebra
ławka
świat

And this is not what we expected. Our algorithm must know that in Polish, “ł” goes after “l” and “ś” after “s”. This knowledge is provided by the collation. Once we apply a Polish-aware collation (for example utf8_polish_ci in MySQL), we get the correctly sorted words:

artysta
duży
grom
las
ławka
natura
system
świat
zebra
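The naive result is easy to reproduce in PHP, whose sort() compares strings byte by byte. This is a minimal sketch that assumes the script file is saved as UTF-8. It shows only the byte-wise ordering, in which the words starting with national letters land after zebra.

```php
<?php
// The word list from the example above, as UTF-8 strings.
$words = ['system', 'świat', 'artysta', 'grom', 'duży', 'las', 'ławka', 'zebra', 'natura'];

// SORT_STRING compares the raw bytes, with no knowledge of any language.
sort($words, SORT_STRING);

echo implode("\n", $words), "\n";
```

Every byte of a multibyte UTF-8 character is greater than 127, so such a character always sorts after all ASCII letters regardless of its place in the alphabet.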

In the case of UTF-8 we have two basic collations: utf8_general_ci and utf8_unicode_ci. There are some differences between them:

  • utf8_general_ci is a somewhat simplified collation. It is generally faster than utf8_unicode_ci, but slightly less correct. For example, in German and some other languages ß is equal to ss, but this collation treats it as s.
  • utf8_unicode_ci aims to provide a full implementation of the Unicode Collation Algorithm. The goal has not been achieved yet. According to the MySQL user manual, there are some unsupported characters and the combining characters do not work correctly in some cases. This primarily affects Vietnamese and some smaller languages like Navajo.

Besides the general collation algorithms, some languages use additional rules that must be implemented separately. French is quite well covered by the general algorithm, so we do not need an extra collation for it and can simply use either utf8_general_ci or utf8_unicode_ci. However, as Kazoui pointed out in the comments, there are some small differences between the general collation algorithm and the French rules.

This is not true in the case of Polish, which has an extended collation utf8_polish_ci, or Slovak (utf8_slovak_ci). Actually, I do not know exactly what extension is applied in the Polish collation :) but in the Slovak one the algorithm must know, for example, that the digraph ch goes right after h. Of course selecting such an extra collation does not mean that we lose the ability to handle French. As I said before, they are just extensions to the default algorithm.

If we do not want language-aware comparison (e.g. for binary data), we should choose the utf8_bin collation, which compares the binary codes of the characters.

The last thing we need to do is to select the connection encoding once we connect to the database:

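$pdo = new PDO('mysql:connection settings', 'user', 'password');
// Tell the server which encoding this connection sends and expects.
// (Since PHP 5.3.6 the DSN option charset=utf8 has the same effect.)
$pdo->exec('SET NAMES utf8');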

Now our database is configured to use UTF-8.

Configuring PHP to use Unicode

As we mentioned before, PHP does not support Unicode internally, so currently there is little we can do. We must be aware that functions such as strlen() can give misleading results for UTF-8 sequences and that some operations may even break them (e.g. direct access with $string[13]). We can either live with it or install extra PHP libraries such as the CentralNIC Unicode Library:

<?php
require('Unicode.php');

$string = new Unicode_String();
$string->fromUTF8("Ĥēĺļŏ, Ŵőřļď!");

print "String contains ".$string->length()." characters.\n";

print "String contains characters from the following blocks:".implode(', ', $string->blocks())."\n";

print "String in UTF-8: ".$string->toUTF8()."\n";

print "String in UPPERCASE: ".$string->toUpper()->toUTF8()."\n";

$comma = new Unicode_Character(ord(','));

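// Assumption: explode() is a method of Unicode_String in this library;
// a bare explode($comma) call with one argument is not valid PHP.
$words = $string->explode($comma);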

foreach ($words as $word) {
    print "Word has ".$word->length()." characters.\n";
}

The example comes from the summary on the project website and shows how to manipulate Unicode strings with the extra classes the library provides.
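If installing a third-party library is not an option, the mbstring extension mentioned earlier covers the most common cases. The following sketch assumes PHP 7+ with mbstring enabled. It shows how direct byte access breaks a UTF-8 sequence and how the mb_* counterparts avoid that.

```php
<?php
$word = "\u{015B}wiat"; // "świat" as UTF-8

// Direct access returns a single byte: only the first half of ś.
$firstByte = $word[0];

// mb_substr() works on whole characters.
$firstChar = mb_substr($word, 0, 1, 'UTF-8');

var_dump(mb_check_encoding($firstByte, 'UTF-8')); // is the lone byte valid UTF-8?
var_dump(mb_check_encoding($firstChar, 'UTF-8')); // is the extracted character valid?

// Case conversion that knows about non-ASCII letters.
echo mb_strtoupper($word, 'UTF-8'), "\n";
```

A string cut in the middle of a sequence, like `$firstByte` here, is no longer valid UTF-8 and is typically rendered as a replacement symbol.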

Since PHP 5.3.0 we can perform certain Unicode manipulations with the native functions of the intl extension. For example, to sort an array with a specified collation we could use the following code:

$array = array('zebra', 'świat', 'system');

$collator = new Collator('pl_PL');
$collator->sort($array);

var_dump($array);

More information about the intl extension can be found in the PHP manual and in Internationalization in PHP 5.3 by Stas Malyshev.

Configuring the browser to render Unicode

Finally, we must inform the browser that the response is encoded in UTF-8. This is a bit tricky, because the encoding declared in the HTTP headers takes precedence over the META tag in the HTML code, so a META tag alone may simply be ignored when the server sends a different default. The most reliable way to inform the browser about UTF-8 content is to send an HTTP header:

header('Content-type: text/html;charset=utf-8');

The header must be sent before any HTML content. It ensures that properly encoded UTF-8 text is displayed as UTF-8, and browsers will normally submit form input such as POST data in the same encoding.
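Putting this together, a page could send the header defensively and check that incoming data really is UTF-8 before using it. This is only a sketch: `send_utf8_header()` and `is_valid_utf8()` are hypothetical helper names, the `name` form field is made up, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: send the header only while it is still possible.
function send_utf8_header(): bool
{
    if (headers_sent()) {
        return false; // some output already went out, too late for headers
    }
    header('Content-Type: text/html; charset=utf-8');
    return true;
}

// Hypothetical helper: accept only strings that are well-formed UTF-8.
function is_valid_utf8($value): bool
{
    return is_string($value) && mb_check_encoding($value, 'UTF-8');
}

send_utf8_header();

// Browsers normally submit forms in the page encoding, but a client can
// send anything, so the input is verified instead of trusted.
$name = $_POST['name'] ?? '';
if (!is_valid_utf8($name)) {
    $name = '';
}

// The META tag is kept as a harmless fallback, e.g. for saved copies of the page.
echo '<!DOCTYPE html><html><head><meta charset="utf-8"><title>Demo</title></head><body>',
    htmlspecialchars($name, ENT_QUOTES, 'UTF-8'),
    '</body></html>';
```

Passing 'UTF-8' to htmlspecialchars() keeps the escaping step consistent with the encoding declared in the header.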

Troubleshooting

If you encounter problems with Unicode in your PHP application, check the following things:

  • Does your website send the proper HTTP header?
  • Does your database have the proper character encodings and collations selected for the databases, tables and columns?
  • Does your script send the SET NAMES query that configures the connection encoding?
  • Does your script use functions that break or alter UTF-8 sequences?

Useful software

Below, you can find some libraries and scripts that may be useful when working with UTF-8:

  • CentralNIC Unicode Library – described above.
  • Kohana Framework – a fork of CodeIgniter rewritten for PHP 5 that provides full UTF-8 compatibility in its system classes and architecture.
  • Drupal CMS natively uses UTF-8.
  • MediaWiki, the wiki engine that powers Wikipedia, provides excellent and well-tested Unicode support.
  • Open Power Template 2 provides partial Unicode support. It helps send the proper HTTP headers, properly converts HTML/XML entities and optionally supports Unicode in XML tag names.

Summary

Due to the lack of native support in the PHP language and the more complex processing algorithms, using Unicode in web applications can still be a bit tricky, but it is certainly possible. There are many successful PHP applications that use it natively, and they show it is definitely worth the risk. I’ve been using UTF-8 in my projects and websites for a couple of years now and I know it was the right choice. I hope that this article has helped you switch your application to this standard and shown you what problems non-English users can run into if the application is not properly written.

By Rz Rasel Posted in Php
