Don't let Unicode support be the death of you

You Americans (and British, Australians and any other English speakers) have it easy. When you need to create a web application, all you need for your text is the basic English alphabet: just 26 letters and 10 digits. The rest of the world, myself included, isn't so lucky. Our languages are damn complicated, with additional letters, along with grammatical 'features' such as accents that show where the stress falls in a word.

Seeing that I was born in Chicago, my first language is English. Although my parents are both from Puerto Rico, they never forced Spanish on me. When I moved to Puerto Rico at 8 years old, the only word in Spanish I knew was "Gracias". Unfortunately, I didn't have Dora The Explorer to help me learn the language when I was a boy. It's taken years, but I finally consider myself fully bilingual in English and Spanish, and am usually very careful as far as spelling and grammar go.

However, I've been wracking my brain for a long time now when making web applications in Spanish. Sometimes the additional characters, like the letter "ñ", or words like "presentación" are a pain to get working immediately. It's like I have to jump through hoops to get those characters working right.

In case anyone else shares my pain (or in case I forget in the next couple of months), I've compiled a short list with the things you should first look for when these Unicode characters are appearing incorrectly.

Don't forget the <meta> tag

One common mistake web developers make is forgetting to declare the character set of the page they're working on. Without this tag (and without a charset from the server), the browser falls back to its own default, often ISO-8859-1, and misreads each multi-byte UTF-8 character as two or more unrelated characters. To declare that the page is UTF-8, which can represent every Unicode character, you simply need to add the following meta tag as the first line in the <head> section:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

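This should force your browser to use UTF-8 when displaying the page's characters. Remember to put it first, before any other tags in the head section: browsers only look for it in the first 1024 bytes of the document, and anything that comes before it may already have been read with the wrong character set.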
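
If you're curious what the browser actually gets wrong, here's a minimal Python sketch (Python is just my choice for illustration; it has nothing to do with the page itself). It encodes a word as UTF-8, the way a properly saved file stores it, and then decodes those same bytes as ISO-8859-1, which is roughly what a browser does when nobody tells it the page is UTF-8.

```python
# The same bytes, read with two different character sets.
text = "presentación"

# How the word is stored on disk in a UTF-8 file ("ó" becomes two bytes).
raw = text.encode("utf-8")

# A browser assuming ISO-8859-1 turns every byte into its own character.
wrong = raw.decode("iso-8859-1")

# A browser that was told the page is UTF-8 gets the original word back.
right = raw.decode("utf-8")

print(wrong)  # presentaciÃ³n
print(right)  # presentación
```

The bytes never change. Only the label the browser applies to them does, which is why one small tag makes such a difference.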

Check your web server configuration

I remember once working with a PHP application, and pulling my hair out because the characters simply wouldn't display correctly, no matter what I did. After hours of searching through Google, I found my problem: the Apache Web Server, which was responsible for serving the website, had its default character set set to ISO-8859-1. A charset sent in the HTTP Content-Type header takes precedence over the meta tag, so the tag alone didn't help. If you control your own server and can change Apache's configuration, just go to the configuration file named httpd.conf (this name can vary, depending on your Linux distribution) and make sure the AddDefaultCharset option is correctly set:

AddDefaultCharset UTF-8

After reloading the Apache Web Server, your pages should be displaying Unicode characters correctly. With the Lighttpd Web Server, which I'm using now, I haven't had to set any option for correct Unicode support. However, in case someone needs it, just go to your Lighttpd configuration file, go to the mimetype.assign section, search for the .html assignment, and add the charset to the end of its value, like this:

".html" => "text/html; charset=utf-8"

Another file to verify, although not necessary in most cases, is the PHP configuration file, named php.ini. When the default_charset option is empty, PHP sends no charset in the Content-Type header, so the browser uses the meta tag mentioned above. But sometimes some joker sets default_charset to something else, and that value then overrides your meta tag. In this case, either comment out the default_charset option or set it to "UTF-8" (the default since PHP 5.6), and reload your web server.
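
To check what your server really sends, look at the charset parameter of the Content-Type header. Here's a small Python sketch using only the standard library. The function name charset_from_content_type is my own invention, and the header values are made-up examples rather than output from a real server.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


# Example header values (illustrative, not captured from a real server).
with_utf8 = charset_from_content_type("text/html; charset=UTF-8")
with_latin1 = charset_from_content_type("text/html; charset=ISO-8859-1")
without_charset = charset_from_content_type("text/html")

print(with_utf8, with_latin1, without_charset)
```

Against a live site, urllib.request.urlopen(url).headers.get_content_charset() should give you the same kind of answer. If it comes back as anything other than utf-8, the server configuration is the place to look, not your HTML.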

The database has data too, you know...

With those two fixes above, your static text should be displaying correctly. However, you may notice that Unicode characters stored in and retrieved from the database are still displayed incorrectly. This is usually because your database isn't using UTF-8. In my app, I'm using MySQL, and the database server's default is the latin1 character set with the latin1_general_ci collation, which can't store most Unicode characters and garbles UTF-8 text sent to it. If you don't explicitly indicate which character set you want to use for your database, the default will be used not only for the database, but for every table in it (unless a table explicitly defines its own). What we want is the utf8 character set, here with the utf8_bin collation, which will store the Unicode characters correctly.

There are different ways to change this default behavior, from starting the database server with an option that changes the default character set, to recompiling the entire program (provided it's Open Source). But I find it much easier to just remember to create your database with the correct character set. In the MySQL prompt on the command line, it's as simple as this (my_database is a placeholder for your own database name):

CREATE DATABASE my_database CHARACTER SET utf8 COLLATE utf8_bin;

If you have an existing database not using the UTF-8 character set, the easiest way is to use a program like phpMyAdmin for MySQL, or your preferred GUI for your database server, and change it there. You can also do it through the command line, but I won't go into those details here. Search Google and you'll get a ton of information.
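
If text has already gone through a latin1/UTF-8 mix-up on its way into or out of the database, it usually shows up as pairs of odd characters like "Ã³". This Python sketch shows one common way to undo that kind of damage. It assumes the text was UTF-8 that got read as ISO-8859-1 exactly once (a simplification, since MySQL's latin1 is closer to Windows-1252), and fix_mojibake is a hypothetical helper name, not part of any library.

```python
def fix_mojibake(value):
    """Undo one round of 'UTF-8 bytes read as ISO-8859-1', if that applies.

    Returns the input unchanged when the repair does not apply, for example
    when the text is already correct or contains non-Latin-1 characters.
    """
    try:
        # Turn each character back into the single byte it came from,
        # then read those bytes as the UTF-8 they originally were.
        return value.encode("iso-8859-1").decode("utf-8")
    except UnicodeError:
        return value


damaged = "presentaci\u00c3\u00b3n"   # looks like "presentaciÃ³n"
repaired = fix_mojibake(damaged)
untouched = fix_mojibake("presentación")

print(repaired)   # presentación
print(untouched)  # presentación
```

Treat this as a rescue tool for data that's already broken, not a replacement for setting the database character set properly in the first place.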

Your text editor has a hand in this too

Don't forget the tool you're using to create your web pages. It could be the one giving you major headaches. In my case, I'm testing out Intype, which is still in alpha, but very usable. Intype has the nasty habit (which I hope gets fixed soon) of automatically saving files as ANSI, meaning the Windows system code page, by default. Once you save the file with this character set, it stays that way, wreaking havoc on what you want to see.

To fix this problem, just make sure your text editor, whether it's Intype, e, TextMate, Vim or any other text editor in vogue right now, is saving your file using the correct character set. In my case, I'm using UTF-8 Plain with Intype, and my characters are showing up correctly.
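
If you're not sure how a file was saved, you can check it from outside the editor. The Python sketch below tests whether a file's bytes are valid UTF-8 and, if not, re-saves it as UTF-8. The helper names are my own, and the cp1252 source encoding is an assumption (it's the usual "ANSI" code page on Western European and US Windows installs); use whatever your editor actually wrote.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path, source_encoding="cp1252"):
    """Re-save a file as UTF-8, assuming it was written in source_encoding."""
    target = Path(path)
    text = target.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))


# Demo on a throwaway file saved the way an "ANSI" editor would save it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.html"
    sample.write_bytes("<p>presentación</p>".encode("cp1252"))

    before = is_valid_utf8(sample)   # the lone 0xF3 byte is not valid UTF-8
    convert_to_utf8(sample)
    after = is_valid_utf8(sample)
    converted_text = sample.read_text(encoding="utf-8")

print(before, after)
```

Keep in mind that a file containing only plain English characters passes the UTF-8 check no matter how it was saved, so a passing check on such a file tells you nothing about the editor's settings.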

These tips should save you a ton of headaches down the road if you're doing web development for a non-English audience. If you have any to add, feel free to do so.

Written by

Dennis Martinez
