How to configure your web application to correctly deal with character set/encoding issues
Jeremy Tunnell at
Background
I'll just link to the best places I've found to read up on the background. Read these first if you want to understand what's going on:
The Details
In order to get your website working harmoniously with one character set, you first have to pick one. I picked ISO-8859-1 (sometimes referred to as latin1). It's the most popular character set for English and other Western European (Latin-alphabet) languages, though it cannot represent most characters from other scripts.
Unfortunately, despite the fact that Unicode is the be-all and end-all, PHP is really horrible at dealing with Unicode (specifically multi-byte strings). Since I've got enough to worry about without also having to check every function I'm using for multi-byte compatibility, ISO-8859-1 is right for me, for now. I hear PHP6 will fully support multi-byte strings.
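To make the single-byte versus multi-byte distinction concrete, here is a small illustration. It is written in Python rather than PHP purely because the byte counts are easy to show there, and the helper name `byte_length` is made up for this sketch.

```python
def byte_length(text: str, charset: str) -> int:
    """Return how many bytes `text` occupies once encoded in `charset`."""
    return len(text.encode(charset))


word = "caf\u00e9"  # "cafe" with an acute accent on the e: four characters

# ISO-8859-1 is single-byte: one character is always one byte.
latin1_size = byte_length(word, "iso-8859-1")

# UTF-8 is multi-byte: the accented e takes two bytes, so a function that
# counts bytes reports one more "character" than the word really has.
utf8_size = byte_length(word, "utf-8")

# ISO-8859-1 has no slot for characters such as the euro sign (U+20AC).
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(latin1_size, utf8_size, euro_fits)
```

Any string function that quietly assumes one byte per character gives the right answer in the first case and the wrong one in the second, which is the compatibility check I'd rather avoid for now.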
In order to get your web application working correctly with one character set, there are basically three parts you must take care of:
- The database
- The web server
- The web page itself
Database config
In a perfect world, you would compile your database to default to your preferred character set. I won't cover that here.
Assuming you cannot do that, you need to create your tables using the correct character set. Keep in mind that, at least in MySQL, the character set can be configured all the way down to the column level. Here we will just set it at the database level, using latin1 to match the choice above; my_database is a placeholder name. (See here for much more detail):
CREATE DATABASE my_database CHARACTER SET latin1 COLLATE latin1_swedish_ci;
Next, we need to configure how the database delivers the results of queries to the calling application. Assuming you don't (or can't) ensure this is done at compile time, you can use the following command before sending a query:
SET NAMES charset
According to the linked article above, this is shorthand for setting the character_set_client, character_set_connection, and character_set_results variables in MySQL (which in turn sets collation_connection to the default collation for that character set). A good place to put this query is in the constructor of your database access class.
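As a rough sketch of the "put it in the constructor" idea, the class below issues SET NAMES once, as soon as it is handed a connection. It is written in Python for illustration only; `Database`, `RecordingConnection`, and the `latin1` default are all assumptions of this sketch, and the stand-in connection just records the SQL it receives so the example runs without a MySQL server.

```python
import re


class Database:
    """Hypothetical database access class. `connection` is any object
    that offers cursor(), with execute() and fetchall() on the cursor."""

    def __init__(self, connection, charset="latin1"):
        # Accept only plain charset names, since the value is placed
        # directly into the statement text.
        if not re.fullmatch(r"[A-Za-z0-9_]+", charset):
            raise ValueError(f"unexpected charset name: {charset!r}")
        self.connection = connection
        # Issue SET NAMES once, before any other query on this connection.
        self.connection.cursor().execute(f"SET NAMES {charset}")

    def query(self, sql, params=()):
        cursor = self.connection.cursor()
        cursor.execute(sql, params)
        return cursor.fetchall()


class RecordingConnection:
    """Stand-in for a real connection; it only records the SQL it is sent."""

    def __init__(self):
        self.statements = []

    def cursor(self):
        return self

    def execute(self, sql, params=()):
        self.statements.append(sql)

    def fetchall(self):
        return []


conn = RecordingConnection()
db = Database(conn, charset="latin1")
db.query("SELECT 1")
print(conn.statements)
```

Because the statement runs in the constructor, every query made through the class is guaranteed to come after it, and no caller has to remember to send it.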
Web Server
Assuming you are using Apache, you need to edit a setting in your httpd.conf file (the primary Apache config file).
AddDefaultCharset charset
Set this to the character set you want Apache to announce to the client when a response does not specify one. (This is important, as in many cases the client will trust the HTTP header even if a different character set is specified in the document.)
Also, whatever server-side language you're using will have a way to set the character set in outgoing headers. In PHP you use the header() function. See the PHP documentation for details.
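The point to keep in mind when setting the header from application code is that the bytes you send must actually be in the character set the header announces. Here is a minimal sketch of that, using a Python WSGI application as a stand-in for whatever server-side language you use; the names `app`, `CHARSET`, and `captured` are invented for the example.

```python
CHARSET = "iso-8859-1"


def app(environ, start_response):
    """Minimal WSGI application that declares its charset in the header."""
    html = (
        "<html><head><title>Caf\u00e9</title></head>"
        "<body>Caf\u00e9</body></html>"
    )
    # Encode the body with the same charset the header announces.
    body = html.encode(CHARSET)
    headers = [
        ("Content-Type", f"text/html; charset={CHARSET}"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# Call the application directly, the way a WSGI server would,
# and capture what it passes to start_response.
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


chunks = app({}, start_response)
print(captured["headers"]["Content-Type"])
```

Keeping the charset in one constant that feeds both the header and the encoding step makes it hard for the two to drift apart.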
The web page
Finally, you should specify the character set of the page you're serving with a meta tag at the top of the page, just below the <head> tag. It looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=charset" />
Getting obsessive
Most browsers support specifying the character set that a form should accept (and what the browser will convert the text to if it is not correct). To manually set this, include the following attribute in the <form> tag:
accept-charset='ISO-8859-1'
Doing the above should ensure that you operate using the same character set throughout your web application, and hopefully garbage characters and question marks will be forever in the past.
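If you are wondering where the garbage characters and question marks come from in the first place, the short illustration below reproduces both. It is a Python sketch for demonstration only, and the variable names are made up; each case writes text in one character set and reads or forces it into another.

```python
text = "caf\u00e9"  # "cafe" with an acute accent on the e

# Written as UTF-8 but read as ISO-8859-1: each of the two bytes of the
# accented e is shown as its own character, giving the classic garbage.
garbage = text.encode("utf-8").decode("iso-8859-1")

# Written as ISO-8859-1 but read as UTF-8: the lone byte 0xE9 is not
# valid UTF-8, so a lenient decoder substitutes U+FFFD.
replaced = text.encode("iso-8859-1").decode("utf-8", errors="replace")

# Forcing a character the target set lacks (the euro sign, U+20AC)
# into ISO-8859-1 with lenient error handling.
question = "\u20ac".encode("iso-8859-1", errors="replace")

print(repr(garbage), repr(replaced), question)
```

In all three cases the data itself was fine; the damage came from two parts of the chain disagreeing about the character set, which is what the settings above are meant to prevent.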
Reference
- DBMS and charsets - Settings for various databases.
- More charset settings for databases.
- MS Word and Web Development - Short and specific to MS Word.
- Turning MySQL data in latin1 to utf8 - Case study that gets down to the minute details.
- Recode: Character set conversion tool
- Fixing MySQL charset issues on existing data.