If, like me, you have an old WordPress blog, you’ll notice all kinds of problems with accented letters in recent versions of WordPress. The trouble is that WordPress was once young and foolish and created its MySQL database in the default latin-1 character set. That would have been fine and dandy, except for the fact that WordPress dumped UTF-8 encoded Unicode data into this database.
MySQL didn’t mind and PHP didn’t know anything about Unicode, so no harm was done. The trouble began when WordPress actually started requesting UTF-8 data from MySQL. MySQL sees that the tables are declared as latin-1, assumes the bytes in them really are latin-1, and converts them to UTF-8 on the way out.
That means that your data is double-encoded: één becomes Ã©Ã©n, and so on. One possible way to solve this problem is to leave the communication between WordPress and MySQL in latin-1 (DB_CHARSET set to 'latin1' in wp-config.php). The best solution, however, is to fix the database: mark all fields, tables, and the database charset as UTF-8. Trouble is, whenever you do this in, for instance, phpMyAdmin, MySQL converts the actual data to UTF-8, thus double-encoding the data once and for all.
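To make the double-encoding concrete, here is a minimal Python sketch of what happens to the bytes. It is only a simulation of the effect described above, not what MySQL runs internally, and the function names double_encode and repair are made up for this illustration.

```python
def double_encode(text: str) -> str:
    """Simulate UTF-8 bytes being stored in a latin-1 column and then
    converted 'from latin-1 to UTF-8' when they are read back."""
    utf8_bytes = text.encode("utf-8")    # the bytes WordPress actually wrote
    return utf8_bytes.decode("latin-1")  # each byte read as one latin-1 character


def repair(garbled: str) -> str:
    """Undo one round of double-encoding."""
    return garbled.encode("latin-1").decode("utf-8")


print(double_encode("één"))          # Ã©Ã©n
print(repair(double_encode("één")))  # één
```

Each é is two bytes in UTF-8, and each of those bytes is read back as a separate latin-1 character, which is why every accented letter turns into two.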
The solution for this new problem is an intermediate step: change all latin-1 fields to blobs (binary data) and then change them back again to UTF-8 encoded text fields. This works because MySQL doesn’t convert anything when going from latin-1 to blob, nor from blob to UTF-8. But it is quite laborious, and you have to delete and recreate all indexes.
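As a rough sketch of that intermediate step, the following Python helper generates the two ALTER TABLE statements for a single column. The helper name blob_roundtrip_sql and the type mapping are assumptions made for this example, and the utf8 charset name is only a default. The generated MODIFY clauses do not restate attributes such as NOT NULL or DEFAULT, and they do not handle indexes, so treat the output as a starting point to review and try on a backup first.

```python
# Assumed mapping from text column types to binary types of the same size.
BINARY_TYPE = {
    "VARCHAR": "VARBINARY",
    "TINYTEXT": "TINYBLOB",
    "TEXT": "BLOB",
    "MEDIUMTEXT": "MEDIUMBLOB",
    "LONGTEXT": "LONGBLOB",
}


def blob_roundtrip_sql(table, column, column_type, charset="utf8"):
    """Return the two statements for the latin-1 -> blob -> UTF-8 round trip."""
    # Split e.g. "VARCHAR(255)" into "VARCHAR" and "255)".
    base, _, length = column_type.partition("(")
    binary = BINARY_TYPE[base.upper()]
    if length:
        binary += "(" + length
    return [
        # Step 1: to binary, so MySQL stops interpreting the bytes as latin-1.
        f"ALTER TABLE `{table}` MODIFY `{column}` {binary};",
        # Step 2: back to text, now labelled with the charset the bytes are in.
        f"ALTER TABLE `{table}` MODIFY `{column}` {column_type} CHARACTER SET {charset};",
    ]


# Example with the post body column of a default WordPress install.
for statement in blob_roundtrip_sql("wp_posts", "post_content", "LONGTEXT"):
    print(statement)
```

Repeating this for every text column in every table is exactly the laborious part.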
So here’s my solution: