If you have an old WordPress blog like me, you’ll notice all kinds of problems with accented letters in late WordPress’ versions. The trouble is that WordPress was once young and foolish and created its MySQL database in the default latin-1 character set. Which was all fine and dandy, except the fact that WordPress dumped UTF-8 encoded unicode data into this database.

MySQL didn’t mind and PHP didn’t know anything about Unicode, so no harm done. The trouble began when WordPress actually started requesting UTF-8 data from MySQL. MySQL notices that the data in the tables is in latin-1 format and converts the latin-1 data to UTF-8.

That means that your data is double-encoded. één becomes één and so on. One possible way to solve this problem is to leave the communication between WordPress and MySQL in latin-1. (DB_charset = 'latin1'). The best solution, however, is to fix the database: mark all fields, tables, and the database charset as UTF-8. Trouble is, whenever you do this in for instance phpMyAdmin, MySQL converts the actual data to UTF-8, thus double-encoding the data once and for good.

The solution for this new problem is an intermediate step: mark all latin-1 fields as blobs (binary data) and then change them back again to UTF-8 encoded text fields. This solution works because MySQL doesn’t convert anything from latin-1 to blob, nor from blob to UTF-8. But is is quite labourous and you have to delete and recreate all indexes.

So here’s my solution:

#!/usr/bin/python

utf8 = [ unichr(i).encode('utf-8') for i in range(256) ]
double_utf8 = [ c.decode('latin1').encode('utf8') for c in utf8 ]

f = open('dbwordpress.sql', 'rb')
data = f.read()
f.close()

for i in range(256):
    data = data.replace(double_utf8[i], utf8[i])
    
data = data.replace ('DEFAULT CHARSET=latin1', 'DEFAULT CHARSET=utf8')

f = open('dbwordpress2.sql', 'wb')
f.write(data)
f.close()

A little explanation:

I took a database backup, dbwordpress.sql, from my provider (in SQL format). The script changes all double-encoded UTF-8 values back to single encoded UTF-8 values. After that it changes the default charset of all tables to UTF-8. I restored the database from my modified backup and all was fine and dandy.

Technical note: utf8 is a table of all latin-1 characters in UTF-8 format. double_utf8 is table of these characters, but encoded to UTF-8 once again. The script replaces all double-encoded latin-1 characters by their single encoded version.

This means that this solution won’t work if any of your posts or comments contain anything other than latin-1-characters, like Chinese or Japanese. In theory you should be able to just do:

data = data.decode('utf-8').encode('latin-1')

but somehow there are other, real Unicode characters (# > 0xff) in the mix, which means that the straight downconvert to 8-bit characters won’t work (the encode('latin-1') part). I don’t know how this is even possible since any 8 bit value converted to UTF-8 yields a valid Unicode character in the latin-1 range… It’s either a bug in MySQL’s latin-1-to-UTF-8 conversion, or, more likely, WordPress shoves some real binary data into text-fields…