Fixing MySQL latin1 to utf8mb4 issues

Recently I had to convert a MySQL database using latin1 encoding to utf8mb4 (aka proper unicode). The process took a long time to figure out, but ended up being very simple.

Before explaining, first you need to understand what character encoding is, and you can find out about that on the W3 website. Additionally, character sets (charsets) are methods of encoding characters. The most basic is ASCII, which has 256 possible characters, since each character is stored in 1 byte. The most complete and standard is UTF-8, which is unicode. UTF-8 supports characters up to 8 bytes long, so it supports the unicode Chinese-Japanese-Korean (CJK) character set. 你會中文嗎？Unicode會。

Wikipedia icon

Latin-1 is a subset of Unicode. It sucks. You should always use UTF-8. Unfortunately, a lot of MySQL databases seem to use Latin-1 for no discernible reason. This is what caused my problem.

How to convert latin1 to utf8mb4 (why you clicked this)

Here's how you convert from latin1 to utf8mb4. First, you have to dump to your database using latin1 as your encoding.

mysqldump --default-character-set=latin1 --databases <your database> -r dump_utf8mb4.sql

This will dump your database using latin1 as the connection's charset. This will make sure your dump isn't garbled. If you were to dump using utf8mb4 or something else as the charset, outputted unicode characters would be all messed up. This was a big source of confusion for me before I understood this.

Next, you need to replace instances of "latin1" with "utf8mb4" in your database. In the following example, I'm using the posix "sed" tool.

Your database may have rows that contain the word "latin1". CHECK BEFORE RUNNING THESE COMMANDS, OR RENAME latin1 TO utf8mb4 MANUALLY.

sed -i 's/SET NAMES latin1/SET NAMES utf8mb4/g' dump_utf8mb4.sql

# The next lines are extra measures to correct other instances of latin1.
# If your dump has issues, follow the previous steps again on a new dump, but omit the next lines.
# These lines can unintentionally overwrite data, even though it is unlikely.
sed -i 's/CHARSET=latin1/CHARSET=utf8mb4/g' dump_utf8mb4.sql
sed -i 's/latin1_general_ci/utf8mb4_general_ci/g' dump_utf8mb4.sql
sed -i 's/latin1 /utf8mb4 /g' dump_utf8mb4.sql # This line is the most dangerous, it has more damage potential than the others

Once instances of latin1 are renamed to utf8mb4, your dump should be ready to be used. Use the source command in the MySQL prompt to restore your database from the dump file.

If you don't have access to the original database and only have the dump file

This is a little trickier. First, restore your dump using a connection with latin1 encoding. You can use the --default-character-set=latin1 flag while connecting to do this. Next, follow the steps in the last section on your restored database to convert it to utf8mb4. You can then delete your restored database and then re-restore using your new dump file. The caveat to this is that some characters will just be replaced with question marks. This is the case with Cyrillic and some emojis. This might be due to latin1 only supporting certain parts of unicode, but I'm not sure. Either way, you won't be able to restore all of it, at least as far as I can tell. If you have any method of dumping from the original database, do that instead.

Caveats

There are sometimes issues with this method. Sometimes rows aren't understood correctly and will fail to be restored from the dump. This was the case with a (significant) minority of rows in a large production database that I had to migrate. If you are having issues, the best thing you can do is import with latin1 encoding on both ends, and then try to convert it later with MySQL by altering the database/table character encoding. Always test your results before accepting them, and always keep backups.