Before explaining, first you need to understand what character encoding is, and you can find out about that on the W3 website. Additionally, character sets (charsets) are methods of encoding characters. The most basic is ASCII, which has 256 possible characters, since each character is stored in 1 byte. The most complete and standard is UTF-8, which is unicode. UTF-8 supports characters up to 8 bytes long, so it supports the unicode Chinese-Japanese-Korean (CJK) character set. 你會中文嗎?Unicode會。
Latin-1 is a subset of Unicode. It sucks. You should always use UTF-8. Unfortunately, a lot of MySQL databases seem to use Latin-1 for no discernible reason. This is what caused my problem.
How to convert latin1 to utf8mb4 (why you clicked this)
Here's how you convert from latin1 to utf8mb4. First, you have to dump to your database using latin1 as your encoding.mysqldump --default-character-set=latin1 --databases <your database> -r dump_utf8mb4.sqlThis will dump your database using latin1 as the connection's charset. This will make sure your dump isn't garbled. If you were to dump using utf8mb4 or something else as the charset, outputted unicode characters would be all messed up. This was a big source of confusion for me before I understood this.
Next, you need to replace instances of "latin1" with "utf8mb4" in your database. In the following example, I'm using the posix "sed" tool.
Your database may have rows that contain the word "latin1". CHECK BEFORE RUNNING THESE COMMANDS, OR RENAME latin1 TO utf8mb4 MANUALLY.
sed -i 's/SET NAMES latin1/SET NAMES utf8mb4/g' dump_utf8mb4.sql # The next lines are extra measures to correct other instances of latin1. # If your dump has issues, follow the previous steps again on a new dump, but omit the next lines. # These lines can unintentionally overwrite data, even though it is unlikely. sed -i 's/CHARSET=latin1/CHARSET=utf8mb4/g' dump_utf8mb4.sql sed -i 's/latin1_general_ci/utf8mb4_general_ci/g dump_utf8mb4.sql sed -i 's/latin1 /utf8mb4 /g' dump_utf8mb4.sql # This line is the most dangerous, it has more damage potential than the othersOnce instances of latin1 are renamed to utf8mb4, your dump should be ready to be used. Use the
source
command in the MySQL prompt to restore your database from the dump file.If you don't have access to the original database and only have the dump file
This is a little trickier. First, restore your dump using a connection with latin1 encoding. You can use the--default-character-set=latin1
flag while connecting to do this. Next, follow the steps in the last section on your restored database to convert it to utf8mb4. You can then delete your restored database and then re-restore using your new dump file. The caveat to this is that some characters will just be replaced with question marks. This is the case with Cyrillic and some emojis. This might be due to latin1 only supporting certain parts of unicode, but I'm not sure. Either way, you won't be able to restore all of it, at least as far as I can tell. If you have any method of dumping from the original database, do that instead.