When dealing with databases and arbitrary strings, one of the most annoying pitfalls is encoding. Most likely 99.8% of all active developers are able to tell stories, ranging from horrible to hilarious.
Same applies to myself. Eventually, probably after such an incident, one of my former mentors told me:
Just use UTF-8 everywhere and you won’t have any problems.
I did exactly that, and he was right. Things worked and I haven’t seen a questionmark or similar placehoders. Until now.
My scenario was quite simple: a SpringBoot Kotlin application with a GraphQL API, a downstream MariaDB with collation set to utf8_general_ci
and an upstream frontend application.
One of the applications tasks was, adding emojis to specific recognized patterns in plain text user inputs, prior to storing them in the database. Some of the used emojis are ✨ and 🦄.
When testing end to end, I observed that entries with added sparkles were processed correctly, but any entry with an added unicorn vanished. No �
or ▯
, the record was simply not saved to database with the following error:
1 | WARN ...SqlExceptionHelper: SQL Error: 1366, SQLState: 22007 |
Testing a few different emojis yielded the completely unscientific observation, “old” emojis like smileys work, “new” emojis like the lollipop don’t. And as this already had an encoding smell, I dug into the unicode character representations, finding that both emojis in question do have different byte sizes:
✨ is three bytes long e2 9c a8
🦄 is four bytes long f0 9f a6 84
So next question is, does the used character set of the database, utf8_general_ci
, support four byte unicode characters?
Short answer: No. Use utf8mb4_unicode_ci
instead.
Long answer:
In MySQL resp. MariaDB, utf8
is an alias for utf8mb3
. The abbreviation mb
stands for multi byte and the following number determines how many bytes the character may consist of at max - here 3.
And as the unicorn consists of four bytes, this is the root cause of the Incorrect string value...
error.
Conclusion: The mentioned advice needs to be modified:
Just use utf8mb4_unicode_ci everywhere and you won’t have any problems.
Honestly, this is not the most future-proof advice, but until I have had the time to play around with other character sets and find a better alternative, I will stick to this rule of thumb.
This would have been a great bug to find in initial development. Prior to having a somewhat mature web application and a database filled with thousands of records. But it was not.
So, as the application already used Flyway, the plan was, to create a migration script that modifies the collation of the database, every tables and every string resp. varchar column.
1 | ALTER DATABASE `emoji_hell` |
This example works fine, until you reach a varchar column, used in a foreign key constraint.
1 | SQL Error (1832): Cannot change column 'user_id': used in a foreign key constraint 'FK1misndtpfm9hx3ttvixdus8d1' |
Now the foreign key check complains, that the two columns would not match afterwards.
In my scenario, disabling the check did not help, but for the sake of completeness, this is how it should work:
1 | SET foreign_key_checks=0; |
Anyway, I chose to walk the extra mile and drop all foreign keys, modify database, tables and columns and add new foreign keys again. Also, as I do not know, if the foreign key names are of any importance to the persistence framework, I decided I’ll preserve them.
It turned out, that this was another extra mile.
1 | SET @post_user_fk_name = ( |
This sample extracts a foreign key name into a variable, which is then used in the drop as well as the add statement. It is then concatinated into an actual statement string, prepared into a prepared statement, executed and deallocated after use.
After doing the actual database modification, we can recreate the foreign keys in almost the same procedure:
1 | SET @post_user_add_fk_statement = |
And, we’re done.
It is now possible to save unicorns to the database 🦄
Yes, it was a pain, but actually a fun challenge. At least it was a reminder, to stay up to date with character sets and also forgotten things like SQL. The foreign key name preservation really took me quite a while, but I really wanted to avoid any further problems. Especially as having issues with the persistence layer not finding some arbitrary foreign key names sound like the are terrible to debug.