Emojis, collations and prepared statements - A new path to encoding hell

2021-08-09

Chapter 1 - Prologue

When dealing with databases and arbitrary strings, one of the most annoying pitfalls is encoding. Most likely 99.8% of all active developers are able to tell stories, ranging from horrible to hilarious.

Same applies to myself. Eventually, probably after such an incident, one of my former mentors told me:

Just use UTF-8 everywhere and you won’t have any problems.

I did exactly that, and he was right. Things worked and I haven’t seen a questionmark or similar placehoders. Until now.

Chapter 2 - A new future

My scenario was quite simple: a SpringBoot Kotlin application with a GraphQL API, a downstream MariaDB with collation set to utf8_general_ci and an upstream frontend application.
One of the applications tasks was, adding emojis to specific recognized patterns in plain text user inputs, prior to storing them in the database. Some of the used emojis are ✨ and 🦄.

When testing end to end, I observed that entries with added sparkles were processed correctly, but any entry with an added unicorn vanished. No � or ▯, the record was simply not saved to database with the following error:

1 2	WARN ...SqlExceptionHelper: SQL Error: 1366, SQLState: 22007 ERROR ...SqlExceptionHelper: Incorrect string value: '\xF0\x9F\xA6\x84' for column 'caption' at row 1

Testing a few different emojis yielded the completely unscientific observation, “old” emojis like smileys work, “new” emojis like the lollipop don’t. And as this already had an encoding smell, I dug into the unicode character representations, finding that both emojis in question do have different byte sizes:

✨ is three bytes long e2 9c a8
🦄 is four bytes long f0 9f a6 84

So next question is, does the used character set of the database, utf8_general_ci, support four byte unicode characters?

Short answer: No. Use utf8mb4_unicode_ci instead.

Long answer:
In MySQL resp. MariaDB, utf8 is an alias for utf8mb3. The abbreviation mb stands for multi byte and the following number determines how many bytes the character may consist of at max - here 3.
And as the unicorn consists of four bytes, this is the root cause of the Incorrect string value... error.

Conclusion: The mentioned advice needs to be modified:

Just use utf8mb4_unicode_ci everywhere and you won’t have any problems.

Honestly, this is not the most future-proof advice, but until I have had the time to play around with other character sets and find a better alternative, I will stick to this rule of thumb.

Chapter 3 - Changing the past

This would have been a great bug to find in initial development. Prior to having a somewhat mature web application and a database filled with thousands of records. But it was not.
So, as the application already used Flyway, the plan was, to create a migration script that modifies the collation of the database, every tables and every string resp. varchar column.

ALTER DATABASE `emoji_hell` 
    COLLATE 'utf8mb4_general_ci';
    
ALTER TABLE `post` 
    COLLATE = 'utf8mb4_general_ci';
    
ALTER TABLE `post`
    CHANGE COLUMN `caption` `caption` VARCHAR(255) NOT NULL COLLATE 'utf8mb4_general_ci',
    CHANGE COLUMN `user_id` `user_id` VARCHAR(255) NOT NULL COLLATE 'utf8mb4_general_ci';

This example works fine, until you reach a varchar column, used in a foreign key constraint.

1	SQL Error (1832): Cannot change column 'user_id': used in a foreign key constraint 'FK1misndtpfm9hx3ttvixdus8d1'

Now the foreign key check complains, that the two columns would not match afterwards.
In my scenario, disabling the check did not help, but for the sake of completeness, this is how it should work:

SET foreign_key_checks=0;

ALTER TABLE `post`
    CHANGE COLUMN `caption` `caption` VARCHAR(255) NOT NULL COLLATE 'utf8mb4_general_ci',
    CHANGE COLUMN `user_id` `user_id` VARCHAR(255) NOT NULL COLLATE 'utf8mb4_general_ci';
    
SET foreign_key_checks=1;

Anyway, I chose to walk the extra mile and drop all foreign keys, modify database, tables and columns and add new foreign keys again. Also, as I do not know, if the foreign key names are of any importance to the persistence framework, I decided I’ll preserve them.
It turned out, that this was another extra mile.

SET @post_user_fk_name = (
    SELECT constraint_name
    FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
    WHERE TABLE_NAME = 'post'
      AND COLUMN_NAME = 'user_id'
      AND CONSTRAINT_SCHEMA = 'emoji_hell'
      AND referenced_table_name = 'user'
      AND referenced_column_name = 'user_id');
      
SET @post_user_drop_fk_statement =
        concat('ALTER TABLE `post` DROP FOREIGN KEY `', @post_user_fk_name, '`');

PREPARE post_user_drop_fk_prep_statement FROM @post_user_drop_fk_statement;

EXECUTE post_user_drop_fk_prep_statement;

DEALLOCATE PREPARE post_user_drop_fk_prep_statement;

This sample extracts a foreign key name into a variable, which is then used in the drop as well as the add statement. It is then concatinated into an actual statement string, prepared into a prepared statement, executed and deallocated after use.

After doing the actual database modification, we can recreate the foreign keys in almost the same procedure:

SET @post_user_add_fk_statement =
        concat('ALTER TABLE `post` ADD CONSTRAINT `', @post_user_fk_name,
               '` FOREIGN KEY (`user_id`) REFERENCES `user` (`user_id`)');

PREPARE post_user_add_fk_prep_statement FROM @post_user_add_fk_statement;

EXECUTE post_user_add_fk_prep_statement;

DEALLOCATE PREPARE post_user_add_fk_prep_statement;

And, we’re done.
It is now possible to save unicorns to the database 🦄

Chapter 4 - Retrospective:

Yes, it was a pain, but actually a fun challenge. At least it was a reminder, to stay up to date with character sets and also forgotten things like SQL. The foreign key name preservation really took me quite a while, but I really wanted to avoid any further problems. Especially as having issues with the persistence layer not finding some arbitrary foreign key names sound like the are terrible to debug.