TechEarl

MySQL utf8 to utf8mb4 Migration: The Index-Length and ROW_FORMAT Trap

MySQL utf8 is a 3-byte alias for utf8mb3 that cannot store emoji or many CJK characters. utf8mb4 is the real UTF-8 and has been the default since MySQL 8.0. The migration looks like ALTER TABLE but bites on index length and ROW_FORMAT.

Ishan KarunaratneIshan Karunaratne⏱️ 12 min readUpdated
Migrate MySQL from utf8 to utf8mb4: convert databases, tables, and columns; fix the 767-byte index-length limit with ROW_FORMAT=DYNAMIC and innodb_large_prefix; restore connection charset.

MySQL's utf8 character set is a historical mistake. It is a 3-byte-per-codepoint subset of UTF-8 that cannot store characters in the Supplementary Multilingual Plane: emoji, many CJK ideographs, mathematical symbols, ancient scripts. The real 4-byte UTF-8 character set is utf8mb4, available since MySQL 5.5.3 (2010) and the default since MySQL 8.0 (2018). The utf8 name is now a deprecated alias for utf8mb3 and will eventually be removed. Migrating an existing database from utf8 to utf8mb4 is mechanical at the schema level but has one classic trap: indexes on long VARCHAR columns hit the 767-byte InnoDB index-length limit, because the per-character byte budget jumped from 3 to 4. The fix is ROW_FORMAT=DYNAMIC (which raises the limit to 3072 bytes), but you have to know to apply it. Below is the full migration with the index-length math and the connection-string fix.

How do I migrate a MySQL database from utf8 to utf8mb4?

The migration is a four-step procedure. First, convert the database default character set: ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci. This only affects new tables created without explicit charset clauses; existing tables are unchanged. Second, convert each existing table: ALTER TABLE my_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci. This rewrites the table and every column's data. Third, set ROW_FORMAT=DYNAMIC on InnoDB tables before the conversion if any indexed VARCHAR columns are larger than 191 characters — utf8mb4 needs 4 bytes per character so a VARCHAR(255) index would be 1020 bytes, blowing past the 767-byte limit on the legacy COMPACT row format. Fourth, fix the client connection charset: SET NAMES utf8mb4 in the client driver config so inserts actually send 4-byte UTF-8 over the wire. The conversion is online on 5.6+ but rewrites every byte; plan I/O accordingly.

Jump to:

Why utf8 was a mistake

The Unicode 4.0 spec (2003) covered codepoints up to U+10FFFF, encoded in UTF-8 as 1-4 bytes. MySQL added utf8 support around the same time but implemented only the 3-byte subset (codepoints up to U+FFFF), apparently for indexing reasons. By the time anyone noticed, the name was set and breaking it would have invalidated every CREATE TABLE in the wild.

The 3-byte subset misses:

  • All emoji (😀 is U+1F600, well above U+FFFF)
  • Many CJK ideographs in the Supplementary Ideographic Plane (CJK Extension B and later)
  • Most mathematical symbols, some musical notation
  • Most ancient scripts (cuneiform, hieroglyphs, runes)

An app that accepts user input over a utf8-encoded column truncates or errors on any of the above. The classic symptom for English-language SaaS in 2010-2018 was emoji in user names or comments causing Incorrect string value: '\xF0\x9F...' errors.

utf8mb4 is the actual UTF-8: variable-length 1-4 bytes, all of Unicode covered. It's the default for new databases on MySQL 8.0+. MariaDB switched the default to utf8mb4 in 10.6.

Pre-flight checklist

Before running ALTERs:

sql
-- Which databases are on utf8 / utf8mb3?
SELECT schema_name, default_character_set_name, default_collation_name
FROM information_schema.schemata
WHERE default_character_set_name IN ('utf8', 'utf8mb3');

-- Which tables have utf8 / utf8mb3 columns?
SELECT table_schema, table_name, COUNT(*) AS columns_to_fix
FROM information_schema.columns
WHERE character_set_name IN ('utf8', 'utf8mb3')
GROUP BY table_schema, table_name
ORDER BY columns_to_fix DESC;

-- Total size of tables that need conversion (estimate of I/O cost)
SELECT
    SUM(data_length + index_length) / 1024 / 1024 / 1024 AS total_gb
FROM information_schema.tables t
JOIN information_schema.columns c
    ON c.table_schema = t.table_schema
   AND c.table_name = t.table_name
WHERE c.character_set_name IN ('utf8', 'utf8mb3');

-- Indexed VARCHAR columns wider than 191 chars (will need ROW_FORMAT=DYNAMIC)
SELECT s.table_schema, s.table_name, s.index_name,
       c.column_name, c.character_maximum_length
FROM information_schema.statistics s
JOIN information_schema.columns c
    ON c.table_schema = s.table_schema
   AND c.table_name = s.table_name
   AND c.column_name = s.column_name
WHERE c.character_maximum_length > 191
  AND c.data_type IN ('varchar', 'char')
ORDER BY c.character_maximum_length DESC;

The last query is the most important. Every row in the output is an index that will fail on a naive CONVERT TO CHARACTER SET utf8mb4 unless you've set ROW_FORMAT=DYNAMIC first.

Take a logical backup, same procedure as the 5.7 to 8.0 migration.

Convert the database default

This only affects new tables. Existing tables in the database are unchanged.

sql
ALTER DATABASE mydb
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_0900_ai_ci;

On MySQL 5.7 (or MariaDB), use utf8mb4_unicode_ci instead of utf8mb4_0900_ai_ci since the 0900 collation requires 8.0.

Convert each table

For one table:

sql
ALTER TABLE my_table
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_0900_ai_ci;

CONVERT TO CHARACTER SET rewrites every row, updating the on-disk byte representation. It is not the same as DEFAULT CHARACTER SET:

SyntaxWhat it does
ALTER TABLE t DEFAULT CHARACTER SET utf8mb4Changes the default for future column additions only. Existing columns unchanged.
ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4Rewrites every existing column. The real conversion.

Most "I migrated to utf8mb4 but still can't store emoji" posts on Stack Overflow are people who ran the first form expecting it to do the second.

For a bulk-convert script over a whole database, the same trick as in the MyISAM to InnoDB conversion: generate the SQL, then execute it.

bash
# 1. Generate the ALTER statements
mysql -u root -p -N -e "
    SELECT CONCAT('ALTER TABLE \`', table_schema, '\`.\`', table_name,
        '\` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;')
    FROM information_schema.tables
    WHERE table_schema = 'mydb'
      AND table_collation NOT LIKE 'utf8mb4%';
" > convert-utf8mb4.sql

# 2. Review
head convert-utf8mb4.sql

# 3. Execute
mysql -u root -p --verbose mydb < convert-utf8mb4.sql 2>&1 | tee convert-utf8mb4.log

Filter table_schema = 'mydb' to one database at a time. Converting everything across all schemas in one shot tends to compound failures.

The 767-byte index-length trap

This is the trap. InnoDB had a 767-byte limit on the maximum index key prefix length on the legacy COMPACT and REDUNDANT row formats. With utf8 (3 bytes/char), a VARCHAR(255) indexed column needs 255 × 3 = 765 bytes — just under the limit. Tight, but it fits.

With utf8mb4 (4 bytes/char), the same VARCHAR(255) needs 255 × 4 = 1020 bytes — over the limit. The ALTER fails with:

code
ERROR 1709 (HY000): Index column size too large. The maximum column size is 767 bytes.

Or in some versions:

code
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes.

The historical workaround was VARCHAR(191) for indexed string columns — 191 × 4 = 764 bytes, just under the limit. You'll see this in old WordPress wpdb schemas and many tutorials. It's a hack to avoid the trap, not a real fix.

The real fix is ROW_FORMAT=DYNAMIC plus innodb_large_prefix=ON (the latter was the default since 5.7.7 and is required since 8.0 — the option was removed in 8.0).

Set ROW_FORMAT=DYNAMIC

Before the CONVERT TO CHARACTER SET:

sql
-- Check current ROW_FORMAT
SELECT name, row_format
FROM information_schema.innodb_tables
WHERE name LIKE 'mydb/%';

-- For each table not already DYNAMIC:
ALTER TABLE my_table ROW_FORMAT=DYNAMIC;

DYNAMIC (and COMPRESSED) raise the index prefix limit to 3072 bytes — enough for VARCHAR(768) on utf8mb4 — and store long variable-length columns off-page when needed. There's no real downside; it's been the default for new tables since MySQL 5.7.9.

On MySQL versions before 5.7.9, you also need:

ini
# /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
innodb_file_per_table = ON      # Required for DYNAMIC
innodb_file_format    = Barracuda
innodb_large_prefix   = ON

These options are removed in 8.0 because their old behavior is mandatory and there's no other path. Don't set them on 8.0+.

For the bulk conversion, do both in one pass:

sql
SELECT CONCAT('ALTER TABLE `', table_schema, '`.`', table_name,
    '` ROW_FORMAT=DYNAMIC, CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;')
FROM information_schema.tables
WHERE table_schema = 'mydb'
  AND (row_format != 'Dynamic' OR table_collation NOT LIKE 'utf8mb4%');

This sets ROW_FORMAT=DYNAMIC and runs the charset conversion in a single ALTER per table, which is faster than two separate ALTERs (one rewrite, not two).

Fix the connection charset

Schema conversion is half the work. The client connection charset also matters. If your application connects with SET NAMES utf8 (the historical default for many MySQL client libraries), every INSERT and SELECT translates between utf8 and the table charset on the fly — and 4-byte UTF-8 characters get truncated or rejected on the way in.

Fix it once per connection:

sql
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Better: configure it at the driver level so every new connection starts in utf8mb4.

StackWhere
PHP / mysqli$mysqli->set_charset('utf8mb4') after connect
PHP / PDO_MySQLDSN: mysql:host=...;dbname=...;charset=utf8mb4
Node.js / mysql2createConnection({ charset: 'utf8mb4_unicode_ci' })
Python / mysqlclientMySQLdb.connect(charset='utf8mb4')
Python / PyMySQLpymysql.connect(charset='utf8mb4')
Java / JDBCURL param: ?characterEncoding=utf8&useUnicode=true
Go / go-sql-driverDSN: ?charset=utf8mb4&collation=utf8mb4_unicode_ci
Rails / mysql2 adapterdatabase.yml: encoding: utf8mb4
WordPresswp-config.php: define('DB_CHARSET', 'utf8mb4') (default since WP 4.2)

Server-side default (every new connection without an explicit SET NAMES):

ini
[mysqld]
character_set_server   = utf8mb4
collation_server       = utf8mb4_0900_ai_ci

Choose the right collation

Two reasonable choices on MySQL 8.0+:

  • utf8mb4_0900_ai_ci — Unicode 9.0, accent-insensitive, case-insensitive. The default for new 8.0 databases. Fast, modern, handles emoji ordering correctly.
  • utf8mb4_unicode_ci — Unicode 4.0, accent-insensitive, case-insensitive. The portable older default. Use on MariaDB and on MySQL 5.7 where 0900 isn't available.

Avoid:

  • utf8mb4_general_ci — buggy comparison logic; faster than the others but produces wrong results for non-Latin alphabets.
  • utf8mb4_bin — binary comparison; case-sensitive, accent-sensitive. Only use deliberately for case-sensitive identifier columns.

Mixing collations across tables joined by a foreign key causes implicit conversions on JOIN and can prevent index use. Pick one collation for the database and stick with it.

Verification

After conversion:

sql
-- All schemas should be utf8mb4
SELECT schema_name, default_character_set_name
FROM information_schema.schemata
WHERE default_character_set_name != 'utf8mb4'
  AND schema_name NOT IN ('mysql','sys','information_schema','performance_schema');

-- All tables should be utf8mb4
SELECT table_schema, table_name, table_collation
FROM information_schema.tables
WHERE table_collation NOT LIKE 'utf8mb4%'
  AND table_schema NOT IN ('mysql','sys','information_schema','performance_schema');

-- No utf8 or utf8mb3 columns remain
SELECT table_schema, table_name, column_name, character_set_name
FROM information_schema.columns
WHERE character_set_name IN ('utf8', 'utf8mb3');

-- Pick a representative table and test an emoji insert/read
CREATE TABLE _emoji_test (
    name VARCHAR(255)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
INSERT INTO _emoji_test VALUES ('Hello 🌍 world');
SELECT * FROM _emoji_test;
-- Should return: 'Hello 🌍 world'
DROP TABLE _emoji_test;

If the emoji test fails (truncated to "Hello " or shows mojibake), the issue is the connection charset, not the table charset. Set SET NAMES utf8mb4 in the client and re-test.

What to do next

After utf8mb4 migration:

External references:

FAQ

TagsMySQLutf8utf8mb4Character SetMigrationDatabase AdministrationUnicode
Share
Ishan Karunaratne

Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years across software, Linux systems, DevOps, and infrastructure — and a more recent focus on AI. Currently Chief Technology Officer at a tech startup in the healthcare space.

Keep reading

Related posts

Convert every MyISAM table to InnoDB: why MyISAM is dead (no transactions, no FKs, crash-unsafe), the ALTER TABLE syntax, FULLTEXT considerations, and a script to convert all tables in a database.

MySQL MyISAM to InnoDB Conversion: Why and How

MyISAM has no transactions, no foreign keys, and corrupts on crash. InnoDB has been the MySQL default since 5.5 and is the only engine that gets new features. Here is how to convert every MyISAM table in a MySQL database with a single SQL script.