TechEarl

How to Find Duplicate Rows in MySQL

Find duplicate rows in MySQL with GROUP BY HAVING, a ROW_NUMBER window function, or a self-join. Includes NULL behaviour, soft duplicates, and the right index.

Ishan KarunaratneIshan Karunaratne⏱️ 14 min readUpdated
Macro photograph of two near-identical brass keys laid side by side on a dark slate surface under a single warm side light, slight reflection on the metal, shallow depth of field, moody editorial mood

Three reliable ways to find duplicate rows in MySQL: a GROUP BY ... HAVING query (the classic, works on every version), a ROW_NUMBER() window function (MySQL 8.0 and up, cleanest setup if you plan to delete the duplicates next), and a self-join on the duplicate key (useful when you need every individual row, not just one per group). Pick the one that matches your MySQL version and what you plan to do with the result.

One of the first messy tables I ever inherited was a customers table that had been written without a unique constraint on email. After a few years of CSV imports and re-imports from various marketing tools, the same address showed up dozens of times. Before I could add the constraint, I needed to figure out which rows were the duplicates and which copy to keep. This is the SQL I have come back to every couple of years since, in three flavours, with the MySQL version each one needs spelled out so the right query for your server is obvious.

The example table

A small customers table is enough to demonstrate every technique. A few rows are deliberate duplicates on email, plus one with trailing whitespace and one with a different case so we can talk about soft duplicates later.

SQL
CREATE TABLE customers (
  id          INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
  first_name  VARCHAR(50)  NOT NULL,
  last_name   VARCHAR(50)  NOT NULL,
  email       VARCHAR(255) NOT NULL,
  phone       VARCHAR(20),
  created_at  DATETIME     NOT NULL DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO customers (first_name, last_name, email, phone) VALUES
('Mei',    'Tanaka',   'mei.tanaka@example.com',     '555-0142'),
('Mei',    'Tanaka',   'mei.tanaka@example.com',     '555-0142'),
('James',  'Walker',   'james.walker@example.com',   NULL),
('Emma',   'Carter',   'emma.carter@example.com',    '555-0118'),
('Emma',   'Carter',   ' emma.carter@example.com ',  '555-0118'),
('Sofia',  'Marek',    'sofia.marek@example.com',    '555-0177'),
('Sofia',  'Marek',    'SOFIA.MAREK@example.com',    '555-0177'),
('Daniel', 'Brooks',   'daniel.brooks@example.com',  NULL),
('Daniel', 'Brooks',   'daniel.brooks@example.com',  NULL),
('Lukas',  'Becker',   'lukas.becker@example.com',   '555-0102');
Try it with your own values

Jump to:

How do I find duplicate rows in MySQL?

Three queries, pick by what you need next. For a quick audit ("how many duplicates do I have?"), use GROUP BY ... HAVING COUNT(*) > 1: SELECT email, COUNT(*) AS dup_count FROM customers GROUP BY email HAVING COUNT(*) > 1 ORDER BY dup_count DESC;. Works on every MySQL version, returns one row per duplicate value with its count. To get every individual duplicate row tagged with a position so the next step is a DELETE, use the MySQL 8.0+ window function form: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn FROM customers) ranked WHERE rn > 1;. On MySQL 5.7 and earlier where window functions do not exist, use the self-join: SELECT a.* FROM customers a JOIN customers b ON a.email = b.email AND a.id != b.id. Add an index on the duplicate-key column before running any of these on a large table; without it, Method 1 is full-scan, the window function buffers in memory, and the self-join goes quadratic.

Method 1: GROUP BY with HAVING

The shortest correct query. Works in MySQL 5.5, 5.6, 5.7, 8.0, and 8.4 (and anything older still in production).

SQL
SELECT email, COUNT(*) AS dup_count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY dup_count DESC;

Two things to read out of the result:

  • Each row is a duplicate value, with a count of how many rows in the table carry it.
  • The line that does the filtering is HAVING COUNT(*) > 1. A common first guess is to write WHERE COUNT(*) > 1 instead, but that fails with an error because WHERE runs before GROUP BY, so the aggregate COUNT(*) does not exist yet at that point. HAVING is the filter that runs after aggregation, which is when aggregates are available.

To find duplicates across more than one column, list every column you care about in the GROUP BY:

Replace in this example:
SQL
SELECT first_name, last_name, email, COUNT(*) AS dup_count
FROM customers
GROUP BY first_name, last_name, email
HAVING COUNT(*) > 1;

Why are the same columns repeated in SELECT and GROUP BY? MySQL requires every non-aggregate column in SELECT to also appear in GROUP BY. This is the default since MySQL 5.7.5 (via ONLY_FULL_GROUP_BY in sql_mode). The reason: GROUP BY collapses many rows into one group, so MySQL needs to know which value to use for each column. The columns you grouped on are guaranteed identical within a group; columns you didn't group on are ambiguous, so the query errors out rather than picking arbitrarily.

This is the one to start with for an audit. It tells you how many duplicate groups exist and how bad the worst offenders are. What it does not give you is the row id of each individual duplicate; for that, see Method 2 or 3.

Method 2: Window functions (MySQL 8.0+)

If you are on MySQL 8.0 (released April 2018) or newer, including 8.1, 8.2, 8.3, and 8.4, the cleanest query uses ROW_NUMBER():

SQL
SELECT id, first_name, last_name, email, rn
FROM (
SELECT id, first_name, last_name, email,
       ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM customers
) AS ranked
WHERE rn > 1;

Line by line: the inner query gives every row a number starting at 1 inside its email group, with the smallest id getting 1. The outer query keeps only the rows numbered 2 or higher, which is exactly "every duplicate, but not the one you would keep". This is the form you want when the next step is to remove the duplicates: How to Delete Duplicate Rows in MySQL shows how to wrap this pattern in a DELETE with safety checks, dry-run verification, and rollback.

Swap ORDER BY id for ORDER BY created_at if you want the oldest row in each group to be the keeper; swap to ORDER BY id DESC to keep the newest. The PARTITION BY clause is what makes the numbering restart for each group.

This query will fail on MySQL 5.5, 5.6, and 5.7 with a syntax error. Window functions were the headline feature of 8.0; before that, you need Method 1 or Method 3.

Method 3: Self-join

When you need every duplicate row with its full set of columns, and you are on MySQL 5.5, 5.6, 5.7, or any version that does not yet have window functions, join the table to itself on the duplicate key. The same query also works on 8.0 and up.

SQL
SELECT a.id, a.first_name, a.last_name, a.email
FROM customers a
JOIN customers b
ON a.email = b.email
AND a.id != b.id
ORDER BY a.email, a.id;

Every row in the result is one where another row exists with the same :column value but a different id. Conceptually similar to Method 2, but works on every MySQL version that supports JOIN (so, all of them).

The catch is performance: this is an O(n²) operation if MySQL has to scan the full table on each side of the join. On a 10,000 row table with no index on the join column you will wait. With a proper index it stays fast. More on that below.

Comparison: which method to pick

MethodWorks onReturnsUse when
GROUP BY + HAVINGAll versionsOne row per duplicate value, with countQuick audit. "How bad is it."
ROW_NUMBER8.0 and newerEvery duplicate row tagged with a positionYou are about to DELETE or UPDATE the dupes
Self-joinAll versionsEvery duplicate row with full columnsYou are on 5.7, or you want both rows of every pair

The other useful trick is EXISTS when all you need is a boolean ("does this table have any duplicates?"):

SQL
SELECT EXISTS(
SELECT 1
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
) AS has_duplicates;

Returns 1 if any duplicates exist, 0 if not. Useful in monitoring queries and pre-migration sanity checks.

How NULL behaves in duplicate checks

The edge case that bites every junior dev once. GROUP BY puts all NULL values into one bucket and counts them as duplicates of each other. So if your column is nullable and a few rows have NULL, the query above reports them as one duplicate group.

A UNIQUE constraint in MySQL does the opposite: it allows multiple NULL rows in a unique column, because two NULLs are not considered equal under the constraint. So you can have a table where "find duplicates" reports a NULL group, but ALTER TABLE ... ADD UNIQUE still succeeds. Both behaviours are correct, and both have caught me at least once.

If you want the duplicate-finding query to skip NULLs entirely, add a WHERE clause:

SQL
SELECT email, COUNT(*) AS dup_count
FROM customers
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1;

If you want NULLs treated as their own value but distinct from each other (so they never collide), there is no clean SQL for it. You normally handle that at the application layer, or by normalising NULLs to a sentinel value before the audit.

Soft duplicates: case and whitespace

Look at the sample data again. ' emma.carter@example.com ' has a leading and trailing space; 'SOFIA.MAREK@example.com' is the same address as Sofia's other row in different case. Method 1 will not flag either as duplicates because byte-for-byte they are different strings.

To catch case differences, normalise both sides:

SQL
SELECT LOWER(email) AS norm, COUNT(*) AS dup_count
FROM customers
GROUP BY LOWER(email)
HAVING dup_count > 1;

To catch whitespace:

SQL
SELECT TRIM(email) AS norm, COUNT(*) AS dup_count
FROM customers
GROUP BY TRIM(email)
HAVING dup_count > 1;

Combine them when you suspect both:

SQL
SELECT LOWER(TRIM(email)) AS norm, COUNT(*) AS dup_count
FROM customers
GROUP BY norm
HAVING dup_count > 1;

There is a real cost here: wrapping the column in LOWER() or TRIM() makes MySQL unable to use an index on email, because the indexed values are not the function outputs. On large tables this turns a millisecond query into a multi-second one. Two ways out:

  1. Fix the data once, lock it down. Run a single migration that lowercases and trims every existing row, then apply normalisation in the application layer going forward.
  2. Use a case-insensitive collation on the column (for example utf8mb4_0900_ai_ci, the default in 8.0+). The collation handles case automatically and the index is still used.

I default to option 1 unless the column already has a _ci collation. For more complex matching where LOWER() and TRIM() aren't enough — pattern-style fuzzy detection like "addresses that differ only in subdomain" or "values that contain a specific substring" — MySQL's REGEXP operator lets you run regular expressions against column values; the Regex Cheat Sheet covers the pattern syntax that works inside SQL.

Performance and the index that matters

The single most important optimisation for any of these queries is an index on the column you are grouping or partitioning by.

SQL
CREATE INDEX idx_:table_email ON customers (email);

EXPLAIN before:

code
+----+-------------+-----------+------+---------------+-------+
| id | select_type | table     | type | possible_keys | rows  |
+----+-------------+-----------+------+---------------+-------+
|  1 | SIMPLE      | customers | ALL  | NULL          | 80213 |
+----+-------------+-----------+------+---------------+-------+

EXPLAIN after:

code
+----+-------------+-----------+-------+---------------------+-------+
| id | select_type | table     | type  | possible_keys       | rows  |
+----+-------------+-----------+-------+---------------------+-------+
|  1 | SIMPLE      | customers | index | idx_customers_email | 80213 |
+----+-------------+-----------+-------+---------------------+-------+

type: ALL is a full table scan. type: index is a fast index walk that hands the grouped values to MySQL pre-sorted. On the 80k-row customers table I started with, adding the index dropped the GROUP BY HAVING query from about 380 ms to under 20 ms on cold caches.

For the window-function version, a covering index that includes both the partition column and the order column lets MySQL avoid sorting in memory:

SQL
CREATE INDEX idx_:table_:column_id ON customers (email, id);

For the self-join version, the index on the join column is mandatory in practice. Without it MySQL does a nested-loop join that scales quadratically with table size, and the query goes from fast to "I'll come back tomorrow".

What to do next

Now that you have the list of duplicate rows, the natural next step is removing the extras while keeping one row per group. The same ROW_NUMBER pattern from Method 2 wraps cleanly into a DELETE, including the safest way to do it on a table with foreign keys: the full deletion walkthrough with dry-run, transaction wrapping, and adding the unique constraint afterwards.

If the duplicates are spread across two tables (rows in table A that have no equivalent in table B, or vice versa), How to Find Rows in One MySQL Table Not in Another covers the anti-join patterns. The foundation for understanding why a JOIN can synthesise duplicate rows in the first place is How to JOIN Two MySQL Tables.

If the duplicates arrived through a CSV import, How to Import a CSV File Into MySQL covers the staging-table pattern that prevents them from entering the main table in the first place; How to Export a MySQL Table to CSV is the matching export walkthrough. For a full backup before running any cleanup, How to Export or Back Up All MySQL Databases is the mysqldump --single-transaction --routines --events pattern.

For schema changes around the dedup (adding the column the unique key will live on, widening a key column to fit a deduplicated string), How to Add a Column to a MySQL Table, How to Change a MySQL Column Type, and the MySQL Field Types and Sizes reference cover those operations. If the duplicates involve DATETIME rounding, How to Format a MySQL DATETIME in a SELECT explains the precision differences.

For nightly duplicate audits driven from a cron job, Bash for loop and Bash while loop are the wrapper patterns.

For the wider MySQL command reference covering logins, dumps, joins, indexes, and string functions in one place, see the MySQL Cheat Sheet.

External references: official MySQL docs on GROUP BY and HAVING and window function syntax.

FAQ

TagsMySQLSQLDatabaseDuplicate DetectionGROUP BYWindow Functions
Share
Ishan Karunaratne

Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years across software, Linux systems, DevOps, and infrastructure — and a more recent focus on AI. Currently Chief Technology Officer at a tech startup in the healthcare space.

Keep reading

Related posts

Macro photograph of a stack of paper documents on a dark slate desk with a single sheet pulled out and crumpled to the side, warm amber side light

How to Delete Duplicate Rows in MySQL

Delete duplicate rows in MySQL while keeping one per group, using DELETE JOIN, ROW_NUMBER with CTE, or the safe temp-table swap. With dry-run, transactions, and rollback.

Macro photograph of a paper invoice with carbon-copy receipts being fed into an old card-reader on a dark slate desk, warm amber side lamp

How to Import a CSV File Into MySQL

Import a CSV into MySQL using LOAD DATA INFILE, LOAD DATA LOCAL INFILE, or the mysqlimport command. Covers header rows, encoding, the FILE privilege, and broken CSVs.