databases mysql

Removing Duplicate Entries

Written on 9th Aug 2022

In this article, I will show you how we can identify duplicate data in a mysql or maria db database and how we can remove the duplicates. Although we can set constraints to make certain columns accept unique fields, in this example we are going to leave these contraints out as in the real world sometimes these aren't added in or we're working with legacy data.

Let's create our test database table. It is best to use a fresh database server for this, but if you cannot, please ensure you haven't already got a table called duplicate_users_test as our script will delete it.

DROP TABLE IF EXISTS duplicate_users_test;
 
CREATE TABLE duplicate_users_test
(
    id         INT PRIMARY KEY AUTO_INCREMENT,
    first_name VARCHAR(50)  NOT NULL,
    last_name  VARCHAR(50)  NOT NULL,
    email      VARCHAR(255) NOT NULL
);

Now the table is created, we'll insert the following data. You will note that I have commented where the duplicates are in the data.

INSERT INTO duplicate_users_test (first_name, last_name, email)
VALUES ('Craig', 'Robertson', 'craig.robertson@example.com'),
       ('Stefan', 'Taylor', 'stefan.taylor@example.com'),
       ('Abbie', 'Murphy', 'abbie.murphy@example.com'),
       ('Abbie', 'Murphy', 'abbie.murphy@example.com'), -- Duplicate
       ('Abbie', 'Murphy', 'abbie.murphy@example.com'), -- Duplicate
       ('Fiona', 'Jones', 'fiona.jones@example.com'),
       ('Dave', 'Carter', 'dave.carter@example.com'),
       ('Harry', 'Palmer', 'harry.palmer@example.com'),
       ('Harry', 'Palmer', 'harry.palmer@example.com'), -- Duplicate
       ('Tara', 'Johnson', 'tara.johnson@example.com'),
       ('Jack', 'Brown', 'jack.brown@example.com'),
       ('Abbie', 'Cox', 'abbie.cox@example.com'),
       ('Lilly', 'Thompson', 'lilly.thompson@example.com');

Finding the duplicates

Now we've got some data inserted into the database we can start querying it. It is quite possible that you won't know if there are duplicates in your database table but finding instances of duplicate data is not to hard to achieve.

Let's see if we have any duplicate data in our email column. We will select the email column and also a count of each row found from the duplicate_users_test table. We will group by the email column so only one row per unique email address will show which will make our count work correctly. And then lastly, we will use a having clause (which is similar to where) to say only show us records which have a count over 1, i.e. email addresses which occur more than once.

SELECT email,
       COUNT(*) as `count`
FROM duplicate_users_test
GROUP BY email
HAVING COUNT(email) > 1;

Which outputs the following;

+--------------------------+-------+
| email                    | count |
+--------------------------+-------+
| abbie.murphy@example.com |     3 |
| harry.palmer@example.com |     2 |
+--------------------------+-------+

Removing duplicates

There maybe times when you just don't want to have duplicate information in your database. For example, let's assume our duplicate_users_test table is responsible for holding email addresses to send out our monthly newsletter. Abbie and Harry would receive 3 and 2 copies each respectively.

There are several ways we can remove duplicates. Let's take a look at them;

Remove duplicate rows using DELETE JOIN statement

DELETE users1 FROM duplicate_users_test users1
INNER JOIN duplicate_users_test users2
WHERE
    users1.id > users2.id AND
    users1.email = users2.email;

Remove duplicates using ROW_NUMBER() function

⚠️ To use the ROW_NUMBER() function you will need MySQL version 8.02 or greater.

ROW_NUMBER() assigns a unique number to each row to which it is applied. In this instance we are saying give us a row number using email as a partition.

If we do a select

SELECT
    id,
    email,
    ROW_NUMBER() OVER (
        PARTITION BY email
        ORDER BY email
    ) AS row_num
FROM duplicate_users_test;

you will now see we have a row number on our duplicates which increases by one each entry.

+----+-----------------------------+---------+
| id | email                       | row_num |
+----+-----------------------------+---------+
| 12 | abbie.cox@example.com       |       1 |
|  3 | abbie.murphy@example.com    |       1 |
|  4 | abbie.murphy@example.com    |       2 | ← increments by 1
|  5 | abbie.murphy@example.com    |       3 | ← increments by 1
|  1 | craig.robertson@example.com |       1 |
|  7 | dave.carter@example.com     |       1 |
|  6 | fiona.jones@example.com     |       1 |
|  8 | harry.palmer@example.com    |       1 |
|  9 | harry.palmer@example.com    |       2 | ← increments by 1
| 11 | jack.brown@example.com      |       1 |
| 13 | lilly.thompson@example.com  |       1 |
|  2 | stefan.taylor@example.com   |       1 |
| 10 | tara.johnson@example.com    |       1 |
+----+-----------------------------+---------+

So now we can use the the row_num to delete anything which is over 1 to remove the duplicates.

DELETE
FROM duplicate_users_test
WHERE id IN (SELECT id
             FROM (SELECT id,
                          ROW_NUMBER() OVER (
                              PARTITION BY email
                              ORDER BY email) AS row_num
                   FROM duplicate_users_test) t
             WHERE row_num > 1);

Code highlighting provided by torchlight.dev.