In this article, I will show you how we can identify duplicate data in a mysql or maria db database and how we can remove the duplicates. Although we can set constraints to make certain columns accept unique fields, in this example we are going to leave these contraints out as in the real world sometimes these aren't added in or we're working with legacy data.
Let's create our test database table. It is best to use a fresh database server for this, but if you cannot, please ensure you haven't already got a table called duplicate_users_test
as our script will delete it.
DROP TABLE IF EXISTS duplicate_users_test; CREATE TABLE duplicate_users_test( id INT PRIMARY KEY AUTO_INCREMENT, first_name VARCHAR(50) NOT NULL, last_name VARCHAR(50) NOT NULL, email VARCHAR(255) NOT NULL);
Now the table is created, we'll insert the following data. You will note that I have commented where the duplicates are in the data.
INSERT INTO duplicate_users_test (first_name, last_name, email)VALUES ('Craig', 'Robertson', 'craig.robertson@example.com'), ('Stefan', 'Taylor', 'stefan.taylor@example.com'), ('Abbie', 'Murphy', 'abbie.murphy@example.com'), ('Abbie', 'Murphy', 'abbie.murphy@example.com'), -- Duplicate ('Abbie', 'Murphy', 'abbie.murphy@example.com'), -- Duplicate ('Fiona', 'Jones', 'fiona.jones@example.com'), ('Dave', 'Carter', 'dave.carter@example.com'), ('Harry', 'Palmer', 'harry.palmer@example.com'), ('Harry', 'Palmer', 'harry.palmer@example.com'), -- Duplicate ('Tara', 'Johnson', 'tara.johnson@example.com'), ('Jack', 'Brown', 'jack.brown@example.com'), ('Abbie', 'Cox', 'abbie.cox@example.com'), ('Lilly', 'Thompson', 'lilly.thompson@example.com');
Now we've got some data inserted into the database we can start querying it. It is quite possible that you won't know if there are duplicates in your database table but finding instances of duplicate data is not to hard to achieve.
Let's see if we have any duplicate data in our email column. We will select the email column and also a count of each row found from the duplicate_users_test table. We will group by the email column so only one row per unique email address will show which will make our count work correctly. And then lastly, we will use a having
clause (which is similar to where) to say only show us records which have a count over 1, i.e. email addresses which occur more than once.
SELECT email, COUNT(*) as `count`FROM duplicate_users_testGROUP BY emailHAVING COUNT(email) > 1;
Which outputs the following;
+--------------------------+-------+| email | count |+--------------------------+-------+| abbie.murphy@example.com | 3 || harry.palmer@example.com | 2 |+--------------------------+-------+
There maybe times when you just don't want to have duplicate information in your database. For example, let's assume our duplicate_users_test
table is responsible for holding email addresses to send out our monthly newsletter. Abbie and Harry would receive 3 and 2 copies each respectively.
There are several ways we can remove duplicates. Let's take a look at them;
DELETE users1 FROM duplicate_users_test users1INNER JOIN duplicate_users_test users2WHERE users1.id > users2.id AND users1.email = users2.email;
⚠️ To use the ROW_NUMBER() function you will need MySQL version 8.02 or greater.
ROW_NUMBER() assigns a unique number to each row to which it is applied. In this instance we are saying give us a row number using email as a partition.
If we do a select
SELECT id, email, ROW_NUMBER() OVER ( PARTITION BY email ORDER BY email ) AS row_numFROM duplicate_users_test;
you will now see we have a row number on our duplicates which increases by one each entry.
+----+-----------------------------+---------+| id | email | row_num |+----+-----------------------------+---------+| 12 | abbie.cox@example.com | 1 || 3 | abbie.murphy@example.com | 1 || 4 | abbie.murphy@example.com | 2 | ← increments by 1| 5 | abbie.murphy@example.com | 3 | ← increments by 1| 1 | craig.robertson@example.com | 1 || 7 | dave.carter@example.com | 1 || 6 | fiona.jones@example.com | 1 || 8 | harry.palmer@example.com | 1 || 9 | harry.palmer@example.com | 2 | ← increments by 1| 11 | jack.brown@example.com | 1 || 13 | lilly.thompson@example.com | 1 || 2 | stefan.taylor@example.com | 1 || 10 | tara.johnson@example.com | 1 |+----+-----------------------------+---------+
So now we can use the the row_num to delete anything which is over 1 to remove the duplicates.
DELETEFROM duplicate_users_testWHERE id IN (SELECT id FROM (SELECT id, ROW_NUMBER() OVER ( PARTITION BY email ORDER BY email) AS row_num FROM duplicate_users_test) t WHERE row_num > 1);