Data Analytics: How to find duplicate values in SQL using window function

We are given email table which includes user_id, user_name and email

and our task is to find out duplicate record from this table.

One way to do this would be by using a simple GROUP BY function

Since user_id is a primary key and user_name is supposed to be unique for each customer, we can check if there are any duplicates there.

First, Let's check for user_id

There are no duplicates found in user_id column.

Now, let's check user_name column

We found duplicate user_name Keeny so that is the record we will check and find out why is it duplicate and make correction if required(same user_name but different email address) or remove it from the list (same user_name and email address)

How can we get this result using windows function?

Idea:

We can add row number column and assign row numbers to these records. Row numbers are going to be in order from 1 to 5 (sale as user_id).

And then if we partition them by user_name (which should be unique for each user), and if there are any duplicate user_name then row number would be different for same user_name.

Something like this...

And now, it is an easy task to pick a row number the one which is not unique

So we know that Keeny is a duplicate user_name.

Using this method we can find any duplicates in the column. Just have to make sure that the column that we are taking as a reference (user_name in this case) is supposed to have unique records.

In this table, user_id is also unique and user_name is also unique. But after checking for duplicates in user_id, we could not get any duplicates so we checked for the other unique value column and found duplicate.

In case if we did not get duplicate using user_name then we could check duplicate in email column also as each user_name is supposed to have a unique email address.

Data Analytics

Pages

How to find duplicate values in SQL using window function

No comments:

Post a Comment

Complex query example

About Me