How to clean data using SQL

 What are possible causes of bad data that we may have to take in to account so we can focus on cleaning effort to make the data sparkling clean...

  • Duplicates
  • Truncated data
  • Extra spaces and character
  • Null data/Missing data
  • Mistyped numbers
  • Inconsistent string
  • Inconsistent date format
  • Misleading column names
  • Mismatched data type
  • Misspelt words

______________________________________________

Using DISTINCT clean duplicates and inconsistent values

Let's say if a product has only 4 colours and you find out that there are 6 colours in the colour column in the product table

SELECT DISTINCT colour

FROM product

Or, let's say that if colour column has number from 0 to 9 that represents different colours of the product and you want to find out if there is no double digit colour, then use LEN function to find the length of the numbers in the column

SELECT LEN(colour)

FROM product

_____________________________________

Using IS NULL, find the null values from any column

SELECT first_name,last_name

FROM customer

WHERE last_name IS NULL

This query will give us all first names where last names are missing

______________________________________________

Using SUBSTR, find the odd value out

Using SELECT SUBSTR(city,1,3) FROM table_name we can select only first three characters from city column

______________________________________________

Using TRIM, remove unwanted spaces from a string in a column

Below action will remove all leading and trailing spaces from first_name and last_name columns in customer table

UPDATE customer

 SET

        first_name = TRIM(first_name)

        last_name = TRIM(last_name);


______________________________________________

Once you find the correct last name, you can replace it using UPDATE function

UPDATE customer

SET last_name = 'Smith'

WHERE first_name = 'Roger'

If there are more than 1 roger in the list then, you can use customer_id instead of first_name to make this correction

______________________________________________

Often, we are provided with description of the data where you can see max and min values (Range) in a column

Check by using MAX(column_name) and MIN(column_name) to ensure that data des not have incorrect values

For example, if we know that age column in the students table can have values between 10 and 12, then using MAX(age) and MIN(age) we can find if there are any incorrect values

______________________________________________

Once we know there are values less or higher than the expected range, using COUNT function, we can find out how many of these students exist in the list with incorrect age

SELECT COUNT  *  

FROM students

WHERE age = 13 AND age = 9

______________________________________________

If it appears that there is a significant number of incorrect data in a column, then consult data source for further guideline...should we correct it(find the correct data or replace incorrect values with average values in a column) or delete it from the table using DELETE function

DELETE students

WHERE age = 13 and age = 9

______________________________________________

If 'age' column has been assigned string data type but the column has numbers, then using CAST function, we can change the data type of column 'age'

SELECT CAST(age AS integer) 

FROM students

ORDER BY CAST(age AS integer)

______________________________________________

Some times we need to combine data in 2 columns in 1 column, then we can use CONCAT

For example, date and time are in separate column and we want t combine them together in 1 column. 

SELECT CONCAT (column1,column2) AS new_column_name

FROM table_name

______________________________________________

Using COALESCE, we can avoid null values in a column by returning non null values in a list

In product table, we have product code in one column and product name in another column and in one row, you have product name available but product code missing. And we have been told that product code is not available at the moment but we can't remove that row as it is an important product.




So whenever data is pulled for product code, we want to show product in the product_id column instead of null value

SELECT COALESCE(product_code,product_name) 

FROM product


______________________________________________

Using CASE statement, we can replace incorrect string with corrected string

SELECT customer_id,

CASE

WHEN first_name = 'Jonh' THEN 'John'

WHEN first_name = 'Rajeev' THEN 'Rajiv'

END AS corrected_name

FROM table_name

______________________________________________




No comments:

Post a Comment

Complex query example

Requirement:  You are given the table with titles of recipes from a cookbook and their page numbers. You are asked to represent how the reci...