What are possible causes of bad data that we may have to take in to account so we can focus on cleaning effort to make the data sparkling clean...
- Duplicates
- Truncated data
- Extra spaces and character
- Null data/Missing data
- Mistyped numbers
- Inconsistent string
- Inconsistent date format
- Misleading column names
- Mismatched data type
- Misspelt words
______________________________________________
Using DISTINCT clean duplicates and inconsistent values
Let's say if a product has only 4 colours and you find out that there are 6 colours in the colour column in the product table
SELECT DISTINCT colour
FROM product
Or, let's say that if colour column has number from 0 to 9 that represents different colours of the product and you want to find out if there is no double digit colour, then use LEN function to find the length of the numbers in the column
SELECT LEN(colour)
FROM product
_____________________________________
Using IS NULL, find the null values from any column
SELECT first_name,last_name
FROM customer
WHERE last_name IS NULL
This query will give us all first names where last names are missing
______________________________________________
Using SUBSTR, find the odd value out
Using SELECT SUBSTR(city,1,3) FROM table_name we can select only first three characters from city column
______________________________________________
Using TRIM, remove unwanted spaces from a string in a column
Below action will remove all leading and trailing spaces from first_name and last_name columns in customer table
UPDATE customer
SET
first_name = TRIM(first_name)
last_name = TRIM(last_name);
______________________________________________
Once you find the correct last name, you can replace it using UPDATE function
UPDATE customer
SET last_name = 'Smith'
WHERE first_name = 'Roger'
If there are more than 1 roger in the list then, you can use customer_id instead of first_name to make this correction
______________________________________________
Often, we are provided with description of the data where you can see max and min values (Range) in a column
Check by using MAX(column_name) and MIN(column_name) to ensure that data des not have incorrect values
For example, if we know that age column in the students table can have values between 10 and 12, then using MAX(age) and MIN(age) we can find if there are any incorrect values
______________________________________________
Once we know there are values less or higher than the expected range, using COUNT function, we can find out how many of these students exist in the list with incorrect age
SELECT COUNT *
FROM students
WHERE age = 13 AND age = 9
______________________________________________
If it appears that there is a significant number of incorrect data in a column, then consult data source for further guideline...should we correct it(find the correct data or replace incorrect values with average values in a column) or delete it from the table using DELETE function
DELETE students
WHERE age = 13 and age = 9
______________________________________________
If 'age' column has been assigned string data type but the column has numbers, then using CAST function, we can change the data type of column 'age'
SELECT CAST(age AS integer)
FROM students
ORDER BY CAST(age AS integer)
______________________________________________
Some times we need to combine data in 2 columns in 1 column, then we can use CONCAT
For example, date and time are in separate column and we want t combine them together in 1 column.
SELECT CONCAT (column1,column2) AS new_column_name
FROM table_name
______________________________________________
Using COALESCE, we can avoid null values in a column by returning non null values in a list
In product table, we have product code in one column and product name in another column and in one row, you have product name available but product code missing. And we have been told that product code is not available at the moment but we can't remove that row as it is an important product.
So whenever data is pulled for product code, we want to show product in the product_id column instead of null value
SELECT COALESCE(product_code,product_name)
FROM product
______________________________________________
Using CASE statement, we can replace incorrect string with corrected string
SELECT customer_id,
CASE
WHEN first_name = 'Jonh' THEN 'John'
WHEN first_name = 'Rajeev' THEN 'Rajiv'
END AS corrected_name
FROM table_name
______________________________________________
No comments:
Post a Comment