Basic data cleaning steps in EXCEL
When we get the data spreadsheet, after understanding what data is included in the spreadsheet and its meaning, we now have to ensure that data is complete and proper for further analysis.
Consider few things that can make your task easier and will ensure that you don't miss any important thing.
- Size of the data(Rows/Columns) and number of spreadsheets
- Contents of the data...what is the data about
- Are you able to quickly see anything related in these columns?
- Missing data (Empty cell)
- Unformatted data (Extra spaced in cell)
- Different data type (Example: 75 in one row and seventy five in another)
In EXCEL, we can follow some basic steps to ensure that data is aligned in each cells.
- Find empty cells
- Remove duplicates
- Update date format
- Text to column
- Use COUNTIF to find outliers or abnormal values
- Use LEN to ensure character lengths are in order
- Use LEFT or RIGHT to separate text as per requirement
- Use MID to separate middle characters from alpha numeric characters
- Use CONCATENATE to join text in 2 different columns
- Use TRIM to remove extra spaces from cells
- Use spelling check function to ensure text is spelled correctly
- Plotting- Sometimes, when you want to take a high level view of the numbers presented in the data, highlight the column and plot it. This way if there is any odd number out there, you would know immediately and then can ask the data source to check if the value is right or not.
- Use Data Mapping if you have data in 2 different spreadsheets and you need to combine them together. Make sure that both have similar columns...For example, in one spreadsheet, city is included in the address column along with street number and address. In another spreadsheet, city is in a separate column and address column has street number and name. You need to use CONCATENATE or text to column and make sure both spreadsheet have similar columns before combining the data.
- Also, one spreadsheet has province column that has short abbreviation of province (ON) but another spreadsheet has province name written in full(Ontario)
- Use FIND/REPLACE function to correct misspelt data. This is required for names of customer or product names where spell check may not find any errors
How to use pivot table to identify insights in EXCEL
We know that pivot tables are useful to segregate and group data in a meaningful way so we can identify insights and generate new questions and answers for the data.
Let's take an example to understand in a simple way how it is done
We are given a data for movies.
We are given some basic questions by stakeholders to start with the analysis.
- How much revenue generated each year from movies?
- What is the average revenue per movie each year?
That's a good direction for a start. Let's create a pivot table to find out answers.

2015 has the lowest average revenue per movie...why?
Now we will focus on 2015 data.
It is evident that 2015 had released higher number of movies (124) than any other year. Hence an obvious thought would be that it reduced the average because of that.
But is it possible that it has higher number of movies with low box office collection?
We don't know yet so let's try to find that out.
I took $10 million as a round figure from the box office revenue column to find out how many movies in out list had collection of less than $10 million...
Why I took $10 million limit for filtering data?
I can ask the stake holder to quantify what they consider as low revenue movie based on industry standard. Or (as I did in this case) I figured out that all movies less than $10 million revenue were breakeven for investors.
Meaning, average budget for movies with revenue less than $10 million was approx. $10 million. So investors did not make any money if revenue was less than $10 million.
That's why I took $10 million as a round figure to establish a movie as low revenue movie.
Using filter function in pivot table, I have filtered count of all movies with less than $10 million revenue.
Now with the support of data, I can say with confidence that 2015 had higher percentage (16.13%) of low revenue movies and that's why average revenue per movie was lowest in that year.
How to use SUMPRODUCT function in EXCEL
SUMPRODUCT can save us lot of time as it does calculation in multiple ways.
We are given data for Quantity, Price and Margin and we want to find out the total revenue and total profit
Total revenue = SUM(Quantity X Unit Price)
Company's total revenue is a sum of revenues from all products.
We can do this in EXCEL cell by cell using formula in each row. But SUMPRODUCT does it all together at once and hence saving time.
SUMPRODUCT automatically does calculation of quantity X Price and then sum it up for all rows and gives us the answer. We just need to select range of data in each column - quantity and Unit Price
Same way we can also calculate profit in this data.
By selecting range in 3 columns- Quantity, Unit Price and Margin we can get the profit number quickly
How to SUM number of consecutive rows or columns in EXCEL
We have the data for the stocks of Apples and Mangoes at a grocery store.
We want to get a total for all Apples and Total for all mangoes in the store.
Here since the data is small, we can use SUM ( ) and get the answer but in a large data set, we can get this result quickly using OFFSET formula along with SUM
OFFSET helps to identify location and number of cells to add and then SUM does the addition.
Taking E5 as a reference to start locating rows we need for our calculation, our first row to start calculating from is E6. Hence in OFFSET formula, we add E5 as reference cell and then 1st row as start point. Since we are not considering to add any columns, we would add 0 for the column and then we need 4 rows from our reference row, hence would add 4 for the height in the OFFSET formula to locate our data of 4 rows.
=OFFSET(E5,1,0,4)
And then using SUM (OFFSET(E5,1,0,4)) we can get total for this 4 rows.
It is very common to use this formula while calculating weekly, monthly, quarterly or yearly totals or averages. It is also popularly used in finding running totals.
Using the same method, we can get totals in columns also.
No comments:
Post a Comment