How to find bad data and fix it

When we are given any dataset, we first need to clean the data before we can analyze it. There are many considerations in this process, and the right outcome differs from one situation to the next, so there is no single checklist that works for every dataset; each one is unique.

However, the example below shows how to start the process, and then how asking questions or making some assumptions helps us decide what to do with the bad data.

In the dataset given below, there is a column stop_duration, and we want to find out whether there is any bad data in that column and, if so, how to fix it.


First, in order to find out whether there is any bad data, let's do a value count to understand the data:

DF.stop_duration.value_counts(dropna=False)

DF is the dataframe and stop_duration is the column; we are counting values including nulls (that's why we passed dropna=False in the parentheses).
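For context, the setup might look like the sketch below. The file name police.csv is just a placeholder, and most of the counts in the sample output are illustrative; only the NaN count of 5333 and the single occurrences of 1 and 2 come from this walkthrough.

import numpy as np
import pandas as pd

DF = pd.read_csv('police.csv')   # file name is a placeholder

# Running DF.stop_duration.value_counts(dropna=False) then prints
# something along these lines (bucket counts illustrative):
#
# 0-15 min     69543
# 16-30 min    13635
# NaN           5333
# 30+ min       3228
# 2                1
# 1                1
# Name: stop_duration, dtype: int64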


Apparently this column classifies durations into segments of 0-15 min, 16-30 min, and 30+ min.

NaN represents null values, meaning those values were not recorded, whether due to a data capture error, data corruption, or some other reason. If we are not sure, we can ask the data source whether these NaN values may be genuine for some reason.

But 1 and 2 certainly look like bad data here. If we have to guess what they could be, stretch your imagination and think about what they could possibly mean:

  • 1 may mean 1 hour and 2 may mean 2 hours?
  • Is it a typo?
  • Since there are only 2 such entries (1 each) in the entire column, can we assume this is bad data?

At this point, since there is only one record each for 1 and 2, it is safe to assume they are bad data that we need to fix. If there were, say, 1,000 records each for 1 and 2, the situation would be different: we couldn't simply assume the data is erroneous, and we would need to ask the data source more questions to understand what the values mean before deciding what to do with them.
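Before assuming anything, it can also help to eyeball the offending rows themselves; a quick filter with pandas' isin method does the job:

# Show the full rows whose stop_duration is '1' or '2'
DF[DF.stop_duration.isin(['1', '2'])]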

Possible fix suggestions (code sketches follow the list):

  • Interpret 1 as 60 min and 2 as 120 min and fold them into the 30+ min records
  • Mark 1 and 2 as null values
  • Remove the rows that have 1 and 2
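For reference, the first and third options could look roughly like this in pandas; they are alternatives, so you would pick one. The second option, which we go with below, is shown in the next snippet.

# Option 1: treat 1 as 60 min and 2 as 120 min, i.e. fold both into 30+ min
DF['stop_duration'] = DF['stop_duration'].replace({'1': '30+ min', '2': '30+ min'})

# Option 3: drop the offending rows entirely
DF = DF[~DF.stop_duration.isin(['1', '2'])]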

For this one, we decide to mark 1 and 2 as null values, since each occurred only once.

Since NaN comes from NumPy (as np.nan), make sure you import the numpy library along with pandas for this code to work.

# Set stop_duration to NaN wherever it is '1' or '2'
DF.loc[DF.stop_duration.isin(['1', '2']), 'stop_duration'] = np.nan
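An equivalent one-liner uses pandas' replace method instead of boolean indexing; either form gives the same result:

DF['stop_duration'] = DF['stop_duration'].replace(['1', '2'], np.nan)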


As you can see, 1 and 2 no longer show up in value_counts, and the NaN count has changed from 5333 to 5335.
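The same check can also be done programmatically as a small sanity test:

# Neither bad value should remain, and nulls grew by the two fixed rows
assert not DF.stop_duration.isin(['1', '2']).any()
print(DF.stop_duration.isna().sum())   # 5335, the original 5333 plus 2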

This is just one of many scenarios showing how we can find bad data and fix it.
