How to find bad data and fix it

When we are given any dataset, we first need to clean the data before we can analyze it. There are many considerations in this process, and the right outcome differs from one situation to the next, so there is no single checklist that works for every dataset; each one is unique.

However, the example below shows how to start the process, and then how asking questions or making some assumptions helps us decide what to do with the bad data.

In the dataset given below, there is a column stop_duration, and we want to find out whether there is any bad data in that column and, if so, how to fix it.


First, in order to find out whether there is any bad data, let's do a value count to understand the data:

DF.stop_duration.value_counts(dropna=False)

DF is the dataframe and stop_duration is the column; we are counting values including nulls (that's why we passed dropna=False in the parentheses).
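For context, the setup might look like the sketch below. The file name police.csv is just a placeholder, and most of the counts in the sample output are illustrative; only the NaN count of 5333 and the single occurrences of 1 and 2 come from this walkthrough.

import numpy as np
import pandas as pd

DF = pd.read_csv('police.csv')   # file name is a placeholder

# Running DF.stop_duration.value_counts(dropna=False) then prints
# something along these lines (bucket counts illustrative):
#
# 0-15 min     69543
# 16-30 min    13635
# NaN           5333
# 30+ min       3228
# 2                1
# 1                1
# Name: stop_duration, dtype: int64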


Apparently this column classifies durations into segments of 0-15 min, 16-30 min, and 30+ min.

NaN represents null values, meaning those values were not recorded, whether due to a data capture error, data corruption, or some other reason. If we are not sure, we can ask the data source whether these NaN values may be genuine for some reason.

But 1 and 2 certainly look like bad data here. If we have to guess what they could be, stretch your imagination and think about what they could possibly mean:

  • 1 may mean 1 hour and 2 may mean 2 hours?
  • Is it a typo?
  • Since there are only 2 such entries (1 each) in the entire column, can we assume this is bad data?

At this point, since there is only one record each for 1 and 2, it is safe to assume they are bad data that we need to fix. If there were, say, 1,000 records each for 1 and 2, the situation would be different: we couldn't simply assume the data is erroneous, and we would need to ask the data source more questions to understand what the values mean before deciding what to do with them.
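Before assuming anything, it can also help to eyeball the offending rows themselves; a quick filter with pandas' isin method does the job:

# Show the full rows whose stop_duration is '1' or '2'
DF[DF.stop_duration.isin(['1', '2'])]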

Possible fix suggestions (code sketches follow the list):

  • Interpret 1 as 60 min and 2 as 120 min and fold them into the 30+ min records
  • Mark 1 and 2 as null values
  • Remove the rows that have 1 and 2
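For reference, the first and third options could look roughly like this in pandas; they are alternatives, so you would pick one. The second option, which we go with below, is shown in the next snippet.

# Option 1: treat 1 as 60 min and 2 as 120 min, i.e. fold both into 30+ min
DF['stop_duration'] = DF['stop_duration'].replace({'1': '30+ min', '2': '30+ min'})

# Option 3: drop the offending rows entirely
DF = DF[~DF.stop_duration.isin(['1', '2'])]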

For this one, we decide to mark 1 and 2 as null values, since each occurred only once.

Since NaN comes from NumPy (as np.nan), make sure you import the numpy library along with pandas for this code to work.

# Set stop_duration to NaN wherever it is '1' or '2'
DF.loc[DF.stop_duration.isin(['1', '2']), 'stop_duration'] = np.nan
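An equivalent one-liner uses pandas' replace method instead of boolean indexing; either form gives the same result:

DF['stop_duration'] = DF['stop_duration'].replace(['1', '2'], np.nan)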


As you can see, 1 and 2 no longer show up in value_counts, and the NaN count has changed from 5333 to 5335.
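The same check can also be done programmatically as a small sanity test:

# Neither bad value should remain, and nulls grew by the two fixed rows
assert not DF.stop_duration.isin(['1', '2']).any()
print(DF.stop_duration.isna().sum())   # 5335, the original 5333 plus 2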

This is just one of many scenarios showing how we can find bad data and fix it.
