Ways to import files in pandas (Python)
We all know that pd.read_csv is the function used to import a file into Python.
Sometimes, for various reasons, we do not need to import the complete file but only some of it. There are 5 standard parameters of pd.read_csv we can use to accomplish this.
For example, we have imported this entire dataset to python using pd.read_csv function.
1) Import data by specifying column names
2) Using the usecols parameter to keep only those columns that are needed
3) Specifying the data types of the columns that you want to import with dtype
4) Using the nrows parameter to read only a specified number of rows instead of the entire dataset
5) Using the na_values parameter to recognize particular strings as NaN
Now, instead of importing the entire dataset, we are going to import only 4 columns that are needed for our analysis.
df = pd.read_csv(r'Downloads\police.csv', usecols=['stop_date','driver_age','violation','stop_duration'], dtype={'violation':'category'})
By importing only the columns that we need, memory is saved. And if a certain column has items that consume more memory, converting its data type to 'category' at import time will save memory as well.
Another way to manage a dataset that has millions of rows is to import only some of the rows.
In the above example, the dataset has 97,000+ rows. If we want to import only 1,000 for now to get a feel for the data, we can use the nrows=1000 parameter to accomplish this.
df = pd.read_csv(r'Downloads\police.csv',usecols=['stop_date','driver_age','violation','stop_duration'],dtype={'violation':'category'},nrows=1000)
If the dataset has rows with no values and we want to mark them as NaN so we can clean them while doing data wrangling, we can use the na_values parameter at the time of importing the dataset. It will identify the missing data, mark it as NaN and import it.
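For example, if missing values in the file were recorded as the string 'N/A' or a lone hyphen (an assumption for illustration; the actual markers depend on your data), we could have them imported as NaN like this:
df = pd.read_csv(r'Downloads\police.csv', na_values=['N/A', '-'])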
__________________________________________________________________________________
Python Data types
The concept of a data type exists in every programming language. The simple understanding of it is that every variable stores data of some type, for example an integer number, a decimal number, text etc.
In Python, each of these kinds of data is known as a data type.
Data types (int, float, str, bool)
x = 6 contains an integer number, known as data type int
x = 6.5 contains a number with a decimal, known as data type float
x = 'Today is a great day' contains text, known as data type string (str)
x = True contains close-ended decisive yes/no data, known as data type bool
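If we are ever unsure what type a variable holds, Python's built-in type( ) function reveals it:
x = 6
print(type(x))    # <class 'int'>
x = 6.5
print(type(x))    # <class 'float'>
x = 'Today is a great day'
print(type(x))    # <class 'str'>
x = True
print(type(x))    # <class 'bool'>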
_________________________________________________________________________________
How to create sample dataset using pandas and numpy
Sometimes, we just want to try something on a dataframe and we don't have one available. In this situation, we can quickly create a sample dataframe using pandas' pd.DataFrame function, and we can fill it with random data using numpy's np.random.rand function.
First we import both pandas and numpy libraries
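Using the standard aliases:
import pandas as pd
import numpy as np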
Using pandas alone...
df = pd.DataFrame({'apple':[100,200,300],'orange':[400,500,600]})
df
Using pandas and numpy both...
To make more rows and columns, just pass the desired shape to the rand function as
np.random.rand(rows, columns), and provide the column names as a list via the columns argument
df1 = pd.DataFrame(np.random.rand(10,8),columns=list('ABCDEFGH'))
df1
_________________________________________________________________________________
Conditional statements in python (if, elif, else)
Check the below code and see what you can make out of it
x = 8
if x > 0:
    print('it is true')
elif x == 0:
    print('it is zero')
else:
    print('unknown')
Output: it is true
if is a conditional statement
elif (else if) gives you the option to verify other conditions, and we can use elif as many times as we need to
else provides the last option if none of the defined conditions are true
Let's say in your dataframe, there is a column for province and it lists province name for that row.
For your analysis, you don't need to go by every province, so you decide to create 3 regions from these provinces: Eastern, Central and Western
And you want to list provinces under each region
Eastern
new_brunswick, nova_scotia, prince_edward_island
Central
ontario, quebec, manitoba
Western
alberta, british_columbia, saskatchewan
Using conditional statement, you can create a new column 'region'
# assuming the dataframe df has a 'province' column, build the region list row by row
region = []
for province in df['province']:
    if (province == 'alberta') or (province == 'british_columbia') or (province == 'saskatchewan'):
        region.append('western_canada')
    elif (province == 'ontario') or (province == 'manitoba') or (province == 'quebec'):
        region.append('central_canada')
    else:
        region.append('eastern_canada')
df['region'] = region
As we can see, conditional statements are very useful in classifying or grouping the data into segments which can be easily visualized.
________________________________________________
How to define a customized function in python
Although Python has a large number of standard functions to execute the calculations or logic we need to apply to our dataset, there are always going to be situations when we need to create customized calculation logic to arrive at our desired results.
In this scenario, we can create our own function using the def (define) keyword.
A simple example is given below
Let's say we want to make a calculator which can add the two values we enter.
A simple calculation function would look like this...
def calculate(x, y, operation):
    if operation == 'add':
        return x+y
If add is the only operation we need from this calculator, then we can define it as the default operation
def calculate(x, y, operation='add'):
    return x+y
This will return the result.
As you can see, by setting a default operation of 'add', we do not need to enter it when we call the function.
But for it to carry out multiple calculations like subtract, multiply, division etc., we have to write conditions using the conditional statements if, elif and else
def calculate(a, b, operation='add'):
    if operation == 'add':
        return a+b
    elif operation == 'subtract':
        return a-b
    elif operation == 'multiply':
        return a*b
    elif operation == 'percent':
        return a/b*100
    else:
        return 'Error'
Here we are giving multiple conditions so the calculator can perform the task we need it to execute.
We can see below that since division is not defined, it gives back Error.
This is a simple application of writing our own function to meet our need.
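A few hypothetical calls illustrate this (the numbers are just for illustration):
calculate(10, 5, 'subtract')    # returns 5
calculate(10, 5)                # 'add' is the default, returns 15
calculate(10, 5, 'divide')      # division is not defined, returns 'Error'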
Another example is...
Let's say you are working in payroll and you are asked to calculate pay for employees in such a way that when they work 40 hours or less, they get paid at their standard rate, but once they work any additional hours beyond 40, those hours are paid as overtime at 1.5 times their regular rate.
The company pays a flat rate of $100 per hour to all contractors, so if there is no hourly rate mentioned, your calculation should use a default hourly rate of 100, and contractors are not paid overtime irrespective of the hours they work.
Let's try to build a customized function for this requirement
def total_pay(hours, rate_per_hour=100):
    if hours <= 40:
        return hours*rate_per_hour
    elif rate_per_hour == 100:
        # default rate means a contractor: flat rate, no overtime
        return hours*rate_per_hour
    else:
        base_pay = 40*rate_per_hour
        overtime_pay = (rate_per_hour*1.5) * (hours-40)
        return base_pay + overtime_pay
As you can see, when the hours worked are more than 40, we calculate 1.5 times the rate for only those hours worked beyond 40.
An employee whose rate is $20 per hour gets paid $800 for the first 40 hours (40 x 20), and for the extra 1 hour worked gets paid $30 (1.5 x 20 x 1).
But for a contractor working 41 hours, only the flat rate of $100 is considered, and they get paid $4,100.
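The two cases described above can be checked directly:
total_pay(41, 20)    # employee: 800 base pay + 30 overtime = 830
total_pay(41)        # contractor at the default $100 rate: 41 x 100 = 4100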
Defining a customized function is a very useful feature of python that we can use frequently for getting the results we want.
__________________________________________________________________________________
How to find bad data and fix it
When we are given any dataset, we first need to clean the data before we can analyze it. There are many considerations in this process, and the outcome differs from situation to situation, so one can't list steps that work for every dataset; the process is unique for each one.
However, the example below shows how to start the process and then ask questions or make assumptions in order to decide what to do with the bad data.
In the dataset given below, there is a column stop_duration, and we want to find out if there is any bad data in that column and, if so, how we can fix it.
First, in order to find out if there is any bad data, let's do value counts to understand the data
DF.stop_duration.value_counts(dropna=False)
DF is the dataframe, stop_duration is the column, and we are doing value counts including null values (that's why we passed dropna=False).
Apparently this column has durations classified into segments of 0-15 min, 16-30 min and 30+ min.
NaN represents null values, which means those values were not recorded, whether due to a data capture error, data corruption or some other reason. If we are not sure, we can ask the data source whether these NaN values may be genuine for some reason.
But 1 and 2 certainly look like bad data here. In this scenario, if we have to assume what they could be, stretch your imagination and think what they possibly could be.
- 1 may mean 1 hour and 2 may mean 2 hours?
- Is that a typo?
- Since there are only 2 entries (1 each) like this in the entire column, can we assume that this is bad data?
At this point, since there is only 1 record each for 1 and 2, it is safe to assume that it is bad data and we need to fix it. If there were, let's say, 1,000 each for 1 and 2, then the situation would be different: we can't just assume the data is erroneous, and we would need to ask the data source more questions to understand what the values could mean before deciding what to do with them.
Possible fix suggestions?
- 1 can be interpreted as 60 min and 2 as 120 min and add them into 30+ min records
- Mark 1 and 2 as null values
- Remove rows that have 1 and 2
For this one, we decide to mark 1 and 2 as null values, as each occurred only once.
Since NaN comes from numpy (np.nan is a numpy constant), ensure that you import the numpy library along with pandas for this code to work.
DF.loc[(DF.stop_duration=='1')|(DF.stop_duration=='2'), 'stop_duration'] = np.nan
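Re-running the same value counts confirms the fix:
DF.stop_duration.value_counts(dropna=False)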
As you can see now, 1 and 2 no longer show up in value_counts, and the NaN count has changed from 5333 to 5335.
This is just one of many scenarios of how we can find bad data and fix it.
________________________________________________
How to use the last( ) function
We have grocery purchase data for this month, and one of the fast-moving items is eggs. We have a list of customers who buy eggs regularly, and we want to know the last date each customer bought eggs this month.
We can see that the last date of buying eggs for customers Patel, Raval and Shah was 21, 17 and 29 respectively.
But how do we run code that gives us this information?
eggs.groupby('customer').Date.last()
Here, eggs is our dataframe, customer and Date are columns
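A minimal sketch of how this might look; the purchase dates here are made up for illustration, with the last dates matching the example above:
import pandas as pd

# each row is one egg purchase, Date is the day of the month
eggs = pd.DataFrame({'customer': ['Patel', 'Raval', 'Patel', 'Shah', 'Raval', 'Shah'],
                     'Date': [14, 12, 21, 25, 17, 29]})
eggs.groupby('customer').Date.last()
# customer
# Patel    21
# Raval    17
# Shah     29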
This is a very useful function; for example, it can help get a list of customers who have not bought eggs for a certain number of days so they can be contacted by a campaign.
__________________________________________________________________________
For Loop vs. List Comprehension
In Python, the same result can be accomplished by writing code in different ways. We can use whichever style we are more comfortable with.
But while working in a company, where different people feel comfortable writing code in different ways, it is important to know the different styles, as you may have to read someone else's code written differently from yours!
Let's check by an example:
A simple for loop code is...
fruits = ['orange', 'mango', 'pineapple']
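The loop itself is not reproduced here, so this is a minimal sketch of what it would look like:
for fruit in fruits:
    print(fruit)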
For getting the same result, we can also write it using a list comprehension.
We need not assign the list comprehension to anything, but if we want to, we can do that too.
We can also slice the list, or even use conditions and conditional expressions inside the comprehension; minimal sketches of these variants follow.
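Since the original examples are not reproduced here, these sketches (reusing the fruits list from above) illustrate each variant described:
# a plain comprehension that rebuilds the list (assignment is optional)
same = [fruit for fruit in fruits]

# slicing the list inside the comprehension (first two fruits only)
first_two = [fruit for fruit in fruits[:2]]

# a condition inside the comprehension: keep only names containing 'an'
with_an = [fruit for fruit in fruits if 'an' in fruit]

# a conditional expression: label short names
labels = [fruit if len(fruit) > 5 else 'short' for fruit in fruits]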
__________________________________________________________________________________
Using transform( ) in python
We are given the below data from an online grocery store, and we are trying to create a new column which gives the accumulated order count per customer. user_id identifies the customer.
This customer (user_id=1) has placed total 11 orders so far and each order has unique order_id.
We are trying to find out total no. of orders placed by customers in a separate column, so we know which customers are our top 100 customers based on no. of orders placed.
One way to obtain this information is by using count function.
orders.groupby('user_id').order_number.count()
So it means cust-1 placed 11 orders, cust 2 placed 15 orders, and so on...
This is fine but we want to show it as a separate column on orders dataframe so we can see it along with other data.
orders['ord_count'] = orders.groupby('user_id').order_number.count()
As we can see, this does not give the correct presentation of the data: groupby( ).count( ) returns one value per user_id, so assigning it back as a column misaligns with the dataframe's row index.
So, another way to accomplish this is by using transform( ), which broadcasts each group's result back to every row of the original dataframe.
orders['order_count'] = orders.groupby('user_id').order_number.transform('count')
And now we can see that the total order count is presented properly.
_________________________________________________________________________________
Nested data in python
In simple terms, data within data is generally called nested data.
A list within a list or a table within a table are examples of nested data.
Imagine complex and extensive code. You can understand that at times it is difficult to grasp quickly. Hence, a good practice is to divide it into smaller pieces and find their meaning, so gradually you can start unwrapping the whole code.
One simple example is a nested dictionary.
Now we can add a list within this dictionary to make it nested data.
We are adding cathy under kids.
So now this becomes nested data: a list within a dictionary.
And now, since kids is a list, we can apply all functions that are applicable to a list in order to access the data within it.
In this dictionary, there are key-value pairs. dad is a key and its value is john. Similarly, kids is a key and it has 2 values, peter and cathy. So the length of this dictionary is 3, as there are 3 key-value pairs.
We can access the values within the key kids, and it will show the results below.
Since kids is a list, we can also use indexing to select values within the list. If we want to find the boy within the family, we can write the below code.
This is a simplistic example, but in complex code, when we are not sure about the type of the data within the data, we can use the type function to reveal it.
Now we know that it is a list and so we can apply all functions that can be applied to any list. This way, selecting values from within the nested data becomes easier.
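A minimal sketch of the example described above (the third key, mom, and its value are assumptions added to match the stated length of 3):
family = {'dad': 'john', 'mom': 'jane', 'kids': ['peter']}   # 'mom': 'jane' is assumed for illustration
family['kids'].append('cathy')    # adding cathy under kids
len(family)               # 3 key-value pairs
family['kids']            # ['peter', 'cathy']
family['kids'][0]         # 'peter' -- the boy within the family
type(family['kids'])      # <class 'list'>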

_________________________________________________________________________________
Cumulative functions in python
Working with numbers in a dataset often requires cumulative values so that one can quickly identify totals without having to do any calculations.
This is where python's cumulative functions play their roles.
cumsum( ), cummax( ), cummin( ), cumprod( ), cumcount( )
Let's say that we have data on salespersons in a company and their sales quantities. There are 3 salespersons, and every time one of them makes a sale, it is captured as a row in this dataset.
Now we are creating a new column cumulative_sale which gives the cumulative sales after every new sale has been made, and another column cs_person which gives the cumulative sales achieved by each individual employee after each of their sales.
Any time we want to see the total sales so far, we just look at the value in the cumulative_sale column in the last row (33).
And the quantity sold by each salesperson can be found by looking at the cs_person column.
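A minimal sketch with made-up rows (the names and quantities are assumptions; the final running total of 33 matches the example above):
import pandas as pd

sales = pd.DataFrame({'sales_person': ['A', 'B', 'A', 'C', 'B', 'C'],
                      'quantity': [5, 3, 7, 4, 6, 8]})

# running total across all sales
sales['cumulative_sale'] = sales.quantity.cumsum()

# running total per salesperson
sales['cs_person'] = sales.groupby('sales_person').quantity.cumsum()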
There is another way we can use these functions too. Let's say we have a dataset of employees working at a Toyota dealership who have been selling Camrys and Corollas.
The dataframe given to us shows the employee and the model sold for every sale they made. Each row represents a sales transaction done by that employee; a sketch of the data follows after the walkthrough below.
Now our requirement is that we need to show how many Corollas and Camrys have been sold by each employee as of every row (sale).
Using cumcount( )+1, we can now see that whenever an employee sells a model, the new column count_by_employee shows the total sales done by that employee up to that point.
For example, when Shah sold a Camry, we know that this is Shah's 2nd sale for this month (1 Corolla and 1 Camry).
We can take this further so the dealership can keep track of the total number of each model sold to date.
Now when Shah sells a Camry, the dealership manager knows immediately that the dealership has sold 3 Camrys so far.
One more way we can use it is by finding per-employee, per-model data.
Now when Patel sells a Camry, the manager knows that this is the 2nd Camry sold by him, and when Shah sells a Camry, it is his 1st Camry sold this month (he already sold a Corolla before).
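A minimal sketch of the three groupings described above (the rows are made up to match the narrative; count_by_employee comes from the example, the other two column names are assumptions):
import pandas as pd

cars = pd.DataFrame({'employee': ['Patel', 'Shah', 'Patel', 'Shah'],
                     'model': ['Camry', 'Corolla', 'Camry', 'Camry']})

# running count of sales per employee
cars['count_by_employee'] = cars.groupby('employee').cumcount() + 1

# running count of each model sold across the dealership
cars['count_by_model'] = cars.groupby('model').cumcount() + 1

# running count per employee per model
cars['count_by_emp_model'] = cars.groupby(['employee', 'model']).cumcount() + 1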
As we can see there are many ways we can apply cumulative functions to make the data more meaningful.
_________________________________________
How to merge two dataframes in python
We are given 2 separate dataframes. One indicates orders and the other has products. We need to combine these 2 dataframes so that we know what products were ordered in each order.
As we can see, there is a common column in both dataframes: order_id.
So using this column, we can join these 2 dataframes with the merge method.
Before this merge:
Orders dataframe: 3,421,083 rows, 7 columns
products dataframe: 32,434,489 rows, 4 columns
The indicator=True parameter helps us to identify the source of each row (left_only/right_only/both).
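The merge call itself is not shown here; a sketch of what it likely looked like (how='outer' is an assumption, inferred from the row counts discussed below):
df = pd.merge(orders, products, on='order_id', how='outer', indicator=True)

# sample the rows that came only from the orders dataframe
df[df._merge == 'left_only'].head()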

These left_only rows are samples of order_id values which were present only in the orders dataframe and not in the products dataframe.
But when orders had 3,421,083 rows and products had 32,434,489 rows, why does the new dataframe have 32,640,698 rows? Because an outer merge keeps rows from both sides even when they do not match: assuming every products row found a matching order, the result is the 32,434,489 matched rows plus the 206,209 orders (32,640,698 - 32,434,489) that had no matching products row.