Ways to import files in pandas (Python)
We all know that pd.read_csv is the function used to import a file into Python.
Sometimes, for various reasons, we do not need to import the complete file but only some of it. There are 5 standard parameters of pd.read_csv we can use to accomplish this.
For example, we have imported this entire dataset to python using pd.read_csv function.
1) Import data by specifying column names
2) Using the usecols parameter to keep only those columns that are needed
3) Specifying the data types of the columns that you want to import with dtype
4) Using the nrows parameter to read only a specified number of rows instead of the entire dataset
5) Using the na_values parameter to recognize particular strings as NaN
Now, instead of importing the entire dataset, we are going to import only 4 columns that are needed for our analysis.
df = pd.read_csv(r'Downloads\police.csv', usecols=['stop_date','driver_age','violation','stop_duration'], dtype={'violation':'category'})
By importing only the columns that we need, memory is saved. And if a certain column has items that consume more memory, converting its data type to 'category' at import time will save memory as well.
Another way to manage a dataset that has millions of rows is to import only some of the rows.
In the above example, the dataset has 97,000+ rows. If we want to import only 1,000 for now to get a feel for the data, we can use the nrows=1000 parameter to accomplish this.
df = pd.read_csv(r'Downloads\police.csv',usecols=['stop_date','driver_age','violation','stop_duration'],dtype={'violation':'category'},nrows=1000)
If the dataset has rows with no values and we want to mark them as NaN so we can clean them while doing data wrangling, we can use the na_values parameter at the time of importing the dataset. It will identify the missing data, mark it as NaN and import it.
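For example, if missing values in the file were recorded as the string 'N/A' or a lone hyphen (an assumption for illustration; the actual markers depend on your data), we could have them imported as NaN like this:
df = pd.read_csv(r'Downloads\police.csv', na_values=['N/A', '-'])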
__________________________________________________________________________________
Python Data types
The concept of a data type exists in every programming language. The simple understanding of it is that every variable stores data of some type, for example an integer number, a decimal number, text etc.
In Python, each of these kinds of data is known as a data type.
Data types (int, float, str, bool)
x = 6 contains an integer number, known as data type int
x = 6.5 contains a number with a decimal, known as data type float
x = 'Today is a great day' contains text, known as data type string (str)
x = True contains close-ended decisive yes/no data, known as data type bool
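If we are ever unsure what type a variable holds, Python's built-in type( ) function reveals it:
x = 6
print(type(x))    # <class 'int'>
x = 6.5
print(type(x))    # <class 'float'>
x = 'Today is a great day'
print(type(x))    # <class 'str'>
x = True
print(type(x))    # <class 'bool'>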
_________________________________________________________________________________
How to create sample dataset using pandas and numpy
Sometimes, we just want to try something on a dataframe and we don't have one available. In this situation, we can quickly create a sample dataframe using pandas' pd.DataFrame function, and we can fill it with random data using numpy's np.random.rand function.
First we import both pandas and numpy libraries
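Using the standard aliases:
import pandas as pd
import numpy as np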
Using pandas alone...
df = pd.DataFrame({'apple':[100,200,300],'orange':[400,500,600]})
df
Using pandas and numpy both...
To make more rows and columns, just pass the desired shape to the rand function as
np.random.rand(rows, columns), and provide the column names as a list via the columns argument
df1 = pd.DataFrame(np.random.rand(10,8),columns=list('ABCDEFGH'))
df1
_________________________________________________________________________________
Conditional statements in python (if, elif, else)
Check the below code and see what you can make out of it
x = 8
if x > 0:
    print('it is true')
elif x == 0:
    print('it is zero')
else:
    print('unknown')
Output: it is true
if is a conditional statement
elif (else if) gives you the option to verify other conditions, and we can use elif as many times as we need to
else provides the last option if none of the defined conditions are true
Let's say in your dataframe, there is a column for province and it lists province name for that row.
For your analysis, you don't need to go by every province, so you decide to create 3 regions from these provinces: Eastern, Central and Western
And you want to list provinces under each region
Eastern
new_brunswick, nova_scotia, prince_edward_island
Central
ontario, quebec, manitoba
Western
alberta, british_columbia, saskatchewan
Using conditional statement, you can create a new column 'region'
# assuming the dataframe df has a 'province' column, build the region list row by row
region = []
for province in df['province']:
    if (province == 'alberta') or (province == 'british_columbia') or (province == 'saskatchewan'):
        region.append('western_canada')
    elif (province == 'ontario') or (province == 'manitoba') or (province == 'quebec'):
        region.append('central_canada')
    else:
        region.append('eastern_canada')
df['region'] = region
As we can see, conditional statements are very useful in classifying or grouping the data into segments which can be easily visualized.
________________________________________________
How to define a customized function in python
Although Python has a large number of standard functions to execute the calculations or logic we need to apply to our dataset, there are always going to be situations when we need to create customized calculation logic to arrive at our desired results.
In this scenario, we can create our own function using the def (define) keyword.
A simple example is given below
Let's say we want to make a calculator which can add the two values we enter.
A simple calculation function would look like this...
def calculate(x, y, operation):
    if operation == 'add':
        return x+y
If add is the only operation we need from this calculator, then we can define it as the default operation
def calculate(x, y, operation='add'):
    return x+y
This will return the result.
As you can see, by setting a default operation of 'add', we do not need to enter it when we call the function.
But for it to carry out multiple calculations like subtract, multiply, division etc., we have to write conditions using the conditional statements if, elif and else
def calculate(a, b, operation='add'):
    if operation == 'add':
        return a+b
    elif operation == 'subtract':
        return a-b
    elif operation == 'multiply':
        return a*b
    elif operation == 'percent':
        return a/b*100
    else:
        return 'Error'
Here we are giving multiple conditions so the calculator can perform the task we need it to execute.
We can see below that since division is not defined, it gives back Error.
This is a simple application of writing our own function to meet our need.
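A few hypothetical calls illustrate this (the numbers are just for illustration):
calculate(10, 5, 'subtract')    # returns 5
calculate(10, 5)                # 'add' is the default, returns 15
calculate(10, 5, 'divide')      # division is not defined, returns 'Error'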
Another example is...
Let's say you are working in payroll and you are asked to calculate pay for employees in such a way that when they work 40 hours or less, they get paid at their standard rate, but once they work any additional hours beyond 40, those hours are paid as overtime at 1.5 times their regular rate.
The company pays a flat rate of $100 per hour to all contractors, so if there is no hourly rate mentioned, your calculation should use a default hourly rate of 100, and contractors are not paid overtime irrespective of the hours they work.
Let's try to build a customized function for this requirement
def total_pay(hours, rate_per_hour=100):
    if hours <= 40:
        return hours*rate_per_hour
    elif rate_per_hour == 100:
        # default rate means a contractor: flat rate, no overtime
        return hours*rate_per_hour
    else:
        base_pay = 40*rate_per_hour
        overtime_pay = (rate_per_hour*1.5) * (hours-40)
        return base_pay + overtime_pay
As you can see, when the hours worked are more than 40, we calculate 1.5 times the rate for only those hours worked beyond 40.
An employee whose rate is $20 per hour gets paid $800 for the first 40 hours (40 x 20), and for the extra 1 hour worked gets paid $30 (1.5 x 20 x 1).
But for a contractor working 41 hours, only the flat rate of $100 is considered, and they get paid $4,100.
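The two cases described above can be checked directly:
total_pay(41, 20)    # employee: 800 base pay + 30 overtime = 830
total_pay(41)        # contractor at the default $100 rate: 41 x 100 = 4100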
Defining a customized function is a very useful feature of python that we can use frequently for getting the results we want.
__________________________________________________________________________________
How to find bad data and fix it
When we are given any dataset, we first need to clean the data before we can analyze it. There are many considerations in this process, and the outcome differs from situation to situation, so one can't list steps that work for every dataset; the process is unique for each one.
However, the example below shows how to start the process and then ask questions or make assumptions in order to decide what to do with the bad data.
In the dataset given below, there is a column stop_duration, and we want to find out if there is any bad data in that column and, if so, how we can fix it.
First, in order to find out if there is any bad data, let's do value counts to understand the data
DF.stop_duration.value_counts(dropna=False)
DF is the dataframe, stop_duration is the column, and we are doing value counts including null values (that's why we passed dropna=False).
Apparently this column has durations classified into segments of 0-15 min, 16-30 min and 30+ min.
NaN represents null values, which means those values were not recorded, whether due to a data capture error, data corruption or some other reason. If we are not sure, we can ask the data source whether these NaN values may be genuine for some reason.
But 1 and 2 certainly look like bad data here. In this scenario, if we have to assume what they could be, stretch your imagination and think what they possibly could be.
- 1 may mean 1 hour and 2 may mean 2 hours?
- Is that a typo?
- Since there are only 2 entries (1 each) like this in the entire column, can we assume that this is bad data?
At this point, since there is only 1 record each for 1 and 2, it is safe to assume that it is bad data and we need to fix it. If there were, let's say, 1,000 each for 1 and 2, then the situation would be different: we can't just assume the data is erroneous, and we would need to ask the data source more questions to understand what the values could mean before deciding what to do with them.
Possible fix suggestions?
- 1 can be interpreted as 60 min and 2 as 120 min and add them into 30+ min records
- Mark 1 and 2 as null values
- Remove rows that have 1 and 2
For this one, we decide to mark 1 and 2 as null values, as each occurred only once.
Since NaN comes from numpy (np.nan is a numpy constant), ensure that you import the numpy library along with pandas for this code to work.
DF.loc[(DF.stop_duration=='1')|(DF.stop_duration=='2'), 'stop_duration'] = np.nan
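Re-running the same value counts confirms the fix:
DF.stop_duration.value_counts(dropna=False)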
As you can see now, 1 and 2 no longer show up in value_counts, and the NaN count has changed from 5333 to 5335.
This is just one of many scenarios of how we can find bad data and fix it.
________________________________________________
How to use the last( ) function
We have grocery purchase data for this month, and one of the fast-moving items is eggs. We have a list of customers who buy eggs regularly, and we want to know the last date each customer bought eggs this month.
We can see that the last date of buying eggs for customers Patel, Raval and Shah was 21, 17 and 29 respectively.
But how do we run code that gives us this information?
eggs.groupby('customer').Date.last()
Here, eggs is our dataframe, customer and Date are columns
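A minimal sketch of how this might look; the purchase dates here are made up for illustration, with the last dates matching the example above:
import pandas as pd

# each row is one egg purchase, Date is the day of the month
eggs = pd.DataFrame({'customer': ['Patel', 'Raval', 'Patel', 'Shah', 'Raval', 'Shah'],
                     'Date': [14, 12, 21, 25, 17, 29]})
eggs.groupby('customer').Date.last()
# customer
# Patel    21
# Raval    17
# Shah     29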
This is a very useful function; for example, it can help get a list of customers who have not bought eggs for a certain number of days so they can be contacted by a campaign.
__________________________________________________________________________
For Loop vs. List Comprehension
In Python, the same result can be accomplished by writing code in different ways. We can use whichever style we are more comfortable with.
But while working in a company, where different people feel comfortable writing code in different ways, it is important to know the different styles, as you may have to read someone else's code written differently from yours!
Let's check by an example:
A simple for loop code is...
fruits = ['orange', 'mango', 'pineapple']
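The loop itself is not reproduced here, so this is a minimal sketch of what it would look like:
for fruit in fruits:
    print(fruit)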
For getting the same result, we can also write it using a list comprehension.
We need not assign the list comprehension to anything, but if we want to, we can do that too.
We can also slice the list, or even use conditions and conditional expressions inside the comprehension; minimal sketches of these variants follow.
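Since the original examples are not reproduced here, these sketches (reusing the fruits list from above) illustrate each variant described:
# a plain comprehension that rebuilds the list (assignment is optional)
same = [fruit for fruit in fruits]

# slicing the list inside the comprehension (first two fruits only)
first_two = [fruit for fruit in fruits[:2]]

# a condition inside the comprehension: keep only names containing 'an'
with_an = [fruit for fruit in fruits if 'an' in fruit]

# a conditional expression: label short names
labels = [fruit if len(fruit) > 5 else 'short' for fruit in fruits]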
__________________________________________________________________________________
Using transform( ) in python
We are given the below data from an online grocery store, and we are trying to create a new column which gives the accumulated order count per customer. user_id identifies the customer.
This customer (user_id=1) has placed total 11 orders so far and each order has unique order_id.
We are trying to find out total no. of orders placed by customers in a separate column, so we know which customers are our top 100 customers based on no. of orders placed.
One way to obtain this information is by using count function.
orders.groupby('user_id').order_number.count()
So it means cust-1 placed 11 orders, cust 2 placed 15 orders, and so on...
This is fine but we want to show it as a separate column on orders dataframe so we can see it along with other data.
orders['ord_count'] = orders.groupby('user_id').order_number.count()
As we can see, this does not give the correct presentation of the data: groupby( ).count( ) returns one value per user_id, so assigning it back as a column misaligns with the dataframe's row index.
So, another way to accomplish this is by using transform( ), which broadcasts each group's result back to every row of the original dataframe.
orders['order_count'] = orders.groupby('user_id').order_number.transform('count')
And now we can see that the total order count is presented properly.
_________________________________________________________________________________
Nested data in python
In simple terms, data within data is generally called nested data.
A list within a list or a table within a table are examples of nested data.
Imagine complex and extensive code. You can understand that at times it is difficult to grasp quickly. Hence, a good practice is to divide it into smaller pieces and find their meaning, so gradually you can start unwrapping the whole code.
One simple example is a nested dictionary.
Now we can add a list within this dictionary to make it nested data.
We are adding cathy under kids.
So now this becomes nested data: a list within a dictionary.
And now, since kids is a list, we can apply all functions that are applicable to a list in order to access the data within it.
In this dictionary, there are key-value pairs. dad is a key and its value is john. Similarly, kids is a key and it has 2 values, peter and cathy. So the length of this dictionary is 3, as there are 3 key-value pairs.
We can access the values within the key kids, and it will show the results below.
Since kids is a list, we can also use indexing to select values within the list. If we want to find the boy within the family, we can write the below code.
This is a simplistic example, but in complex code, when we are not sure about the type of the data within the data, we can use the type function to reveal it.
Now we know that it is a list and so we can apply all functions that can be applied to any list. This way, selecting values from within the nested data becomes easier.
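A minimal sketch of the example described above (the third key, mom, and its value are assumptions added to match the stated length of 3):
family = {'dad': 'john', 'mom': 'jane', 'kids': ['peter']}   # 'mom': 'jane' is assumed for illustration
family['kids'].append('cathy')    # adding cathy under kids
len(family)               # 3 key-value pairs
family['kids']            # ['peter', 'cathy']
family['kids'][0]         # 'peter' -- the boy within the family
type(family['kids'])      # <class 'list'>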

_________________________________________________________________________________
Cumulative functions in python
Working with numbers in a dataset often requires cumulative values so that one can quickly identify totals without having to do any calculations.
This is where python's cumulative functions play their roles.
cumsum( ), cummax( ), cummin( ), cumprod( ), cumcount( )
Let's say that we have data on salespersons in a company and their sales quantities. There are 3 salespersons, and every time one of them makes a sale, it is captured as a row in this dataset.
Now we are creating a new column cumulative_sale which gives the cumulative sales after every new sale has been made, and another column cs_person which gives the cumulative sales achieved by each individual employee after each of their sales.
Any time we want to see the total sales so far, we just look at the value in the cumulative_sale column in the last row (33).
And the quantity sold by each salesperson can be found by looking at the cs_person column.
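A minimal sketch with made-up rows (the names and quantities are assumptions; the final running total of 33 matches the example above):
import pandas as pd

sales = pd.DataFrame({'sales_person': ['A', 'B', 'A', 'C', 'B', 'C'],
                      'quantity': [5, 3, 7, 4, 6, 8]})

# running total across all sales
sales['cumulative_sale'] = sales.quantity.cumsum()

# running total per salesperson
sales['cs_person'] = sales.groupby('sales_person').quantity.cumsum()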
There is another way we can use these functions too. Let's say we have a dataset of employees working at a Toyota dealership who have been selling Camrys and Corollas.
The dataframe given to us shows the employee and the model sold for every sale they made. Each row represents a sales transaction done by that employee; a sketch of the data follows after the walkthrough below.
Now our requirement is that we need to show how many Corollas and Camrys have been sold by each employee as of every row (sale).
Using cumcount( )+1, we can now see that whenever an employee sells a model, the new column count_by_employee shows the total sales done by that employee up to that point.
For example, when Shah sold a Camry, we know that this is Shah's 2nd sale for this month (1 Corolla and 1 Camry).
We can take this further so the dealership can keep track of the total number of each model sold to date.
Now when Shah sells a Camry, the dealership manager knows immediately that the dealership has sold 3 Camrys so far.
One more way we can use it is by finding per-employee, per-model data.
Now when Patel sells a Camry, the manager knows that this is the 2nd Camry sold by him, and when Shah sells a Camry, it is his 1st Camry sold this month (he already sold a Corolla before).
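A minimal sketch of the three groupings described above (the rows are made up to match the narrative; count_by_employee comes from the example, the other two column names are assumptions):
import pandas as pd

cars = pd.DataFrame({'employee': ['Patel', 'Shah', 'Patel', 'Shah'],
                     'model': ['Camry', 'Corolla', 'Camry', 'Camry']})

# running count of sales per employee
cars['count_by_employee'] = cars.groupby('employee').cumcount() + 1

# running count of each model sold across the dealership
cars['count_by_model'] = cars.groupby('model').cumcount() + 1

# running count per employee per model
cars['count_by_emp_model'] = cars.groupby(['employee', 'model']).cumcount() + 1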
As we can see there are many ways we can apply cumulative functions to make the data more meaningful.
_________________________________________
How to merge two dataframes in python
We are given 2 separate dataframes. One indicates orders and the other has products. We need to combine these 2 dataframes so that we know what products were ordered in each order.
As we can see, there is a common column in both dataframes: order_id.
So using this column, we can join these 2 dataframes with the merge method.
Before this merge:
Orders dataframe: 3,421,083 rows, 7 columns
products dataframe: 32,434,489 rows, 4 columns
The indicator=True parameter helps us to identify the source of each row (left_only/right_only/both).
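The merge call itself is not shown here; a sketch of what it likely looked like (how='outer' is an assumption, inferred from the row counts discussed below):
df = pd.merge(orders, products, on='order_id', how='outer', indicator=True)

# sample the rows that came only from the orders dataframe
df[df._merge == 'left_only'].head()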

These left_only rows are samples of order_id values which were present only in the orders dataframe and not in the products dataframe.
But when orders had 3,421,083 rows and products had 32,434,489 rows, why does the new dataframe have 32,640,698 rows? Because an outer merge keeps rows from both sides even when they do not match: assuming every products row found a matching order, the result is the 32,434,489 matched rows plus the 206,209 orders (32,640,698 - 32,434,489) that had no matching products row.