Ways to import files in pandas (Python)

We all know that pd.read_csv is the function used to import a CSV file into Python.

Sometimes, for various reasons, we do not need to import the complete file but only part of it. Five standard read_csv parameters, listed below, help accomplish this.

For example, suppose we have imported the entire police.csv dataset into Python using the pd.read_csv function. The five options are:



1) Import data by specifying column names (the names parameter)

2) Use the usecols parameter to keep only the columns that are needed

3) Specify the data types of the columns you want to import (the dtype parameter)

4) Use the nrows parameter to read only a specified number of rows instead of the entire dataset

5) Use the na_values parameter to recognize additional strings as NaN
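
Option 1 is not used in the examples that follow, so here is a minimal sketch of it, assuming it refers to the names parameter of read_csv. The small CSV text is made up for illustration (it is not the police.csv file) and deliberately has no header row.

```python
import io
import pandas as pd

# Hypothetical sample data without a header row (not the real police.csv)
raw = io.StringIO("2005-01-02,20,Speeding\n2005-01-18,40,Equipment\n")

# header=None tells pandas the file has no header line;
# names supplies the column names to use instead
df = pd.read_csv(raw, header=None, names=["stop_date", "driver_age", "violation"])
print(df.columns.tolist())
```

If the file does have a header row and you only want to rename the columns, names can be combined with header=0 so that the original header line is replaced rather than read as data.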

Now, instead of importing the entire dataset, we are going to import only 4 columns that are needed for our analysis.

df = pd.read_csv(r'Downloads\police.csv', usecols=['stop_date', 'driver_age', 'violation', 'stop_duration'], dtype={'violation': 'category'})


Importing only the columns we need saves memory. In addition, if a column holds a small set of string values repeated many times, converting it to the 'category' data type at import time (through the dtype parameter) saves further memory.
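
To see the effect of the 'category' data type, the following sketch compares the memory used by the same column read with and without it. The data is generated in memory and is only an assumption for illustration; the actual savings on police.csv will differ.

```python
import io
import pandas as pd

# Made-up data: two distinct violation values repeated many times
rows = ["stop_date,violation"] + ["2005-01-02,Speeding", "2005-01-18,Equipment"] * 5000
csv_text = "\n".join(rows)

# Read once with the default string type and once as 'category'
as_default = pd.read_csv(io.StringIO(csv_text))
as_category = pd.read_csv(io.StringIO(csv_text), dtype={"violation": "category"})

# deep=True counts the memory of the string values themselves
default_bytes = as_default["violation"].memory_usage(deep=True)
category_bytes = as_category["violation"].memory_usage(deep=True)
print(default_bytes, category_bytes)
```

The saving comes from storing each distinct value once and keeping only small integer codes per row, so it is largest when a column has few unique values.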

Another way to handle a dataset that has millions of rows is to import only some of them.

In the example above, the dataset has more than 97,000 rows. If we want to import only the first 1,000 for now to get a feel for the data, we can pass the nrows=1000 parameter.

df = pd.read_csv(r'Downloads\police.csv', usecols=['stop_date', 'driver_age', 'violation', 'stop_duration'], dtype={'violation': 'category'}, nrows=1000)
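
A quick way to check what nrows does is to try it on a tiny in-memory CSV. The sample text below is hypothetical and stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical five-row sample standing in for a much larger file
csv_text = (
    "stop_date,driver_age\n"
    "2005-01-02,20\n"
    "2005-01-18,40\n"
    "2005-01-23,33\n"
    "2005-02-20,19\n"
    "2005-03-14,21\n"
)

# nrows counts data rows from the top of the file; the header line is not counted
preview = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(preview)
```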


If the dataset has missing values and we want them marked as NaN so we can clean them during data wrangling, we can pass the na_values parameter when importing the dataset. Empty fields are already read as NaN by default; na_values lists additional strings, such as placeholder text, that should also be treated as NaN.
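
As a minimal sketch of na_values, assume a file in which missing entries were typed as 'unknown' or '--'. Both placeholders and the sample rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: 'unknown' and '--' stand for missing data,
# and the last row also has an empty driver_age field
raw = io.StringIO("driver_age,violation\n20,Speeding\nunknown,Equipment\n,--\n")

# Empty fields become NaN by default; na_values adds our placeholders
df = pd.read_csv(raw, na_values=["unknown", "--"])
print(df.isna().sum())
```

na_values also accepts a dictionary keyed by column name, which is useful when a placeholder should count as missing in one column but is a legitimate value in another.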


