We are given 2 separate dataframes. One indicates orders and the other has products. We need to combine these 2 dataframes so that we know what products were ordered in each order.
As we can see there is a common column in both dataframe- order_id
So using this we can join these 2 dataframes by merge method.
Before this merge,
Orders dataframe has rows 3421083, columns 7
products dataframe has rows 32434489, columns 4
indicator=True function helps us to identify the source of each row(left/right/both)

What does this mean?
orders is a smaller dataframe but since we need all information provided in the columns of that dataframe, we decided to merge both dataframes using orders as a "must have" data.
So in the combined(merged) dataframe, we can see that 206209 user_id rows were not there in products dataframe. Hence after merge, merge indicator is showing it as left only.
The above is the screenshot of samples of order_id which were there only in orders dataframe and were not in products dataframe.
We can verify by taking a sample order_id 1187899 if it is there in products dataframe or not
This verifies that our understanding is correct about those 206209 order_id that are only in orders dataframe.The new dataframe order_product has total 11 rows (7 from orders and 4 from products dataframe)
But when orders had 3421083 rows and products had 32434489 rows, why new dataframe has 32640698 rows?
Because 32434489+206209 = 32640698
Since we chose to merge using "left" option, orders had unique 206209 rows and they were added to the 32434489 rows of products and new dataframe has 32640698 rows.
As you can see, merge operations is easy and straightforward but after both dataframes are merged, it requires some introspection to ensure that the data merged as per our expectation.
There are other ways of joining 2 dataframes too but that we will discuss in another post.
No comments:
Post a Comment