Complex query example

Requirement:

 You are given a table with the titles of recipes from a cookbook and their page numbers. You are asked to represent how the recipes will be distributed in the book.

Produce a table consisting of three columns: left_page_number, left_title and right_title. The k-th row (counting from 0) should contain the number and the title of the page with number 2*k in the first and second columns respectively, and the title of the page with number 2*k+1 in the third column.

Each page contains at most 1 recipe. If a page does not contain a recipe, the corresponding cell should remain empty (NULL value). Page 0 (the inside of the front cover) is guaranteed to be empty.

Data:

Desired Output:



Logic: 
1) Get page numbers from 0 to 15 in a column
2) Segregate titles based on odd/even rows
3) Reduce rows from 16 to 8


We can use the generate_series function to derive numbers from 0 to 15.
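A minimal sketch of this step in Postgres (the table and column names cookbook_titles, page_number and title are assumed throughout, since the original screenshots are not reproduced here):

```sql
-- generate page numbers 0 through 15, one per row
SELECT generate_series(0, 15) AS page_number;
```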


But what if more pages are added to this recipe book in the future? In that case, instead of hardcoding page numbers from 0 to 15, we can let a subquery decide how many page numbers are needed.
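A dynamic version might look like this (again assuming the cookbook_titles table with a page_number column):

```sql
-- let the data decide the upper bound instead of hardcoding 15
SELECT generate_series(0, (SELECT MAX(page_number) FROM cookbook_titles)) AS page_number;
```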


Now we have 16 rows in the generate_series result, but the desired output has only 8 rows.

We can use a left join to connect the generate_series result and cookbook_titles.

And to make it easier to follow, we can define CTEs for these 2 queries.
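Putting the two pieces together (a sketch, with the same assumed names):

```sql
-- every page number, with its title where one exists (NULL otherwise)
WITH pages AS (
    SELECT generate_series(0, (SELECT MAX(page_number) FROM cookbook_titles)) AS page_number
)
SELECT p.page_number, c.title
FROM pages p
LEFT JOIN cookbook_titles c ON c.page_number = p.page_number;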




Now we have the 2 columns we required, and the task from here is to split titles using a conditional CASE statement. Since our desired output pairs each even page with the odd page that follows it, we can use whether the page number is odd or even as the criterion to place each title in the left_title or right_title column.
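A sketch of that step, using the modulo operator on the assumed page_number column:

```sql
WITH pages AS (
    SELECT generate_series(0, (SELECT MAX(page_number) FROM cookbook_titles)) AS page_number
)
SELECT p.page_number,
       CASE WHEN p.page_number % 2 = 0 THEN c.title END AS left_title,
       CASE WHEN p.page_number % 2 = 1 THEN c.title END AS right_title
FROM pages p
LEFT JOIN cookbook_titles c ON c.page_number = p.page_number;
```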


Now we have segregated the recipes into left_title and right_title columns, but we still have 16 rows while our desired output has only 8. We can use GROUP BY to reduce the number of rows from 16 to 8 while keeping the data in the left_title and right_title columns.
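The full query might then look like this: grouping by page_number / 2 (integer division) collapses each pair of pages (2k, 2k+1) into one row, and MAX keeps the non-NULL title from each pair.

```sql
WITH pages AS (
    SELECT generate_series(0, (SELECT MAX(page_number) FROM cookbook_titles)) AS page_number
)
SELECT MIN(p.page_number) AS left_page_number,          -- 2k, the even page of the pair
       MAX(CASE WHEN p.page_number % 2 = 0 THEN c.title END) AS left_title,
       MAX(CASE WHEN p.page_number % 2 = 1 THEN c.title END) AS right_title
FROM pages p
LEFT JOIN cookbook_titles c ON c.page_number = p.page_number
GROUP BY p.page_number / 2
ORDER BY left_page_number;
```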









When to use self join in SQL

Requirement:

 Find the acceptance rate of Facebook friend requests by date. The date to be used for the calculation is the date on which the request was sent.

We have a dataset of Facebook friend requests.


Let us list down steps we need to take to accomplish the intended result.

  • Find total number of requests accepted (A)
  • Find total number of requests sent (B)
  • Divide A by B to get the result
Approach:
Although in SQL we can arrive at the same result in multiple ways,
for this type of scenario, where we have to match one record in a table with another record in the same table, a self join is usually the best option.

We will self join this table to match sent and accepted records and figure out how many sent requests were accepted.

Something like this...
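The table from the screenshot isn't reproduced here, so this sketch assumes columns requester_id, requested_id, request_date and action (with values 'sent' and 'accepted'):

```sql
-- for each sent request, look for a matching accepted row in the same table
SELECT s.request_date,
       COUNT(a.action) AS num_accepted        -- counts only rows that found a match
FROM fb_friend_requests s
LEFT JOIN fb_friend_requests a
       ON a.requester_id = s.requester_id
      AND a.requested_id = s.requested_id
      AND a.action = 'accepted'
WHERE s.action = 'sent'
GROUP BY s.request_date;
```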


Next, we can now find total number of requests sent by date...
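With the same assumed column names, counting sent requests per date is a plain aggregation:

```sql
SELECT request_date, COUNT(*) AS num_sent
FROM fb_friend_requests
WHERE action = 'sent'
GROUP BY request_date;
```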


Now, at this point we have 2 options to join these two results: a CTE or a subquery.

For performance and presentation reasons, I am going to do it with a CTE using the WITH clause...




Now it's a question of doing the simple math: dividing the 2 columns to get the acceptance rate.
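Combining both queries as CTEs and dividing, the end-to-end sketch (same assumed table and column names) might look like:

```sql
WITH accepted AS (
    SELECT s.request_date, COUNT(a.action) AS num_accepted
    FROM fb_friend_requests s
    LEFT JOIN fb_friend_requests a
           ON a.requester_id = s.requester_id
          AND a.requested_id = s.requested_id
          AND a.action = 'accepted'
    WHERE s.action = 'sent'
    GROUP BY s.request_date
),
sent AS (
    SELECT request_date, COUNT(*) AS num_sent
    FROM fb_friend_requests
    WHERE action = 'sent'
    GROUP BY request_date
)
-- cast to numeric so the division isn't integer division
SELECT s.request_date,
       ROUND(a.num_accepted::numeric / s.num_sent, 2) AS acceptance_rate
FROM sent s
JOIN accepted a ON a.request_date = s.request_date;
```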






How to SUM a number of consecutive rows or columns in Excel

 We have data for the stock of apples and mangoes at a grocery store.




We want to get a total for all apples and a total for all mangoes in the store.

Here, since the data is small, we can use SUM( ) directly and get the answer, but in a large data set we can get this result quickly using the OFFSET formula along with SUM.

OFFSET helps to identify location and number of cells to add and then SUM does the addition.

Taking E5 as the reference point, the first row we want in our calculation is E6. Hence, in the OFFSET formula, we use E5 as the reference cell and 1 as the row offset. Since we are not shifting across any columns, we use 0 for the column offset, and since we need 4 rows starting from our first row, we use 4 for the height. This locates our data range of 4 rows.

=OFFSET(E5,1,0,4)

And then, using SUM(OFFSET(E5,1,0,4)), we can get the total for these 4 rows.


It is very common to use this formula while calculating weekly, monthly, quarterly or yearly totals or averages. It is also popularly used in finding running totals.

Using the same method, we can get totals in columns also.


How to use LEAD function in SQL

We are given a table with details of login_id, user_name and login_date. This data is from an e-commerce portal, where we want to know if a user logged in 3 times or more in a day.





We can see that Stewart logged in 3 times on Feb 16 hence that is our expected result.

Idea:

If the user name in the current row is equal to the user name in the next row (LEAD), AND the user name in the current row is equal to the user name two rows ahead (LEAD), AND the same conditions hold for the login date, then that is the user name we are looking for.

This is how we can accomplish it in a query
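The query from the screenshot isn't reproduced here; a sketch of the idea, assuming a logins table with the columns named above, might look like:

```sql
-- compare each row with the next row and the one after it
SELECT user_name, login_date,
       CASE WHEN user_name  = LEAD(user_name, 1)  OVER (ORDER BY user_name, login_date)
             AND user_name  = LEAD(user_name, 2)  OVER (ORDER BY user_name, login_date)
             AND login_date = LEAD(login_date, 1) OVER (ORDER BY user_name, login_date)
             AND login_date = LEAD(login_date, 2) OVER (ORDER BY user_name, login_date)
            THEN user_name
       END AS repeat_user
FROM logins;
```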




You can see here that we are using LEAD(user_name, 1) for the next row and LEAD(user_name, 2) for the row after that.

Now that the hard part is completed, it is a matter of using WHERE to choose every row where the repeat_user column is NOT NULL to get our answer.
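Since window functions can't appear directly in WHERE, the filter goes around the previous query as a subquery (a sketch, same assumed names):

```sql
SELECT DISTINCT repeat_user
FROM (
    SELECT CASE WHEN user_name  = LEAD(user_name, 1)  OVER (ORDER BY user_name, login_date)
                 AND user_name  = LEAD(user_name, 2)  OVER (ORDER BY user_name, login_date)
                 AND login_date = LEAD(login_date, 1) OVER (ORDER BY user_name, login_date)
                 AND login_date = LEAD(login_date, 2) OVER (ORDER BY user_name, login_date)
                THEN user_name
           END AS repeat_user
    FROM logins
) t
WHERE repeat_user IS NOT NULL;
```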


The same result can also be obtained with another idea: selecting the row above and the row below the current row and matching the user name and login date.

Like this...
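A sketch of that variant, using LAG for the previous row and LEAD for the next (same assumed names):

```sql
SELECT DISTINCT user_name
FROM (
    SELECT user_name, login_date,
           LAG(user_name)   OVER (ORDER BY user_name, login_date) AS prev_user,
           LAG(login_date)  OVER (ORDER BY user_name, login_date) AS prev_date,
           LEAD(user_name)  OVER (ORDER BY user_name, login_date) AS next_user,
           LEAD(login_date) OVER (ORDER BY user_name, login_date) AS next_date
    FROM logins
) t
-- the current row is the middle of 3 logins by the same user on the same date
WHERE user_name = prev_user AND user_name = next_user
  AND login_date = prev_date AND login_date = next_date;
```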



How to find duplicate values in SQL using a window function

 We are given an email table which includes user_id, user_name and email,

and our task is to find duplicate records in this table.



One way to do this would be by using a simple GROUP BY.
Since user_id is the primary key and user_name is supposed to be unique for each customer, we can check whether there are any duplicates in those columns.

First, let's check user_id
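A sketch of the check (table and column names as given above):

```sql
SELECT user_id, COUNT(*)
FROM email
GROUP BY user_id
HAVING COUNT(*) > 1;
```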


There are no duplicates found in user_id column.

Now, let's check user_name column
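Same pattern, this time on user_name:

```sql
SELECT user_name, COUNT(*)
FROM email
GROUP BY user_name
HAVING COUNT(*) > 1;
```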


We found the duplicate user_name Keeny, so that is the record we will investigate to find out why it is duplicated, and then either correct it (same user_name but different email address) or remove it from the list (same user_name and same email address).

How can we get this result using a window function?

Idea

We can add a row number column and assign row numbers to these records. The row numbers will run in order from 1 to 5 (same as user_id).



And then, if we partition them by user_name (which should be unique for each user), any duplicated user_name will receive more than one row number within its partition (1, 2, ...).


Something like this...
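A sketch of the partitioned row numbers (same assumed table):

```sql
SELECT user_id, user_name, email,
       ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY user_id) AS row_num
FROM email;
```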



And now it is an easy task to pick out the rows whose row number is greater than 1, which mark the duplicates.
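Again wrapping the window function in a subquery so it can be filtered:

```sql
SELECT user_name
FROM (
    SELECT user_name,
           ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY user_id) AS row_num
    FROM email
) t
WHERE row_num > 1;
```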


So we know that Keeny is a duplicate user_name.

Using this method we can find duplicates in any column. We just have to make sure that the column we take as a reference (user_name in this case) is supposed to contain unique values.

In this table, both user_id and user_name are supposed to be unique. After checking user_id and finding no duplicates, we checked the other unique column and found a duplicate.

If we had not found a duplicate using user_name, we could also check the email column, as each user is supposed to have a unique email address.

Difference between a correlated subquery and a window function

 We are given an employees table as below.


And our requirement is to get an output with 3 columns: first name, department, and total number of employees per department.

One of the ways to achieve this is by using a correlated subquery.

Something like this...
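A sketch of the correlated subquery, assuming the columns are named first_name and department:

```sql
-- the inner query re-runs for every row of the outer query
SELECT first_name, department,
       (SELECT COUNT(*)
        FROM employees e2
        WHERE e2.department = e1.department) AS total_employees
FROM employees e1;
```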

We are relating the inner query (subquery) to the outer query on the department column, so the inner query is executed for every record in the outer query.

If the employees table has 1000 records, the inner query runs for all 1000 records. Understandably, this is a very slow process for large tables, and also an expensive one, as these days most companies pay for cloud usage on a per-time-used basis.


An alternative way to do this is by using the window function OVER( ).
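The window-function version (same assumed column names):

```sql
-- one pass over the table; the count is computed per department partition
SELECT first_name, department,
       COUNT(*) OVER (PARTITION BY department) AS total_employees
FROM employees;
```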


The advantage is that it is much quicker, and we can use it to group data by any column without using GROUP BY.

We can use OVER( ) with any aggregate function, like SUM, AVG, MAX or MIN.

And if we want to ensure that the same records are returned by both methods, we can use EXCEPT to see if there is any difference.
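A sketch of the check, subtracting one result set from the other:

```sql
SELECT first_name, department,
       (SELECT COUNT(*) FROM employees e2
         WHERE e2.department = e1.department) AS total_employees
FROM employees e1
EXCEPT
SELECT first_name, department,
       COUNT(*) OVER (PARTITION BY department) AS total_employees
FROM employees;
-- an empty result means both queries return identical rows
```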


This confirms that both queries return the same results.

How to use a sub-query in SQL to display data in segments

 We are given a table (fruit_imports) which shows details of imports of different fruits from various states in the USA.


Our task is to segregate fruits based on supply availability as per below criteria.

If supply is below 20000 then mark it as Short supply fruit

If supply is between 20000 and 50000 then mark it as Regular supply fruit

If supply is over 50000 then mark it as Excessive supply fruit

This will help the concerned department to decide on the price of fruits as well as arrange to place more orders for fruits that are in low supply.

We are looking for end result in 3 columns...

Fruit : names of all fruits

Total Supply : From the given table, get total supply of each fruit

Supply Category : SHORT , REGULAR ,EXCESSIVE

This is a very common scenario in many situations. Same query logic can be applied in all such situations.

For example, sale of certain item in segments, segments of customers based on order frequency, creating segments of inventory based on fast-low-regular movement, create groups of days in a year based on hot - average - cold temperature...the list goes on and on...

Since all the column names we require are to be included in the SELECT statement, that is a good start:

SELECT name AS fruit, SUM(supply) AS total_supply

and for categorization, we can use CASE statement as third column. 

We are going to try one category first (SHORT SUPPLY) and see if results are as per our expectation and then repeat the same process for other two categories

SELECT name AS fruit, SUM(supply) AS total_supply,
CASE
    WHEN SUM(supply) < 20000 THEN 'SHORT'
END AS supply_category
FROM fruit_imports
GROUP BY name


We can verify that when supply is less than 20000, the category is printed as SHORT. This looks good, so we can now repeat the process for the other two categories and complete the result.
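The completed query, with all three categories added:

```sql
SELECT name AS fruit, SUM(supply) AS total_supply,
CASE
    WHEN SUM(supply) < 20000 THEN 'SHORT'
    WHEN SUM(supply) BETWEEN 20000 AND 50000 THEN 'REGULAR'
    ELSE 'EXCESSIVE'
END AS supply_category
FROM fruit_imports
GROUP BY name;
```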


That's how we get our result. Sometimes, if the supply numbers are dynamic and can change, then instead of calculating SUM(supply) in the main SELECT query, we can also do it in a subquery under the FROM clause.

Here is an alternative way...
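A sketch of the alternative, computing the totals in a subquery under the FROM clause:

```sql
SELECT fruit, total_supply,
CASE
    WHEN total_supply < 20000 THEN 'SHORT'
    WHEN total_supply BETWEEN 20000 AND 50000 THEN 'REGULAR'
    ELSE 'EXCESSIVE'
END AS supply_category
FROM (SELECT name AS fruit, SUM(supply) AS total_supply
      FROM fruit_imports
      GROUP BY name) t;
```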


Note that in the main SELECT query, we are using the aliases for name and SUM(supply) that we defined in the subquery under the FROM clause.

Getting the same result either way.



How to use a conditional statement using CASE

Let's say we have a table that gives information about employees


We want to categorize employees based on hire_date.

Anyone who was hired before 2006-01-01 is a Super Skilled Employee.

Anyone hired between 2006-01-01 and 2012-01-01 is a Skilled Employee.

Anyone hired after 2012-01-01 is a New Employee.

Our task is to count them based on hire date and provide results in 3 columns.

Something like this...


We can accomplish this by using a CASE statement.

Let's try to derive the first column and see if it comes out as expected.
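A sketch of the first column, assuming the table is employees with a hire_date column:

```sql
-- CASE emits 1 for a match and 0 otherwise, so SUM counts the matches
SELECT SUM(CASE WHEN hire_date < '2006-01-01' THEN 1 ELSE 0 END) AS super_skilled
FROM employees;
```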


We are using 1 and 0 as the output so that we can use the SUM function to get the total for the super_skilled category.

This looks fine so we can now go ahead and add other 2 categories and get the desired result.
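The completed query with all three categories (same assumed names):

```sql
SELECT SUM(CASE WHEN hire_date < '2006-01-01' THEN 1 ELSE 0 END) AS super_skilled,
       SUM(CASE WHEN hire_date >= '2006-01-01'
                 AND hire_date < '2012-01-01' THEN 1 ELSE 0 END) AS skilled,
       SUM(CASE WHEN hire_date >= '2012-01-01' THEN 1 ELSE 0 END) AS new_employee
FROM employees;
```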


The counts in the end result should total 1000, as our table has 1000 employee records.

226+394+380 = 1000

This confirms that result is correct.

How to use sub-query to get required data from multiple tables

 When the end result we need requires us to take data from multiple tables and put it together in one table, this can be accomplished in 2 ways.

Either we join the tables using primary keys, or we use sub-queries.

Let's see how we can do it using sub-queries by following an example.

We are given 3 tables...

students table has data related to student age, student number and student name


courses table has data related to course_number, course_title and credits per course


student_enrollment table has data related to student_number and course_number


Consider following aspects about our data...

  • The students table is not directly related to the courses table. It just contains student details. 
  • The courses table just contains courses information. 
  • The table that relates both the students table and courses table is the student_enrollment table. Which student is enrolled in which course is captured in the student_enrollment table.
The puzzle to solve is...

Without joining tables, find the names of students who are taking the Physics and US History courses.

We can start by writing a query that gives us those 2 courses from the courses table.

Remember, the courses table only has course_no that we can use to get further information from the other tables.
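A sketch of the first step (the column names course_no and course_title are assumed from the description):

```sql
SELECT course_no
FROM courses
WHERE course_title IN ('Physics', 'US History');
```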



Using these course_no values, we can get student_no from the student_enrollment table.
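Nesting the previous query under WHERE (same assumed names):

```sql
SELECT student_no
FROM student_enrollment
WHERE course_no IN (SELECT course_no
                    FROM courses
                    WHERE course_title IN ('Physics', 'US History'));
```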


Now, using the student_no values returned by these queries, we can obtain student_name from the students table.
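Nesting once more gives the final query, with one subquery per table:

```sql
SELECT student_name
FROM students
WHERE student_no IN (
    SELECT student_no
    FROM student_enrollment
    WHERE course_no IN (SELECT course_no
                        FROM courses
                        WHERE course_title IN ('Physics', 'US History'))
);
```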


This is our end result.

Remember that we have not used any table joins; we got this result only by writing a sub-query under the WHERE clause for each table.

We can repeat this for as many tables as we need to get our desired result.

How to upload a csv file to PostgreSQL

 When you are given a spreadsheet to upload to SQL, you can follow this procedure.

First convert spreadsheet to .csv (comma delimited) file. Just open the file and save as csv (comma delimited) file.

Next, check how many columns there are in your spreadsheet, and then create those columns manually in SQL by following this process.


We have created the instacart database here. Under Schemas, we can see Tables. Right-click and select the CREATE TABLE option.

Just enter the table name that you want to create. We have used the table name orders here. Do not click SAVE yet.


Before you click SAVE, click on Columns in the top menu bar. Using the plus sign on the right, you can now add columns one by one.



Now we have created all the columns one by one. Just remember that the column names should exactly match those in the csv file. Otherwise, you will keep getting errors while uploading the data.



Another thing: there are options to set up column constraints by activating the Not NULL and Primary Key options. The Not NULL option will prevent any null values in the column, and Primary Key will allow only unique, non-null values in the column.

Do this ONLY IF you are sure that the data in the csv file is clean. Otherwise, do not activate these constraints, because you will keep getting errors while uploading the csv file.

Now you can click SAVE at the bottom of the page. Our table structure is ready.

Next, confirm it by running a SELECT command. We can see all columns were created correctly, but there is no data yet.

Now, we will upload the data from the csv file saved on our computer to this table.
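The import/export dialog shown below can also be expressed as a COPY command run from a query window. A sketch (the file path is a placeholder for your own; note that server-side COPY needs the file on the database server, while psql's \copy reads from your own machine):

```sql
COPY orders
FROM 'C:/data/orders.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
```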




Right-click on orders and select the Import/Export option.



Select following options from import/export page.

  • Turn the option to Import
  • Give the location of your order.csv file on your computer
  • Select the format as csv
  • If your csv file has a header row displaying column names, then set Header to Yes
  • And select the delimiter as "," because our csv file is comma delimited



Once you click OK, it will upload the csv file, and you will get a confirmation message once it is done.


Click on More details to see how many rows were uploaded and in how much time.


It shows 3.4 million rows uploaded in 16.93 seconds. Check your csv file and confirm it has the same number of rows; if so, all the data is uploaded. If not, check the data in the csv file for any issues.

Now we can view the data using a SELECT command.



