How to add a new dataset and upload a table into BigQuery
If you are using Google BigQuery, you can either select a dataset from BigQuery's public data or upload your own dataset file.
From the top left-hand corner, click on Add Data and select the public datasets option
You have access to a vast number of public datasets for practicing SQL skills
Once you select the public datasets option, it takes you to a page where you can search among many datasets (231 at the moment) and start querying them
To upload a dataset from an external source, click the three vertical dots next to your project and select Create dataset.
On the page that opens, the project ID should fill in automatically; enter the name you want to give your dataset under Dataset ID and click Create dataset
Now you can see your new dataset, customer_data
The next task is to upload a dataset file from your computer into this dataset: click the three vertical dots beside customer_data and select the Create table option
Then, on the next page, you can upload your table from your computer and give it a name (we have named it customer_address)
If it asks you to create a schema, select the Auto detect option
And now we can see our table uploaded in the dataset customer_data
You can preview it and open a new query to start querying your data.
A general query to take a look at the first 10 rows of this table is given below
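The "first 10 rows" check is just a SELECT with a LIMIT. Here is a minimal sketch run against a local SQLite table; the table and column names are stand-ins for your BigQuery table, which you would reference as `project.dataset.table`:

```python
import sqlite3

# Local stand-in for the uploaded customer_address table (rows invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_address (customer_id INTEGER, city TEXT);
INSERT INTO customer_address VALUES
  (1,'New York'),(2,'Boston'),(3,'Chicago'),(4,'Austin'),(5,'Denver'),
  (6,'Seattle'),(7,'Miami'),(8,'Dallas'),(9,'Phoenix'),(10,'Atlanta'),(11,'Reno');
""")
# LIMIT caps the result at 10 rows no matter how large the table is.
rows = conn.execute("SELECT * FROM customer_address LIMIT 10").fetchall()
print(len(rows))
```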
How to upload a dataset and table to BigQuery by adding data types manually
In the previous post, we went through uploading a table to BigQuery by selecting the Auto detect option in the schema section.
In this post, we will see how to add data types manually while uploading a table.
This time, instead of selecting Auto detect, we select the Edit as text option
So basically, instead of letting the system detect the schema automatically, we are going to write the schema manually
Here is a snapshot of the CSV table that we are going to upload
For columns with numerical values (without a decimal point), we will select the integer data type
For columns with text values, we will select the string data type
For columns with dates, we will select the datetime data type
For columns with numerical values (with a decimal point), we will select the float data type
Then we type the schema into the Edit as text box
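For reference, the Edit as text box takes the schema as a comma-separated list of column:type pairs. The column names below are illustrative, not from the actual file:

```
customer_id:INTEGER,first_name:STRING,signup_date:DATETIME,balance:FLOAT
```

A JSON array of field objects is also accepted in the same box.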
Once we click Create table, our table is uploaded
Our desired table is uploaded, and we can see the first few rows by clicking Preview
How to clean data using SQL
What are the possible causes of bad data that we may have to take into account, so we can focus our cleaning effort and make the data sparkling clean?
- Duplicates
- Truncated data
- Extra spaces and characters
- Null/missing data
- Mistyped numbers
- Inconsistent strings
- Inconsistent date formats
- Misleading column names
- Mismatched data types
- Misspelt words
______________________________________________
Using DISTINCT, clean duplicates and inconsistent values
Let's say a product comes in only 4 colours, but you find that there are 6 colours in the colour column of the product table
SELECT DISTINCT colour
FROM product
Or, let's say the colour column holds numbers from 0 to 9 representing different colours of the product, and you want to verify there are no double-digit values. Use the LENGTH function (LEN in some dialects, such as SQL Server) to find the length of the values in the column
SELECT LENGTH(colour)
FROM product
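Both checks can be run end to end on a toy copy of the table. This sketch uses SQLite with invented rows, where one colour value is an unexpected double digit:

```python
import sqlite3

# Toy product table with one bad double-digit colour code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER, colour TEXT);
INSERT INTO product VALUES (1,'3'),(2,'7'),(3,'7'),(4,'12');
""")
# DISTINCT collapses duplicates and surfaces every unique value.
distinct_colours = [r[0] for r in conn.execute("SELECT DISTINCT colour FROM product")]
# LENGTH flags any value longer than one character.
bad = conn.execute("SELECT colour FROM product WHERE LENGTH(colour) > 1").fetchall()
print(distinct_colours, bad)
```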
_____________________________________
Using IS NULL, find the null values in any column
SELECT first_name,last_name
FROM customer
WHERE last_name IS NULL
This query will give us all first names where last names are missing
______________________________________________
Using SUBSTR, find the odd value out
SELECT SUBSTR(city,1,3)
FROM table_name
This selects only the first three characters from the city column
______________________________________________
Using TRIM, remove unwanted spaces from a string in a column
Below action will remove all leading and trailing spaces from first_name and last_name columns in customer table
UPDATE customer
SET
first_name = TRIM(first_name),
last_name = TRIM(last_name);
______________________________________________
Once you find the correct last name, you can replace it using an UPDATE statement
UPDATE customer
SET last_name = 'Smith'
WHERE first_name = 'Roger'
If there is more than one Roger in the list, you can use customer_id instead of first_name to make this correction
______________________________________________
Often, we are provided with a description of the data where we can see the max and min values (range) of a column
Check using MAX(column_name) and MIN(column_name) to ensure that the data does not have incorrect values
For example, if we know that age column in the students table can have values between 10 and 12, then using MAX(age) and MIN(age) we can find if there are any incorrect values
______________________________________________
Once we know there are values below or above the expected range, we can use the COUNT function to find out how many students in the list have an incorrect age
SELECT COUNT(*)
FROM students
WHERE age = 13 OR age = 9
______________________________________________
If a significant number of values in a column appear incorrect, consult the data source for further guidance: should we correct them (find the correct data, or replace incorrect values with the column average) or delete them from the table with a DELETE statement?
DELETE FROM students
WHERE age = 13 OR age = 9
______________________________________________
If 'age' column has been assigned string data type but the column has numbers, then using CAST function, we can change the data type of column 'age'
SELECT CAST(age AS integer)
FROM students
ORDER BY CAST(age AS integer)
______________________________________________
Sometimes we need to combine the data of 2 columns into 1 column; for that we can use CONCAT
For example, date and time are in separate columns and we want to combine them into one column.
SELECT CONCAT(column1, column2) AS new_column_name
FROM table_name
______________________________________________
Using COALESCE, we can avoid null values in a column by returning the first non-null value from a list of columns
In the product table, we have the product code in one column and the product name in another, and in one row the product name is available but the product code is missing. We have been told that the product code is not available at the moment, but we can't remove the row, as it is an important product.
So whenever data is pulled for the product code, we want to show the product name instead of a null value
SELECT COALESCE(product_code,product_name)
FROM product
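A quick runnable sketch of this behavior, with an invented two-row product table where one product code is missing:

```python
import sqlite3

# COALESCE returns the first non-null argument for each row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_code TEXT, product_name TEXT);
INSERT INTO product VALUES ('P100','Widget'), (NULL,'Gadget');
""")
result = conn.execute(
    "SELECT COALESCE(product_code, product_name) FROM product"
).fetchall()
print(result)  # [('P100',), ('Gadget',)]
```

The row with a NULL product code falls back to the product name instead of showing NULL.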
______________________________________________
Using a CASE statement, we can replace incorrect strings with corrected ones
SELECT customer_id,
CASE
WHEN first_name = 'Jonh' THEN 'John'
WHEN first_name = 'Rajeev' THEN 'Rajiv'
ELSE first_name
END AS corrected_name
FROM table_name
_________________________________________
HOW TO WRITE A SIMPLE SQL QUERY
There is a public dataset on the Kaggle website which gives Citi Bike usage in New York. The bike-sharing company has 2 types of users: subscribers (most likely commuters who need a bike every day from one station to another) and customers (those who rent a bike as and when needed).
I am trying to find out which routes are most popular with each user type. I would like to see the end result in different columns.
This is what I have in mind to accomplish as end result
This is the query I wrote in SQL
I am combining the start and end station names using the CONCAT function and calling the column route
I am counting rows and calling the column num_trips, as it shows the frequency of trips
I found that the tripduration column has numerical values but its data type is string, so I am converting it from string to INT using the CAST function. Also, I am looking for the average trip duration in minutes, so after using the AVG function I divide the result by 60, since trip duration is recorded in seconds
I am removing any rows where trip duration is not available using a WHERE tripduration IS NOT NULL clause
And finally, I am grouping and ordering the result using the GROUP BY and ORDER BY clauses
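The steps above can be sketched as one runnable query. This uses a tiny local SQLite table with invented rows; the column names follow the public citibike_trips table, and BigQuery would use CONCAT rather than the || operator shown here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citibike_trips (
  usertype TEXT, start_station_name TEXT, end_station_name TEXT, tripduration TEXT);
INSERT INTO citibike_trips VALUES
  ('Subscriber','A','B','600'), ('Subscriber','A','B','720'),
  ('Customer','B','C','300'),   ('Subscriber','A','B',NULL);
""")
rows = conn.execute("""
SELECT usertype,
       start_station_name || ' to ' || end_station_name AS route,  -- CONCAT in BigQuery
       COUNT(*) AS num_trips,
       ROUND(AVG(CAST(tripduration AS INT)) / 60, 2) AS avg_duration_minutes
FROM citibike_trips
WHERE tripduration IS NOT NULL       -- drop rows with missing duration
GROUP BY usertype, route
ORDER BY num_trips DESC
""").fetchall()
print(rows)
```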
This is a simple SQL query, and the result is below
_________________________________________
EXAMPLE of a sub-query in SQL (under the FROM clause)
There is a public dataset for New York citibike that I have used for this example.
This dataset has 2 tables- One is named as citibike_stations and the other is named as citibike_trips
Our requirement is to get data for number of rides starting at each station so we know which is the busiest station for bike business and accordingly bike availability can be maintained.
We have something like this in mind for end result.
station_id is a common column across the two tables, but only citibike_trips records it as start_station_id.
So here is the general idea for querying...
From citibike_trips table, get start_station_id and using COUNT function, get the number of trips for them.
Something like this...
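A runnable sketch of that idea, with invented rows for the two tables: the sub-query under FROM counts trips per start station, and the outer query attaches station names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citibike_stations (station_id INTEGER, name TEXT);
CREATE TABLE citibike_trips (trip_id INTEGER, start_station_id INTEGER);
INSERT INTO citibike_stations VALUES (1,'Central Park'),(2,'Times Sq');
INSERT INTO citibike_trips VALUES (10,1),(11,1),(12,2);
""")
rows = conn.execute("""
SELECT s.name, t.num_trips
FROM citibike_stations AS s
JOIN (SELECT start_station_id, COUNT(*) AS num_trips
      FROM citibike_trips
      GROUP BY start_station_id) AS t   -- sub-query under FROM
  ON s.station_id = t.start_station_id
ORDER BY t.num_trips DESC
""").fetchall()
print(rows)  # busiest station first
```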
Example of How to build a complex query in SQL
Dataset name: warehouse_orders
Table names: orders, warehouse
I have uploaded these tables into BigQuery, and here are the contents of these tables
Orders table:
- No of orders fulfilled can be obtained using COUNT function from orders table
- Total number of orders can be obtained from orders table using SUM function
- Warehouse name can be obtained from the warehouse table (it is already available in the column named warehouse_alias)
- Warehouse_id is in both tables so we can join both tables using this column
Let us join the tables in the first query so we get the warehouse_id and warehouse name columns for our end result
SELECT
warehouse.warehouse_id,
warehouse.warehouse_alias
FROM
`my-sql-project-336018.warehouse_orders.warehouse` AS warehouse
LEFT JOIN
`my-sql-project-336018.warehouse_orders.orders` AS orders
ON
warehouse.warehouse_id = orders.warehouse_id
I am joining tables with LEFT JOIN...why?
Because I want to ensure that every warehouse_id in the warehouse table is included, along with the corresponding data from the orders table where it exists.
Which means, if there are warehouse_ids in the orders table but not in the warehouse table, they are not required for our task.
Also, always check that the sub-query runs successfully on its own, because we are going to use its result in the main query. We get the result below when we run this sub-query.
Next, let us find out number of orders by using COUNT function in orders table.
COUNT(orders.order_id) AS number_of_orders_fulfilled
And since we are using aggregation, we add a GROUP BY clause too...
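Putting the join, the COUNT and the GROUP BY together, the query so far looks roughly like this. The sketch runs on a tiny local SQLite copy of the two tables with invented rows; the real query uses the full BigQuery table paths shown earlier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse (warehouse_id INTEGER, warehouse_alias TEXT);
CREATE TABLE orders (order_id INTEGER, warehouse_id INTEGER);
INSERT INTO warehouse VALUES (1,'Lakeside'),(2,'Downtown'),(3,'New Site');
INSERT INTO orders VALUES (101,1),(102,1),(103,2);
""")
rows = conn.execute("""
SELECT w.warehouse_id, w.warehouse_alias,
       COUNT(o.order_id) AS number_of_orders_fulfilled
FROM warehouse AS w
LEFT JOIN orders AS o ON w.warehouse_id = o.warehouse_id
GROUP BY w.warehouse_id, w.warehouse_alias
""").fetchall()
print(rows)  # a warehouse with no orders shows a count of 0
```

Because of the LEFT JOIN, a warehouse with no matching orders still appears, with COUNT(o.order_id) returning 0.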
Some warehouses have zero fulfilled orders. When we checked the details of the data, it said some warehouses are newly built and not yet in operation, so we will have to remove those from the list.
If we spot-check a couple of samples to make sure the result is right: out of 9999 total orders, when 548 are fulfilled, that is 5.48% fulfillment.
And to remove warehouses with zero orders, we can filter them out of the result (for example, with a HAVING clause).
______________________________________________
First, let us create a new database named walmart.
We are trying to create a departments table for Walmart, as it has many departments and each department is under a division.
Now we can see that the database has been created
Next, we will create a table under this database. From the dropdown menu at the top, select Tools and then Query Tool
It opens a new window where we can write SQL code. We are going to name this table departments, and we want 2 columns in it: department and division.
Another important thing to note is that we want the department column to be the primary key. There can be only 1 primary key, and it acts as a constraint on the column: it prevents any duplicate or null values, as it is a unique key.
A good practice is to keep some margin for future use.
Maybe the list you have right now shows only short department names, but if a new department with a longer name is created in the future, you should have enough room in your column to allow it.
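The table described above can be sketched in plain SQL like this. The column sizes and the sample departments are assumptions for illustration; SQLite is used here only to make the sketch runnable, and the same statements work in PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
  department VARCHAR(100) PRIMARY KEY,  -- unique key, room for long names
  division   VARCHAR(100)
);
-- example rows; the real department list comes from the prepared data
INSERT INTO departments VALUES
  ('Produce','Grocery'),
  ('Bakery','Grocery'),
  ('Toys','General Merchandise');
""")
rows = conn.execute("SELECT * FROM departments ORDER BY department").fetchall()
print(rows)
```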
And we can see under the walmart database schema that the table has been created.
At the moment it is an empty table without any entries in it.
Next, we will add data to these columns using the INSERT statements we have already prepared.
We will always confirm after executing the query that it was successful.
And if we want to see how the departments table looks, we can check that now.
Using this, we can create as many tables as we need under this database.
Simple explanation of how grouping works in SQL
To get a general idea of how grouping works in SQL, let's quickly create a small table and then use GROUP BY to see what happens.
We have now created fruits table with some entries in it.
This is how the table looks...
We have intentionally added some null entries to the column to see how they are grouped.
Data provided to us often includes null values, so let's see how grouping treats them.
The GROUP BY clause tells us how many groups form in the column. There are 4 groups: Orange, Mango, Apple and NULL
If we further add the COUNT function, we can see how many items are in each group...
Now we can clearly see how many records are in each group.
It is important to note that NULL is treated as its own group, and here it has 3 records.
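The whole experiment fits in a few lines. The fruit names are examples; the point is that the three NULL rows collapse into a single group:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fruits (name TEXT);
INSERT INTO fruits VALUES
  ('Orange'),('Orange'),('Mango'),('Apple'),('Apple'),('Apple'),
  (NULL),(NULL),(NULL);
""")
# COUNT(*) counts rows, so the NULL group is counted too
# (COUNT(name) would report 0 for it, since COUNT(column) skips NULLs).
rows = conn.execute("""
SELECT name, COUNT(*) AS cnt
FROM fruits
GROUP BY name
""").fetchall()
print(rows)
```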
____________________________________________
How to upload a CSV file to PostgreSQL
When a spreadsheet is given to you to upload to SQL, you can follow this procedure.
First, convert the spreadsheet to a .csv (comma-delimited) file: just open the file and save it as a CSV (comma-delimited) file.
Next, check how many columns there are in your spreadsheet, and then create those columns manually in SQL by following this process.
We have created instacart database here. Under Schemas, we can see tables. Right click and select CREATE TABLE option.
Just enter table name that you want to create. We have used here table name as orders. Do not click SAVE yet.
Before you click SAVE, click on columns from top menu bar. Using plus sign on the right, you can now add more columns one by one.
Now we have created all the columns one by one. Just remember that the column names should match those in the CSV file exactly; otherwise you will keep getting errors while uploading the data.
Another thing: there are options to set up column constraints by activating the Not NULL and Primary Key options. The Not NULL option will prevent any null values in the column, and the primary key will allow only unique, non-null values.
Do this ONLY IF you are sure that the data in the CSV file is clean. Otherwise, do not activate these constraints, because you will keep getting errors while uploading the CSV file.
Now you can click SAVE at the bottom of the page. Our table structure is ready.
Next, confirm it by running the SELECT command. We can see all columns created correctly but there is no data yet.
Now, we will upload data from csv file that is saved to our computer to this table.
Right click on orders and select import/export option
Select following options from import/export page.
- Turn the option to import
- Give the location of your order.csv file on your computer
- Select the format as csv
- If your csv file has header row displaying column names then select header as Yes
- And select the delimiter as "," because our CSV file is a comma-delimited file.
Once you click OK, it uploads the CSV file, and you will get a confirmation message once it is done.
Click on More details to see how many rows were uploaded and how long it took
It shows 3.4 million rows uploaded in 16.93 seconds. Check your CSV file and confirm it has the same number of rows; if so, all the data is uploaded. If not, check the data in the CSV file for issues.
Now we can view the data using SELECT command
____________________________
How to use sub-query to get required data from multiple tables
When the end result we need requires us to take data from multiple tables and put it together in one table, this can be accomplished in 2 ways:
either we join the tables using keys, or we use a sub-query.
Let's see how can we do it using sub-query by following an example.
We are given 3 tables...
students table has data related to student age, student number and student name
courses table has data related to course_number, course_title and credits per course
student_enrollment table has data related to student_number and course_number
- The students table is not directly related to the courses table. It just contains student details.
- The courses table just contains courses information.
- The table that relates both the students table and courses table is the student_enrollment table. Which student is enrolled in which course is captured in the student_enrollment table.
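A typical sub-query over these three tables: find the names of students enrolled in a given course by looking up through student_enrollment. The course title and student data below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_number INTEGER, student_name TEXT, age INTEGER);
CREATE TABLE courses (course_number TEXT, course_title TEXT, credits INTEGER);
CREATE TABLE student_enrollment (student_number INTEGER, course_number TEXT);
INSERT INTO students VALUES (1,'Asha',11),(2,'Ben',12),(3,'Carl',10);
INSERT INTO courses VALUES ('CS110','Databases',3),('MA101','Algebra',4);
INSERT INTO student_enrollment VALUES (1,'CS110'),(2,'CS110'),(3,'MA101');
""")
# Nested sub-queries walk from courses -> enrollment -> students.
rows = conn.execute("""
SELECT student_name
FROM students
WHERE student_number IN (SELECT student_number
                         FROM student_enrollment
                         WHERE course_number IN (SELECT course_number
                                                 FROM courses
                                                 WHERE course_title = 'Databases'))
""").fetchall()
print(rows)  # students enrolled in Databases
```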
How to use conditional statement using CASE
Let's say we have a table that gives information about employees
We want to categorize employees based on hire_date.
Anyone who was hired before 2006-01-01 is Super skilled Employee
Anyone hired between 2006-01-01 and 2012-01-01 is Skilled Employee
Anyone hired after 2012-01-01 is New Employee
Our task is to count them based on hire date and provide results in 3 columns.
Something like this...
We can accomplish this by using CASE statement.
Let's try to derive the first column and see if it comes out as expected.
We are using outputs of 1 and 0 so we can use the SUM function to get the total for the super_skilled category
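The same SUM-over-CASE pattern, extended to all three categories, can be sketched like this. The hire dates are invented, and string comparison works here because the dates are in ISO format:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, hire_date TEXT);
INSERT INTO employees VALUES
  (1,'2004-05-01'),(2,'2008-03-15'),(3,'2015-07-20'),(4,'2011-12-31');
""")
# Each CASE emits 1 when the row falls in the category, 0 otherwise,
# so SUM gives the per-category head count in a single row.
row = conn.execute("""
SELECT SUM(CASE WHEN hire_date < '2006-01-01' THEN 1 ELSE 0 END) AS super_skilled,
       SUM(CASE WHEN hire_date >= '2006-01-01'
                 AND hire_date < '2012-01-01' THEN 1 ELSE 0 END) AS skilled,
       SUM(CASE WHEN hire_date >= '2012-01-01' THEN 1 ELSE 0 END) AS new_employee
FROM employees
""").fetchone()
print(row)  # (1, 2, 1)
```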
This looks fine so we can now go ahead and add other 2 categories and get the desired result.
The end result counts should total 1000, as our table has 1000 employee records.
226 + 394 + 380 = 1000
This confirms that the result is correct.
_________________________________________
How to use sub-query in SQL to display data in segments
We are given a table (fruit_imports) which shows us details of imports of different fruits from various states in USA.
Our task is to segregate fruits based on supply availability as per below criteria.
If supply is below 20000 then mark it as Short supply fruit
If supply is between 20000 and 50000 then mark it as Regular supply fruit
If supply is over 50000 then mark it as Excessive supply fruit
This will help the concerned department to decide on the price of fruits as well as arrange to place more orders for fruits that are in low supply.
We are looking for end result in 3 columns...
Fruit : names of all fruits
Total Supply : From the given table, get total supply of each fruit
Supply Category : SHORT, REGULAR, EXCESSIVE
This is a very common scenario in many situations. Same query logic can be applied in all such situations.
For example, sale of certain item in segments, segments of customers based on order frequency, creating segments of inventory based on fast-low-regular movement, create groups of days in a year based on hot - average - cold temperature...the list goes on and on...
Since all column names we require are to be included in SELECT statement, that would be a good start
SELECT name AS fruit, SUM(supply) AS total_supply
and for categorization, we can use CASE statement as third column.
We are going to try one category first (SHORT SUPPLY) and see if results are as per our expectation and then repeat the same process for other two categories
SELECT name AS fruit, SUM(supply) AS total_supply, CASE
WHEN SUM(supply) < 20000 THEN 'SHORT'
END
AS supply_category
FROM fruit_imports
GROUP BY name
We can verify that when supply is less than 20000 then category is printed as SHORT. This looks good so we can now repeat the process for other two categories and complete the result.
That's how we get our result. Sometimes, if the supply numbers are dynamic and can change, then instead of calculating SUM(supply) in the main SELECT query, we can also do it in a sub-query under the FROM clause
Here is an alternative way...
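A runnable sketch of that alternative, with invented import rows; the inner query aggregates, and the outer query only categorizes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fruit_imports (name TEXT, state TEXT, supply INTEGER);
INSERT INTO fruit_imports VALUES
  ('Apple','WA',30000),('Apple','NY',25000),
  ('Kiwi','CA',12000),('Mango','FL',60000);
""")
rows = conn.execute("""
SELECT fruit, total_supply,
       CASE WHEN total_supply < 20000 THEN 'SHORT'
            WHEN total_supply BETWEEN 20000 AND 50000 THEN 'REGULAR'
            ELSE 'EXCESSIVE'
       END AS supply_category
FROM (SELECT name AS fruit, SUM(supply) AS total_supply
      FROM fruit_imports
      GROUP BY name)                         -- sub-query under FROM
""").fetchall()
print(rows)
```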
Note that in the main SELECT query, we are using the aliases for name and SUM(supply) that we defined in the sub-query under the FROM clause.
Getting the same result either way.
_________________________________________
Difference between a correlated subquery and a window function
We are given employees table as below.
And our requirement is to get an output in 3 columns. First Name, Department and total no. of employees per department.
One way to achieve this is by using a correlated sub-query.
Something like this...
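A sketch of such a correlated sub-query, on a toy employees table with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (first_name TEXT, department TEXT);
INSERT INTO employees VALUES
  ('Ana','Sales'),('Bob','Sales'),('Cy','IT');
""")
rows = conn.execute("""
SELECT first_name, department,
       (SELECT COUNT(*)
        FROM employees AS e2
        WHERE e2.department = e1.department) AS dept_count  -- re-runs per outer row
FROM employees AS e1
""").fetchall()
print(rows)  # [('Ana', 'Sales', 2), ('Bob', 'Sales', 2), ('Cy', 'IT', 1)]
```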
We are relating the inner query (sub-query) to the outer query on the department column, so the inner query is executed for every record in the outer query.
If the employees table has 1000 records, the inner query runs 1000 times. Understandably, this is a very slow process for large tables, and an expensive one too, as most companies these days pay for cloud usage by time used.
An alternative way to do this is with the window function OVER( )
The advantage is that it is much quicker, and we can group data by any column without using a GROUP BY clause.
We can use OVER( ) with any aggregate function, like SUM, AVG, MAX or MIN.
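The same department count with a window function, on the same kind of toy table; the table is scanned once instead of once per row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (first_name TEXT, department TEXT);
INSERT INTO employees VALUES
  ('Ana','Sales'),('Bob','Sales'),('Cy','IT');
""")
rows = conn.execute("""
SELECT first_name, department,
       COUNT(*) OVER (PARTITION BY department) AS dept_count  -- no GROUP BY needed
FROM employees
""").fetchall()
print(rows)
```

Every row is kept (unlike GROUP BY, which collapses rows), with the per-department count attached to each.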
And if we want to ensure that all records are returned in both methods, we can use EXCEPT to see if any difference.
This confirms that both are getting same results
_________________________________________
How to find duplicate values in SQL using window function
We are given email table which includes user_id, user_name and email
and our task is to find out duplicate record from this table.
How can we get this result using a window function?
Idea: assign a row number to each record with the ROW_NUMBER() window function, partitioned by user_name (which should be unique for each user). If a user_name is duplicated, its partition contains more than one row, so the same user_name ends up with more than one row number.
Something like this...
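A runnable sketch of the idea, with invented rows that include the duplicate user_name from the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email (user_id INTEGER, user_name TEXT, email TEXT);
INSERT INTO email VALUES
  (1,'Asha','asha@x.com'),(2,'Keeny','k@x.com'),(3,'Keeny','k2@x.com');
""")
rows = conn.execute("""
SELECT user_name
FROM (SELECT user_name,
             ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY user_id) AS rn
      FROM email)
WHERE rn > 1          -- any row numbered 2 or higher is a duplicate
""").fetchall()
print(rows)  # [('Keeny',)]
```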
And now it is an easy task to pick out the rows whose row number is greater than 1.
So we know that Keeny is a duplicate user_name.
Using this method we can find any duplicates in the column. Just have to make sure that the column that we are taking as a reference (user_name in this case) is supposed to have unique records.
In this table, user_id is unique and user_name is also supposed to be unique. We checked user_id for duplicates first and found none, so we checked the other unique column and found a duplicate.
If we had not found a duplicate using user_name, we could also check the email column, as each user_name is supposed to have a unique email address.
_________________________________________
How to use LEAD function in SQL
We are given a table with details of login_id, user_name and login_date. This data is from an e-commerce portal where we want to know if a user logged in 3 times or more in a day.
We can see that Stewart logged in 3 times on Feb 16 hence that is our expected result.
Idea:
If the user name in the current row equals the user name in the next row (LEAD) AND the user name in the current row equals the user name in the row after that, AND the same conditions hold for login_date, then that is the user name we are looking for
This is how we can accomplish it in a query
You can see here we are using LEAD(user_name,1) for the next row and LEAD(user_name,2) for the next to next row.
Now, since the hard part is already completed, it is a matter of using WHERE to choose everything in the repeat_user column that is NOT NULL to get our answer.
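The whole approach can be sketched as follows. The table name, ordering columns and sample rows are assumptions for illustration; the key part is comparing each row against the next two rows with LEAD:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (login_id INTEGER, user_name TEXT, login_date TEXT);
INSERT INTO logins VALUES
  (1,'Stewart','2022-02-16'),(2,'Stewart','2022-02-16'),
  (3,'Stewart','2022-02-16'),(4,'Ava','2022-02-16');
""")
rows = conn.execute("""
SELECT DISTINCT repeat_user
FROM (
  SELECT CASE
           WHEN user_name  = LEAD(user_name,1)  OVER (ORDER BY user_name, login_date, login_id)
            AND user_name  = LEAD(user_name,2)  OVER (ORDER BY user_name, login_date, login_id)
            AND login_date = LEAD(login_date,1) OVER (ORDER BY user_name, login_date, login_id)
            AND login_date = LEAD(login_date,2) OVER (ORDER BY user_name, login_date, login_id)
           THEN user_name
         END AS repeat_user      -- NULL unless 3 consecutive matches
  FROM logins
)
WHERE repeat_user IS NOT NULL
""").fetchall()
print(rows)  # [('Stewart',)]
```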
____________________________
When to use self join in SQL
Requirement:
Find the acceptance rate of Facebook friend requests by date. The date to be used for the calculation is the date on which the request was sent.
We have a dataset of Facebook friend requests.
Let us list down steps we need to take to accomplish the intended result.
- Find total number of requests accepted (A)
- Find total number of requests sent (B)
- Divide A/B to get the results
Now, at this point we have 2 options to join these two results. Either by creating CTE or by subquery.
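Here is a CTE-based sketch of those steps. The schema is an assumption (one table logging both 'sent' and 'accepted' actions per sender/receiver pair, with invented rows); the real dataset may split these into separate tables, but the join logic is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friend_requests (sender_id INT, receiver_id INT, action TEXT, action_date TEXT);
INSERT INTO friend_requests VALUES
  (1,2,'sent','2022-01-04'),(1,2,'accepted','2022-01-05'),
  (3,4,'sent','2022-01-04'),
  (5,6,'sent','2022-01-05'),(5,6,'accepted','2022-01-05');
""")
rows = conn.execute("""
WITH sent AS (
  SELECT action_date AS sent_date, sender_id, receiver_id
  FROM friend_requests WHERE action = 'sent'
), accepted AS (
  SELECT sender_id, receiver_id
  FROM friend_requests WHERE action = 'accepted'
)
SELECT s.sent_date,
       ROUND(1.0 * COUNT(a.sender_id) / COUNT(*), 2) AS acceptance_rate
FROM sent AS s
LEFT JOIN accepted AS a
  ON s.sender_id = a.sender_id AND s.receiver_id = a.receiver_id
GROUP BY s.sent_date           -- rate is attributed to the SENT date
""").fetchall()
print(rows)
```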
Complex query example
Requirement:
You are given the table with titles of recipes from a cookbook and their page numbers. You are asked to represent how the recipes will be distributed in the book.
Produce a table consisting of three columns: left_page_number, left_title and right_title. The k-th row (counting from 0) should contain the number and the title of the page numbered 2*k in the first and second columns respectively, and the title of the page numbered 2*k+1 in the third column.
Each page contains at most 1 recipe. If the page does not contain a recipe, the appropriate cell should remain empty (NULL value). Page 0 (the internal side of the front cover) is guaranteed to be empty.
But what if in future there will be more pages added to this recipe book? In that case, instead of hardcoding page numbers from 0 to 15, we can let the subquery decide how many page numbers are needed.
Now we have the 2 columns we required, and the task from here is to add a 3rd column using a conditional CASE statement. Since our desired output has only even page numbers in the first column, we can use whether a page number is odd or even as the criterion for building the columns.
Now we have segregated the recipes into left and right titles, but we still have 16 rows, while our desired output has only 8. We can use GROUP BY to reduce the number of rows from 16 to 8 while keeping the data in the left_title and right_title columns.
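All of those steps together can be sketched as one query. The cookbook rows are invented (3 recipes, so 6 pages rather than 16), and a recursive CTE generates the page numbers so the range follows the data instead of being hardcoded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cookbook (page_number INTEGER, title TEXT);
INSERT INTO cookbook VALUES (1,'Pancakes'),(2,'Omelette'),(5,'Salad');
""")
rows = conn.execute("""
WITH RECURSIVE pages(p) AS (
  SELECT 0
  UNION ALL
  SELECT p + 1 FROM pages
  WHERE p < (SELECT MAX(page_number) FROM cookbook)  -- let the data set the range
)
SELECT MIN(CASE WHEN p % 2 = 0 THEN p END)     AS left_page_number,
       MAX(CASE WHEN p % 2 = 0 THEN title END) AS left_title,   -- even page
       MAX(CASE WHEN p % 2 = 1 THEN title END) AS right_title   -- odd page
FROM pages
LEFT JOIN cookbook ON page_number = p   -- pages with no recipe stay NULL
GROUP BY p / 2                          -- pair up pages 2k and 2k+1
""").fetchall()
print(rows)
```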