How to add a new dataset and upload a table into BigQuery
If you are using Google BigQuery, you can either select a dataset from BigQuery's public data or upload your own dataset file.
From the top left-hand corner, click on Add Data and select the public datasets option
You have access to a vast number of public datasets for practicing SQL skills
Once you select the public datasets option, it takes you to a page where you can search among many datasets (231 at the moment) and start querying them
To upload a dataset from an external source, click the three vertical dots next to your project and select Create dataset.
On the page that opens, the project ID should fill in automatically; enter the name you want to give your dataset under Dataset ID and click Create dataset
Now you can see your new dataset, customer_data
The next task is to upload a dataset file from your computer into this dataset: click the three vertical dots beside customer_data and select the Create table option
Then, on the next page, you can upload your table from your computer and give it a name (we have named it customer_address)
If it asks you to create a schema, select the Auto detect option
And now we can see our table uploaded in the dataset customer_data
You can preview it and open a new query to start querying your data.
A general query to take a look at the first 10 rows of this table is given below
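The "first 10 rows" check is just a SELECT with a LIMIT. Here is a minimal sketch run against a local SQLite table; the table and column names are stand-ins for your BigQuery table, which you would reference as `project.dataset.table`:

```python
import sqlite3

# Local stand-in for the uploaded customer_address table (rows invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_address (customer_id INTEGER, city TEXT);
INSERT INTO customer_address VALUES
  (1,'New York'),(2,'Boston'),(3,'Chicago'),(4,'Austin'),(5,'Denver'),
  (6,'Seattle'),(7,'Miami'),(8,'Dallas'),(9,'Phoenix'),(10,'Atlanta'),(11,'Reno');
""")
# LIMIT caps the result at 10 rows no matter how large the table is.
rows = conn.execute("SELECT * FROM customer_address LIMIT 10").fetchall()
print(len(rows))
```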
How to upload a dataset and table to BigQuery by adding data types manually
In the previous post, we went through uploading a table to BigQuery by selecting the Auto detect option in the schema section.
In this post, we will see how to add data types manually while uploading a table.
This time, instead of selecting Auto detect, we select the Edit as text option
So basically, instead of letting the system detect the schema automatically, we are going to write the schema manually
Here is a snapshot of the CSV table that we are going to upload
For columns with numerical values (without a decimal point), we will select the integer data type
For columns with text values, we will select the string data type
For columns with dates, we will select the datetime data type
For columns with numerical values (with a decimal point), we will select the float data type
Then we type the schema into the Edit as text box
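For reference, the Edit as text box takes the schema as a comma-separated list of column:type pairs. The column names below are illustrative, not from the actual file:

```
customer_id:INTEGER,first_name:STRING,signup_date:DATETIME,balance:FLOAT
```

A JSON array of field objects is also accepted in the same box.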
Once we click Create table, our table is uploaded
Our desired table is uploaded, and we can see the first few rows by clicking Preview
How to clean data using SQL
What are the possible causes of bad data that we may have to take into account, so we can focus our cleaning effort and make the data sparkling clean?
- Duplicates
- Truncated data
- Extra spaces and characters
- Null/missing data
- Mistyped numbers
- Inconsistent strings
- Inconsistent date formats
- Misleading column names
- Mismatched data types
- Misspelt words
______________________________________________
Using DISTINCT, clean duplicates and inconsistent values
Let's say a product comes in only 4 colours, but you find that there are 6 colours in the colour column of the product table
SELECT DISTINCT colour
FROM product
Or, let's say the colour column holds numbers from 0 to 9 representing different colours of the product, and you want to verify there are no double-digit values. Use the LENGTH function (LEN in some dialects, such as SQL Server) to find the length of the values in the column
SELECT LENGTH(colour)
FROM product
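Both checks can be run end to end on a toy copy of the table. This sketch uses SQLite with invented rows, where one colour value is an unexpected double digit:

```python
import sqlite3

# Toy product table with one bad double-digit colour code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER, colour TEXT);
INSERT INTO product VALUES (1,'3'),(2,'7'),(3,'7'),(4,'12');
""")
# DISTINCT collapses duplicates and surfaces every unique value.
distinct_colours = [r[0] for r in conn.execute("SELECT DISTINCT colour FROM product")]
# LENGTH flags any value longer than one character.
bad = conn.execute("SELECT colour FROM product WHERE LENGTH(colour) > 1").fetchall()
print(distinct_colours, bad)
```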
_____________________________________
Using IS NULL, find the null values in any column
SELECT first_name,last_name
FROM customer
WHERE last_name IS NULL
This query will give us all first names where last names are missing
______________________________________________
Using SUBSTR, find the odd value out
SELECT SUBSTR(city,1,3)
FROM table_name
This selects only the first three characters from the city column
______________________________________________
Using TRIM, remove unwanted spaces from a string in a column
Below action will remove all leading and trailing spaces from first_name and last_name columns in customer table
UPDATE customer
SET
first_name = TRIM(first_name),
last_name = TRIM(last_name);
______________________________________________
Once you find the correct last name, you can replace it using an UPDATE statement
UPDATE customer
SET last_name = 'Smith'
WHERE first_name = 'Roger'
If there is more than one Roger in the list, you can use customer_id instead of first_name to make this correction
______________________________________________
Often, we are provided with a description of the data where we can see the max and min values (range) of a column
Check using MAX(column_name) and MIN(column_name) to ensure that the data does not have incorrect values
For example, if we know that age column in the students table can have values between 10 and 12, then using MAX(age) and MIN(age) we can find if there are any incorrect values
______________________________________________
Once we know there are values below or above the expected range, we can use the COUNT function to find out how many students in the list have an incorrect age
SELECT COUNT(*)
FROM students
WHERE age = 13 OR age = 9
______________________________________________
If a significant number of values in a column appear incorrect, consult the data source for further guidance: should we correct them (find the correct data, or replace incorrect values with the column average) or delete them from the table with a DELETE statement?
DELETE FROM students
WHERE age = 13 OR age = 9
______________________________________________
If 'age' column has been assigned string data type but the column has numbers, then using CAST function, we can change the data type of column 'age'
SELECT CAST(age AS integer)
FROM students
ORDER BY CAST(age AS integer)
______________________________________________
Sometimes we need to combine the data of 2 columns into 1 column; for that we can use CONCAT
For example, date and time are in separate columns and we want to combine them into one column.
SELECT CONCAT(column1, column2) AS new_column_name
FROM table_name
______________________________________________
Using COALESCE, we can avoid null values in a column by returning the first non-null value from a list of columns
In the product table, we have the product code in one column and the product name in another, and in one row the product name is available but the product code is missing. We have been told that the product code is not available at the moment, but we can't remove the row, as it is an important product.
So whenever data is pulled for the product code, we want to show the product name instead of a null value
SELECT COALESCE(product_code,product_name)
FROM product
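A quick runnable sketch of this behavior, with an invented two-row product table where one product code is missing:

```python
import sqlite3

# COALESCE returns the first non-null argument for each row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_code TEXT, product_name TEXT);
INSERT INTO product VALUES ('P100','Widget'), (NULL,'Gadget');
""")
result = conn.execute(
    "SELECT COALESCE(product_code, product_name) FROM product"
).fetchall()
print(result)  # [('P100',), ('Gadget',)]
```

The row with a NULL product code falls back to the product name instead of showing NULL.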
______________________________________________
Using a CASE statement, we can replace incorrect strings with corrected ones
SELECT customer_id,
CASE
WHEN first_name = 'Jonh' THEN 'John'
WHEN first_name = 'Rajeev' THEN 'Rajiv'
ELSE first_name
END AS corrected_name
FROM table_name
_________________________________________
HOW TO WRITE A SIMPLE SQL QUERY
There is a public dataset on the Kaggle website which gives Citi Bike usage in New York. The bike-sharing company has 2 types of users: subscribers (most likely commuters who need a bike every day from one station to another) and customers (those who rent a bike as and when needed).
I am trying to find out which routes are most popular with each user type. I would like to see the end result in different columns.
This is what I have in mind to accomplish as end result
This is the query I wrote in SQL
I am combining the start and end station names using the CONCAT function and calling the column route
I am counting rows and calling the column num_trips, as it shows the frequency of trips
I found that the tripduration column has numerical values but its data type is string, so I am converting it from string to INT using the CAST function. Also, I am looking for the average trip duration in minutes, so after using the AVG function I divide the result by 60, since trip duration is recorded in seconds
I am removing any rows where trip duration is not available using a WHERE tripduration IS NOT NULL clause
And finally, I am grouping and ordering the result using the GROUP BY and ORDER BY clauses
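The steps above can be sketched as one runnable query. This uses a tiny local SQLite table with invented rows; the column names follow the public citibike_trips table, and BigQuery would use CONCAT rather than the || operator shown here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citibike_trips (
  usertype TEXT, start_station_name TEXT, end_station_name TEXT, tripduration TEXT);
INSERT INTO citibike_trips VALUES
  ('Subscriber','A','B','600'), ('Subscriber','A','B','720'),
  ('Customer','B','C','300'),   ('Subscriber','A','B',NULL);
""")
rows = conn.execute("""
SELECT usertype,
       start_station_name || ' to ' || end_station_name AS route,  -- CONCAT in BigQuery
       COUNT(*) AS num_trips,
       ROUND(AVG(CAST(tripduration AS INT)) / 60, 2) AS avg_duration_minutes
FROM citibike_trips
WHERE tripduration IS NOT NULL       -- drop rows with missing duration
GROUP BY usertype, route
ORDER BY num_trips DESC
""").fetchall()
print(rows)
```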
This is a simple SQL query, and the result is below
_________________________________________
EXAMPLE of a sub-query in SQL (under the FROM clause)
There is a public dataset for New York citibike that I have used for this example.
This dataset has 2 tables- One is named as citibike_stations and the other is named as citibike_trips
Our requirement is to get data for number of rides starting at each station so we know which is the busiest station for bike business and accordingly bike availability can be maintained.
We have something like this in mind for end result.
station_id is a common column across the two tables, but only citibike_trips records it as start_station_id.
So here is the general idea for querying...
From citibike_trips table, get start_station_id and using COUNT function, get the number of trips for them.
Something like this...
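A runnable sketch of that idea, with invented rows for the two tables: the sub-query under FROM counts trips per start station, and the outer query attaches station names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citibike_stations (station_id INTEGER, name TEXT);
CREATE TABLE citibike_trips (trip_id INTEGER, start_station_id INTEGER);
INSERT INTO citibike_stations VALUES (1,'Central Park'),(2,'Times Sq');
INSERT INTO citibike_trips VALUES (10,1),(11,1),(12,2);
""")
rows = conn.execute("""
SELECT s.name, t.num_trips
FROM citibike_stations AS s
JOIN (SELECT start_station_id, COUNT(*) AS num_trips
      FROM citibike_trips
      GROUP BY start_station_id) AS t   -- sub-query under FROM
  ON s.station_id = t.start_station_id
ORDER BY t.num_trips DESC
""").fetchall()
print(rows)  # busiest station first
```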
Example of How to build a complex query in SQL
Dataset name: warehouse_orders
Table names: orders, warehouse
I have uploaded these tables into BigQuery, and here are the contents of these tables
Orders table:
- No of orders fulfilled can be obtained using COUNT function from orders table
- Total number of orders can be obtained from orders table using SUM function
- Warehouse name can be obtained from the warehouse table (it is already available in the column named warehouse_alias)
- Warehouse_id is in both tables so we can join both tables using this column
Let us join the tables in the first query so we get the warehouse_id and warehouse name columns for our end result
SELECT
warehouse.warehouse_id,
warehouse.warehouse_alias
FROM
`my-sql-project-336018.warehouse_orders.warehouse` AS warehouse
LEFT JOIN
`my-sql-project-336018.warehouse_orders.orders` AS orders
ON
warehouse.warehouse_id = orders.warehouse_id
I am joining tables with LEFT JOIN...why?
Because I want to ensure that every warehouse_id in the warehouse table is included, along with the corresponding data from the orders table where it exists.
Which means, if there are warehouse_ids in the orders table but not in the warehouse table, they are not required for our task.
Also, always check that the sub-query runs successfully on its own, because we are going to use its result in the main query. We get the result below when we run this sub-query.
Next, let us find out number of orders by using COUNT function in orders table.
COUNT(orders.order_id) AS number_of_orders_fulfilled
And since we are using aggregation, we add a GROUP BY clause too...
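Putting the join, the COUNT and the GROUP BY together, the query so far looks roughly like this. The sketch runs on a tiny local SQLite copy of the two tables with invented rows; the real query uses the full BigQuery table paths shown earlier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse (warehouse_id INTEGER, warehouse_alias TEXT);
CREATE TABLE orders (order_id INTEGER, warehouse_id INTEGER);
INSERT INTO warehouse VALUES (1,'Lakeside'),(2,'Downtown'),(3,'New Site');
INSERT INTO orders VALUES (101,1),(102,1),(103,2);
""")
rows = conn.execute("""
SELECT w.warehouse_id, w.warehouse_alias,
       COUNT(o.order_id) AS number_of_orders_fulfilled
FROM warehouse AS w
LEFT JOIN orders AS o ON w.warehouse_id = o.warehouse_id
GROUP BY w.warehouse_id, w.warehouse_alias
""").fetchall()
print(rows)  # a warehouse with no orders shows a count of 0
```

Because of the LEFT JOIN, a warehouse with no matching orders still appears, with COUNT(o.order_id) returning 0.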
Some warehouses have zero fulfilled orders. When we checked the details of the data, it said some warehouses are newly built and not yet in operation, so we will have to remove those from the list.
If we spot-check a couple of samples to make sure the result is right: out of 9999 total orders, when 548 are fulfilled, that is 5.48% fulfillment.
And to remove warehouses with zero orders, we can filter them out of the result (for example, with a HAVING clause).
______________________________________________
First, let us create a new database named walmart.
We are trying to create a departments table for Walmart, as it has many departments and each department is under a division.
Now we can see that the database has been created
Next, we will create a table under this database. From the dropdown menu at the top, select Tools and then Query Tool
It opens a new window where we can write SQL code. We are going to name this table departments, and we want 2 columns in it: department and division.
Another important thing to note is that we want the department column to be the primary key. There can be only 1 primary key, and it acts as a constraint on the column: it prevents any duplicate or null values, as it is a unique key.
A good practice is to keep some margin for future use.
Maybe the list you have right now shows only short department names, but if a new department with a longer name is created in the future, you should have enough room in your column to allow it.
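The table described above can be sketched in plain SQL like this. The column sizes and the sample departments are assumptions for illustration; SQLite is used here only to make the sketch runnable, and the same statements work in PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
  department VARCHAR(100) PRIMARY KEY,  -- unique key, room for long names
  division   VARCHAR(100)
);
-- example rows; the real department list comes from the prepared data
INSERT INTO departments VALUES
  ('Produce','Grocery'),
  ('Bakery','Grocery'),
  ('Toys','General Merchandise');
""")
rows = conn.execute("SELECT * FROM departments ORDER BY department").fetchall()
print(rows)
```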
And we can see under the walmart database schema that the table has been created.
At the moment it is an empty table without any entries in it.
Next, we will add data to these columns using the INSERT statements we have already prepared.
We will always confirm after executing the query that it was successful.
And if we want to see how the departments table looks, we can check that now.
Using this, we can create as many tables as we need under this database.
Simple explanation of how grouping works in SQL
To get a general idea of how grouping works in SQL, let's quickly create a small table and then use GROUP BY to see what happens.
We have now created fruits table with some entries in it.
This is how the table looks...
We have intentionally added some null entries to the column to see how they are grouped.
Data provided to us often includes null values, so let's see how grouping treats them.
The GROUP BY clause tells us how many groups form in the column. There are 4 groups: Orange, Mango, Apple and NULL
If we further add the COUNT function, we can see how many items are in each group...
Now we can clearly see how many records are in each group.
It is important to note that NULL is treated as its own group, and here it has 3 records.
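The whole experiment fits in a few lines. The fruit names are examples; the point is that the three NULL rows collapse into a single group:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fruits (name TEXT);
INSERT INTO fruits VALUES
  ('Orange'),('Orange'),('Mango'),('Apple'),('Apple'),('Apple'),
  (NULL),(NULL),(NULL);
""")
# COUNT(*) counts rows, so the NULL group is counted too
# (COUNT(name) would report 0 for it, since COUNT(column) skips NULLs).
rows = conn.execute("""
SELECT name, COUNT(*) AS cnt
FROM fruits
GROUP BY name
""").fetchall()
print(rows)
```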
____________________________________________
How to upload a CSV file to PostgreSQL
When a spreadsheet is given to you to upload to SQL, you can follow this procedure.
First, convert the spreadsheet to a .csv (comma-delimited) file: just open the file and save it as a CSV (comma-delimited) file.
Next, check how many columns there are in your spreadsheet, and then create those columns manually in SQL by following this process.
We have created instacart database here. Under Schemas, we can see tables. Right click and select CREATE TABLE option.
Just enter table name that you want to create. We have used here table name as orders. Do not click SAVE yet.
Before you click SAVE, click on columns from top menu bar. Using plus sign on the right, you can now add more columns one by one.
Now we have created all the columns one by one. Just remember that the column names should match those in the CSV file exactly; otherwise you will keep getting errors while uploading the data.
Another thing: there are options to set up column constraints by activating the Not NULL and Primary Key options. The Not NULL option will prevent any null values in the column, and the primary key will allow only unique, non-null values.
Do this ONLY IF you are sure that the data in the CSV file is clean. Otherwise, do not activate these constraints, because you will keep getting errors while uploading the CSV file.
Now you can click SAVE at the bottom of the page. Our table structure is ready.
Next, confirm it by running the SELECT command. We can see all columns created correctly but there is no data yet.
Now, we will upload data from csv file that is saved to our computer to this table.
Right click on orders and select import/export option
Select following options from import/export page.
- Turn the option to import
- Give the location of your order.csv file on your computer
- Select the format as csv
- If your csv file has header row displaying column names then select header as Yes
- And select the delimiter as "," because our CSV file is a comma-delimited file.
Once you click OK, it uploads the CSV file, and you will get a confirmation message once it is done.
Click on More details to see how many rows were uploaded and how long it took
It shows 3.4 million rows uploaded in 16.93 seconds. Check your CSV file and confirm it has the same number of rows; if so, all the data is uploaded. If not, check the data in the CSV file for issues.
Now we can view the data using SELECT command
____________________________
How to use sub-query to get required data from multiple tables
When the end result we need requires us to take data from multiple tables and put it together in one table, this can be accomplished in 2 ways:
either we join the tables using keys, or we use a sub-query.
Let's see how can we do it using sub-query by following an example.
We are given 3 tables...
students table has data related to student age, student number and student name
courses table has data related to course_number, course_title and credits per course
student_enrollment table has data related to student_number and course_number
- The students table is not directly related to the courses table. It just contains student details.
- The courses table just contains courses information.
- The table that relates both the students table and courses table is the student_enrollment table. Which student is enrolled in which course is captured in the student_enrollment table.
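A typical sub-query over these three tables: find the names of students enrolled in a given course by looking up through student_enrollment. The course title and student data below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_number INTEGER, student_name TEXT, age INTEGER);
CREATE TABLE courses (course_number TEXT, course_title TEXT, credits INTEGER);
CREATE TABLE student_enrollment (student_number INTEGER, course_number TEXT);
INSERT INTO students VALUES (1,'Asha',11),(2,'Ben',12),(3,'Carl',10);
INSERT INTO courses VALUES ('CS110','Databases',3),('MA101','Algebra',4);
INSERT INTO student_enrollment VALUES (1,'CS110'),(2,'CS110'),(3,'MA101');
""")
# Nested sub-queries walk from courses -> enrollment -> students.
rows = conn.execute("""
SELECT student_name
FROM students
WHERE student_number IN (SELECT student_number
                         FROM student_enrollment
                         WHERE course_number IN (SELECT course_number
                                                 FROM courses
                                                 WHERE course_title = 'Databases'))
""").fetchall()
print(rows)  # students enrolled in Databases
```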
How to use conditional statement using CASE
Let's say we have a table that gives information about employees
We want to categorize employees based on hire_date.
Anyone who was hired before 2006-01-01 is Super skilled Employee
Anyone hired between 2006-01-01 and 2012-01-01 is Skilled Employee
Anyone hired after 2012-01-01 is New Employee
Our task is to count them based on hire date and provide results in 3 columns.
Something like this...
We can accomplish this by using CASE statement.
Let's try to derive the first column and see if it comes out as expected.
We are using outputs of 1 and 0 so we can use the SUM function to get the total for the super_skilled category
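The same SUM-over-CASE pattern, extended to all three categories, can be sketched like this. The hire dates are invented, and string comparison works here because the dates are in ISO format:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, hire_date TEXT);
INSERT INTO employees VALUES
  (1,'2004-05-01'),(2,'2008-03-15'),(3,'2015-07-20'),(4,'2011-12-31');
""")
# Each CASE emits 1 when the row falls in the category, 0 otherwise,
# so SUM gives the per-category head count in a single row.
row = conn.execute("""
SELECT SUM(CASE WHEN hire_date < '2006-01-01' THEN 1 ELSE 0 END) AS super_skilled,
       SUM(CASE WHEN hire_date >= '2006-01-01'
                 AND hire_date < '2012-01-01' THEN 1 ELSE 0 END) AS skilled,
       SUM(CASE WHEN hire_date >= '2012-01-01' THEN 1 ELSE 0 END) AS new_employee
FROM employees
""").fetchone()
print(row)  # (1, 2, 1)
```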
This looks fine so we can now go ahead and add other 2 categories and get the desired result.
The end result counts should total 1000, as our table has 1000 employee records.
226 + 394 + 380 = 1000
This confirms that the result is correct.
_________________________________________
How to use sub-query in SQL to display data in segments
We are given a table (fruit_imports) which shows us details of imports of different fruits from various states in USA.
Our task is to segregate fruits based on supply availability as per below criteria.
If supply is below 20000 then mark it as Short supply fruit
If supply is between 20000 and 50000 then mark it as Regular supply fruit
If supply is over 50000 then mark it as Excessive supply fruit
This will help the concerned department to decide on the price of fruits as well as arrange to place more orders for fruits that are in low supply.
We are looking for end result in 3 columns...
Fruit : names of all fruits
Total Supply : From the given table, get total supply of each fruit
Supply Category : SHORT, REGULAR, EXCESSIVE
This is a very common scenario in many situations. Same query logic can be applied in all such situations.
For example, sale of certain item in segments, segments of customers based on order frequency, creating segments of inventory based on fast-low-regular movement, create groups of days in a year based on hot - average - cold temperature...the list goes on and on...
Since all column names we require are to be included in SELECT statement, that would be a good start
SELECT name AS fruit, SUM(supply) AS total_supply
and for categorization, we can use CASE statement as third column.
We are going to try one category first (SHORT SUPPLY) and see if results are as per our expectation and then repeat the same process for other two categories
SELECT name AS fruit, SUM(supply) AS total_supply, CASE
WHEN SUM(supply) < 20000 THEN 'SHORT'
END
AS supply_category
FROM fruit_imports
GROUP BY name
We can verify that when supply is less than 20000 then category is printed as SHORT. This looks good so we can now repeat the process for other two categories and complete the result.
That's how we get our result. Sometimes, if the supply numbers are dynamic and can change, then instead of calculating SUM(supply) in the main SELECT query, we can also do it in a sub-query under the FROM clause
Here is an alternative way...
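A runnable sketch of that alternative, with invented import rows; the inner query aggregates, and the outer query only categorizes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fruit_imports (name TEXT, state TEXT, supply INTEGER);
INSERT INTO fruit_imports VALUES
  ('Apple','WA',30000),('Apple','NY',25000),
  ('Kiwi','CA',12000),('Mango','FL',60000);
""")
rows = conn.execute("""
SELECT fruit, total_supply,
       CASE WHEN total_supply < 20000 THEN 'SHORT'
            WHEN total_supply BETWEEN 20000 AND 50000 THEN 'REGULAR'
            ELSE 'EXCESSIVE'
       END AS supply_category
FROM (SELECT name AS fruit, SUM(supply) AS total_supply
      FROM fruit_imports
      GROUP BY name)                         -- sub-query under FROM
""").fetchall()
print(rows)
```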
Note that in the main SELECT query, we are using the aliases for name and SUM(supply) that we defined in the sub-query under the FROM clause.
Getting the same result either way.
_________________________________________
Difference between a correlated subquery and a window function
We are given employees table as below.
And our requirement is to get an output in 3 columns. First Name, Department and total no. of employees per department.
One way to achieve this is by using a correlated sub-query.
Something like this...
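A sketch of such a correlated sub-query, on a toy employees table with invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (first_name TEXT, department TEXT);
INSERT INTO employees VALUES
  ('Ana','Sales'),('Bob','Sales'),('Cy','IT');
""")
rows = conn.execute("""
SELECT first_name, department,
       (SELECT COUNT(*)
        FROM employees AS e2
        WHERE e2.department = e1.department) AS dept_count  -- re-runs per outer row
FROM employees AS e1
""").fetchall()
print(rows)  # [('Ana', 'Sales', 2), ('Bob', 'Sales', 2), ('Cy', 'IT', 1)]
```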
We are relating the inner query (sub-query) to the outer query on the department column, so the inner query is executed for every record in the outer query.
If the employees table has 1000 records, the inner query runs 1000 times. Understandably, this is a very slow process for large tables, and an expensive one too, as most companies these days pay for cloud usage by time used.
An alternative way to do this is with the window function OVER( )
The advantage is that it is much quicker, and we can group data by any column without using a GROUP BY clause.
We can use OVER( ) with any aggregate function, like SUM, AVG, MAX or MIN.
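The same department count with a window function, on the same kind of toy table; the table is scanned once instead of once per row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (first_name TEXT, department TEXT);
INSERT INTO employees VALUES
  ('Ana','Sales'),('Bob','Sales'),('Cy','IT');
""")
rows = conn.execute("""
SELECT first_name, department,
       COUNT(*) OVER (PARTITION BY department) AS dept_count  -- no GROUP BY needed
FROM employees
""").fetchall()
print(rows)
```

Every row is kept (unlike GROUP BY, which collapses rows), with the per-department count attached to each.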
And if we want to ensure that all records are returned in both methods, we can use EXCEPT to see if any difference.
This confirms that both are getting same results
_________________________________________
How to find duplicate values in SQL using window function
We are given email table which includes user_id, user_name and email
and our task is to find out duplicate record from this table.
How can we get this result using a window function?
Idea: assign a row number to each record with the ROW_NUMBER() window function, partitioned by user_name (which should be unique for each user). If a user_name is duplicated, its partition contains more than one row, so the same user_name ends up with more than one row number.
Something like this...
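A runnable sketch of the idea, with invented rows that include the duplicate user_name from the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email (user_id INTEGER, user_name TEXT, email TEXT);
INSERT INTO email VALUES
  (1,'Asha','asha@x.com'),(2,'Keeny','k@x.com'),(3,'Keeny','k2@x.com');
""")
rows = conn.execute("""
SELECT user_name
FROM (SELECT user_name,
             ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY user_id) AS rn
      FROM email)
WHERE rn > 1          -- any row numbered 2 or higher is a duplicate
""").fetchall()
print(rows)  # [('Keeny',)]
```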
And now it is an easy task to pick out the rows whose row number is greater than 1.
So we know that Keeny is a duplicate user_name.
Using this method we can find any duplicates in the column. Just have to make sure that the column that we are taking as a reference (user_name in this case) is supposed to have unique records.
In this table, user_id is unique and user_name is also supposed to be unique. We checked user_id for duplicates first and found none, so we checked the other unique column and found a duplicate.
If we had not found a duplicate using user_name, we could also check the email column, as each user_name is supposed to have a unique email address.
_________________________________________
How to use LEAD function in SQL
We are given a table with details of login_id, user_name and login_date. This data is from an e-commerce portal where we want to know if a user logged in 3 times or more in a day.
We can see that Stewart logged in 3 times on Feb 16 hence that is our expected result.
Idea:
If the user name in the current row equals the user name in the next row (LEAD) AND the user name in the current row equals the user name in the row after that, AND the same conditions hold for login_date, then that is the user name we are looking for
This is how we can accomplish it in a query
You can see here we are using LEAD(user_name,1) for the next row and LEAD(user_name,2) for the next to next row.
Now, since the hard part is already completed, it is a matter of using WHERE to choose everything in the repeat_user column that is NOT NULL to get our answer.
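The whole approach can be sketched as follows. The table name, ordering columns and sample rows are assumptions for illustration; the key part is comparing each row against the next two rows with LEAD:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (login_id INTEGER, user_name TEXT, login_date TEXT);
INSERT INTO logins VALUES
  (1,'Stewart','2022-02-16'),(2,'Stewart','2022-02-16'),
  (3,'Stewart','2022-02-16'),(4,'Ava','2022-02-16');
""")
rows = conn.execute("""
SELECT DISTINCT repeat_user
FROM (
  SELECT CASE
           WHEN user_name  = LEAD(user_name,1)  OVER (ORDER BY user_name, login_date, login_id)
            AND user_name  = LEAD(user_name,2)  OVER (ORDER BY user_name, login_date, login_id)
            AND login_date = LEAD(login_date,1) OVER (ORDER BY user_name, login_date, login_id)
            AND login_date = LEAD(login_date,2) OVER (ORDER BY user_name, login_date, login_id)
           THEN user_name
         END AS repeat_user      -- NULL unless 3 consecutive matches
  FROM logins
)
WHERE repeat_user IS NOT NULL
""").fetchall()
print(rows)  # [('Stewart',)]
```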
____________________________
When to use self join in SQL
Requirement:
Find the acceptance rate of Facebook friend requests by date. The date to be used for the calculation is the date on which the request was sent.
We have a dataset of Facebook friend requests.
Let us list down steps we need to take to accomplish the intended result.
- Find total number of requests accepted (A)
- Find total number of requests sent (B)
- Divide A/B to get the results
Now, at this point we have 2 options to join these two results. Either by creating CTE or by subquery.
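Here is a CTE-based sketch of those steps. The schema is an assumption (one table logging both 'sent' and 'accepted' actions per sender/receiver pair, with invented rows); the real dataset may split these into separate tables, but the join logic is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friend_requests (sender_id INT, receiver_id INT, action TEXT, action_date TEXT);
INSERT INTO friend_requests VALUES
  (1,2,'sent','2022-01-04'),(1,2,'accepted','2022-01-05'),
  (3,4,'sent','2022-01-04'),
  (5,6,'sent','2022-01-05'),(5,6,'accepted','2022-01-05');
""")
rows = conn.execute("""
WITH sent AS (
  SELECT action_date AS sent_date, sender_id, receiver_id
  FROM friend_requests WHERE action = 'sent'
), accepted AS (
  SELECT sender_id, receiver_id
  FROM friend_requests WHERE action = 'accepted'
)
SELECT s.sent_date,
       ROUND(1.0 * COUNT(a.sender_id) / COUNT(*), 2) AS acceptance_rate
FROM sent AS s
LEFT JOIN accepted AS a
  ON s.sender_id = a.sender_id AND s.receiver_id = a.receiver_id
GROUP BY s.sent_date           -- rate is attributed to the SENT date
""").fetchall()
print(rows)
```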
Complex query example
Requirement:
You are given the table with titles of recipes from a cookbook and their page numbers. You are asked to represent how the recipes will be distributed in the book.
Produce a table consisting of three columns: left_page_number, left_title and right_title. The k-th row (counting from 0) should contain the number and the title of the page numbered 2*k in the first and second columns respectively, and the title of the page numbered 2*k+1 in the third column.
Each page contains at most 1 recipe. If the page does not contain a recipe, the appropriate cell should remain empty (NULL value). Page 0 (the internal side of the front cover) is guaranteed to be empty.
But what if in future there will be more pages added to this recipe book? In that case, instead of hardcoding page numbers from 0 to 15, we can let the subquery decide how many page numbers are needed.
Now we have the 2 columns we required, and the task from here is to add a 3rd column using a conditional CASE statement. Since our desired output has only even page numbers in the first column, we can use whether a page number is odd or even as the criterion for building the columns.
Now we have segregated the recipes into left and right titles, but we still have 16 rows, while our desired output has only 8. We can use GROUP BY to reduce the number of rows from 16 to 8 while keeping the data in the left_title and right_title columns.
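All of those steps together can be sketched as one query. The cookbook rows are invented (3 recipes, so 6 pages rather than 16), and a recursive CTE generates the page numbers so the range follows the data instead of being hardcoded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cookbook (page_number INTEGER, title TEXT);
INSERT INTO cookbook VALUES (1,'Pancakes'),(2,'Omelette'),(5,'Salad');
""")
rows = conn.execute("""
WITH RECURSIVE pages(p) AS (
  SELECT 0
  UNION ALL
  SELECT p + 1 FROM pages
  WHERE p < (SELECT MAX(page_number) FROM cookbook)  -- let the data set the range
)
SELECT MIN(CASE WHEN p % 2 = 0 THEN p END)     AS left_page_number,
       MAX(CASE WHEN p % 2 = 0 THEN title END) AS left_title,   -- even page
       MAX(CASE WHEN p % 2 = 1 THEN title END) AS right_title   -- odd page
FROM pages
LEFT JOIN cookbook ON page_number = p   -- pages with no recipe stay NULL
GROUP BY p / 2                          -- pair up pages 2k and 2k+1
""").fetchall()
print(rows)
```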