Make SQL simpler in 5 minutes.

Published in

Data Science Simplified

7 min readNov 20, 2023

Structured Query Language or SQL is by far, one of the most used languages in analytics out there. If you haven’t already got SQL in your arsenal of tools to use for data querying and extraction, you’re missing out on a really good weapon to have.

MySQL, PostgreSQL, Snowflake, etc. all these RDBMS tools use SQL to store enormous amounts of data for any large company today.

If you already know SQL — GREAT!

Otherwise, check this article out:

Excel V/S MySQL

- What can SQL do that Excel cannot?

medium.com

What do you think it is?

SELECT, UPDATE, CREATE, DELETE, WHERE, GROUP BY, ORDER BY, CASE WHEN, HAVING, etc.

Do they seem familiar? Awesome! (Psst, it’s not too difficult to learn them, you can check out the Snowflake docs link given below, I find it fairly simple and we’ll continue in this article with the Snowflake syntax)

But that’s not what this article is about, this one’s about a querying technique that most people miss out on because it’s not expected to be a game changer, but is one of the most fascinating ways of data manipulation in SQL. In fact, it is used by every SQL user out there, who works with SQL in analytics or engineering.

The CTE.

Common Table Expressions (or so they are called)

Let’s get right into it.

A CTE or a common table expression refers to a simple SQL construct that allows you to create a temporary table before performing operations on it. Think of it like an Excel table, that you filter or change something in, before re-manipulating it! Or just as a nested SQL query but you have way better control over the data.

Let’s see this with a basic example and then grow the idea to the next level.

---The idea is to write an SQL query to pull 3 columns from a table
---with exclusions based on multiple conditions

SELECT x,y,z FROM table_1
WHERE x NOT IN (SELECT 
                DISTINCT exc 
                FROM table_2
                WHERE exc IS NOT NULL)

This could, with a CTE be rewritten as;

WITH CTE AS 
(
SELECT DISTINCT exc
FROM table_2
WHERE exc IS NOT NULL
)

SELECT x,y,z FROM table_1
WHERE x NOT IN (SELECT * FROM CTE)

Not so different, but stay with the idea…

Going ahead with a better example to emphasize the better quality of the query attained after a little effort.

Let’s say you have multiple conditions to place on your table and mix some data types to make it a little complicated. Assume you have some purchase data which is given in multiple tables and you have aggregated the data based on purchase_date and found all transactions > 500 with the total amount and their return status.

SELECT 
purchase_date,
return_status,
COUNT(DISTINCT(purchase_id)) transactions, 
SUM(purchase_amount) amount_for_large_transactions
FROM purchases_table 
LEFT JOIN (SELECT purchase_id, return_status FROM returns_table) USING (purchase_id)
WHERE purchase_id NOT IN (SELECT 
                  DISTINCT purchase_id
                  FROM invalid_transactions
                  WHERE purchase_id IS NOT NULL)
AND purchase_id IN (SELECT 
                  DISTINCT purchase_id
                  FROM completed_transactions
                  WHERE purchase_id IS NOT NULL)
AND purchase_amount > 500
GROUP BY 1,2
HAVING return_status <> 'N/A'

STEP 1: The Joins.

You can always make your entire code more readable and less complicated when you change the complicated joins to work out of a CTE.

Let’s try that on the example query given above

WITH 
PURCHASE_MONTH_DATA AS
(
SELECT 
purchase_id, purchase_date, purchase_month  
FROM purchase_fact 
  LEFT JOIN (
            SELECT
            purchase_date,purchase_month 
            FROM date_dimension_table
            ) 
            USING (purchase_date)
),

RETURN_STATUS_DATA AS
(
SELECT 
purchase_id,return_status 
FROM returns_table
)

SELECT 
purchase_date,
return_status,
COUNT(DISTINCT(purchase_id)) transactions, 
SUM(purchase_amount) amount_for_large_transactions
FROM purchases_table 
  LEFT JOIN RETURNS_STATUS_DATA USING (purchase_id)
  LEFT JOIN PURCHASE_MONTH_DATA USING (purchase_id)
WHERE purchase_id NOT IN (SELECT 
                  DISTINCT purchase_id
                  FROM invalid_transactions
                  WHERE purchase_id IS NOT NULL)
AND purchase_id IN (SELECT 
                  DISTINCT purchase_id
                  FROM completed_transactions
                  WHERE purchase_id IS NOT NULL)
AND purchase_amount > 500
GROUP BY 1,2
HAVING return_status <> 'N/A'

However, even with the CTE titled PURCHASE_MONTH_DATA, there is still a little less readability within the CTE. There is no rule that suggests that you can only have one CTE in the SQL query or that you cannot have more than one CTE within a CTE.

A nested CTE can be complicated if attempted, but you can always break it down into different steps and make use of it. Let’s do that to our example and continue.

WITH
DATE_DIM AS 
(
SELECT
purchase_date,purchase_month 
FROM date_dimension_table
),

PURCHASE_MONTH_DATA AS
(
SELECT 
purchase_id, purchase_date, purchase_month  
FROM purchase_fact 
  LEFT JOIN DATE_DIM USING (purchase_date)
),

RETURN_STATUS_DATA AS
(
SELECT 
purchase_id,return_status 
FROM returns_table
)

SELECT 
purchase_date,
return_status,
COUNT(DISTINCT(purchase_id)) transactions, 
SUM(purchase_amount) amount_for_large_transactions
FROM purchases_table 
  LEFT JOIN RETURNS_STATUS_DATA USING (purchase_id)
  LEFT JOIN PURCHASE_MONTH_DATA USING (purchase_id)
WHERE purchase_id NOT IN (SELECT 
                  DISTINCT purchase_id
                  FROM invalid_transactions
                  WHERE purchase_id IS NOT NULL)
AND purchase_id IN (SELECT 
                  DISTINCT purchase_id
                  FROM completed_transactions
                  WHERE purchase_id IS NOT NULL)
AND purchase_amount > 500
GROUP BY 1,2
HAVING return_status <> 'N/A'

That’s a lot better. You can call a CTE in different subqueries to avoid redundancy of querying as well!

The next step is fairly simple, to make more CTEs.

This time, we’ll remove the chunks of code that target the invalid transactions and only include the completed transactions in our example.

WITH
DATE_DIM AS 
(
SELECT
purchase_date,purchase_month 
FROM date_dimension_table
),

PURCHASE_MONTH_DATA AS
(
SELECT 
purchase_id, purchase_date, purchase_month  
FROM purchase_fact 
  LEFT JOIN DATE_DIM USING (purchase_date)
),    
  
RETURN_STATUS_DATA AS
(
SELECT 
purchase_id,return_status 
FROM returns_table
),
  
INVALID_TRANSACTIONS_DATA AS 
(
SELECT 
DISTINCT purchase_id 
FROM invalid_transactions 
WHERE purchase_id IS NOT NULL
),

COMPLETED_TRANSACTIONS_DATA AS
(
SELECT 
DISTINCT purchase_id 
FROM completed_transactions 
WHERE purchase_id IS NOT NULL
)
  
SELECT 
  purchase_month,
  return_status,
  COUNT(DISTINCT(purchase_id)) transactions, 
  SUM(purchase_amount) amount_for_large_transactions
FROM purchases_table 
  LEFT JOIN RETURNS_STATUS_DATA USING (purchase_id)
  LEFT JOIN PURCHASE_MONTH_DATA USING (purchase_id)
WHERE 
    purchase_id NOT IN (SELECT * FROM INVALID_TRANSACTIONS_DATA)
AND purchase_id IN (SELECT * FROM COMPLETED_TRANSACTIONS_DATA)
AND purchase_amount > 500
  GROUP BY 1,2
  HAVING return_status <> 'N/A'

Et voila! And just as simple as that, you have a query that exactly tells you what goes where and has better readability on the whole. There’s a much bigger advantage to using CTEs and it revolves around debugging.

There were times when SQL queries that my team has been a part of preparing went as long as 1700 lines of a single query. The idea behind this might be something that could be broken down but to streamline it, maybe it was necessary. Either way, debugging the query would be next to impossible for a query that long without clearly pointing out… where things went wrong.

How do CTEs help with debugging?

Let’s take the same example and assume that you get an error around the return status of the purchases. This is fairly simple to solve with the CTE because you know the table that brings the return status of the customer is in the RETURN_STATUS_DATA table.

To check that table all you can do is;

WITH RETURN_STATUS_DATA AS
(
SELECT 
purchase_id,return_status 
FROM returns_table
)

SELECT * FROM RETURN_STATUS_DATA

Once you’re sure that the data looks right, you can proceed to use it in the next steps.

There’s one small step that remains after all this is done anyway. This may be a little extra to-do but trust me, it’s worth the effort.

COMMENTING

Preparing production-level SQL queries is not complete without commenting. The importance of commenting and why should it be mandatory for anyone… that’s out of the scope of this article, but let’s just comment out our final query so that it not only looks readable but understandable.

---Calculating monthly purchases and total amount from large transactions (>500) 
--- where the transactions are completed and not invalid

WITH
---taking dimension data for dates to pull month date, can be changed to other available dimensions as necessary
  DATE_DIM AS 
  ( SELECT purchase_date,purchase_month FROM date_dimension_table),
  
---pulling purchase_id's and the month they belong to
  PURCHASE_MONTH_DATA AS
  (SELECT purchase_id, purchase_date, purchase_month  FROM purchase_fact 
    LEFT JOIN DATE_DIM USING (purchase_date)),    
    
---tracing return status of the purchases to be used as a dimension in the final aggregations
  RETURN_STATUS_DATA AS
  (SELECT purchase_id,return_status FROM returns_table),
  
---tracing invalid transactions to exclude the purchases from the final aggregations
  INVALID_TRANSACTIONS_DATA AS 
  (SELECT DISTINCT purchase_id FROM invalid_transactions WHERE purchase_id IS NOT NULL),
  
---tracing completed transactions to ensure only completed transactions are being used in the aggregations
  COMPLETED_TRANSACTIONS_DATA AS
  (SELECT DISTINCT purchase_id FROM completed_transactions WHERE purchase_id IS NOT NULL)
  
---aggregating everything together
SELECT 
  purchase_month,
  return_status,
  COUNT(DISTINCT(purchase_id)) transactions, 
  SUM(purchase_amount) amount_for_large_transactions
FROM purchases_table 
----joining with CTEs to pull return status and month dimension on data
  LEFT JOIN RETURNS_STATUS_DATA USING (purchase_id)
  LEFT JOIN PURCHASE_MONTH_DATA USING (purchase_id)
----Excluding invalids and including only completed transactions
WHERE purchase_id NOT IN (SELECT * FROM INVALID_TRANSACTIONS_DATA)
  AND purchase_id IN (SELECT * FROM COMPLETED_TRANSACTIONS_DATA)
----filtering for large transactions
  AND purchase_amount > 500
GROUP BY 1,2
HAVING return_status <> 'N/A'

P.S. I know the comments can be better, but hey, I’m just trying to make a point here!

When all this is done and dusted, you can share this with your team to understand your perspective on the analysis you’ve prepared or just the table and see what they think about it and develop ahead.

From where we started, you can see that a query that was fairly tough to read was transformed into one that is easier to comprehend.

Leave a comment if you think I missed out on any other pointers that are relevant to the article or if you have used CTEs before and if you have any tips or tricks around them, thanks!

Do leave a clap and check out my previous articles here, if they interest you!

3 Ways to Create Better Data Stories!

Find out how you can write better data stories to tell compelling stories with data and share insights with the larger…

medium.com

Top 5 things that can make better analyses

If you want to take your analyses to the next level, find out what are the 5 key steps you can follow with any and…

medium.com

For all my articles:

Data Science Simplified

Understanding how Data Science and Statistics work, sharing my knowledge, and showing you how it is used everywhere and…

medium.co

Connect with me on LinkedIn: https://www.linkedin.com/in/yash-gupta-dss/

~ P.S. All the views mentioned in the article are my sole opinions. I enjoy sharing my perspectives on Data Science. Do contact me on LinkedIn at — Yash Gupta — if you want to discuss all things related to data further!

Make SQL simpler in 5 minutes.

Excel V/S MySQL

- What can SQL do that Excel cannot?

The CTE.

The next step is fairly simple, to make more CTEs.

How do CTEs help with debugging?

3 Ways to Create Better Data Stories!

Find out how you can write better data stories to tell compelling stories with data and share insights with the larger…

Top 5 things that can make better analyses

If you want to take your analyses to the next level, find out what are the 5 key steps you can follow with any and…

Data Science Simplified

Understanding how Data Science and Statistics work, sharing my knowledge, and showing you how it is used everywhere and…

Written by Yash Gupta