Hey you all! 👋🏻
It's Josep, back again (though running a bit behind this week)!
Japan has been more intense than I expected, and I’ve had less time to prepare, which is why I’ve been a bit slower getting this new issue out.
But the good news? Today’s issue is live, and I’ve already started prepping the upcoming ones, which I promise will be right on time!
Take just 7 minutes to dive in—trust me, it’s worth it!
Over the past few weeks, I’ve been to the USA (NYC, Washington, Philadelphia) 🇺🇸, spent some downtime in Bermuda 🌴 enjoying the stunning beaches, and I’ve been in Japan 🇯🇵 for the past 6 days with more adventures to come!
Now that we’ve caught up on life updates, let's dive into the important stuff 👨🏻💻.
Back in mid-August, we discussed the importance of SQL—and it turned out to be my most popular issue yet! 🙌🏻
Today, I want to dive into one of its fundamental concepts:
The execution order of queries.
Before we dive in, I strongly recommend checking out my previous issue on Getting Started with SQL (if you haven't already).
And now…
Let’s get started with the funny part 💥
Understanding SQL’s Execution Order 🔢
Today, SQL is an indispensable skill for any data professional working with large datasets.
But after writing countless SQL scripts, you might hit a performance plateau.
You’re extracting insights using the same methods and encountering the same types of errors.
Sound familiar?
We’ve all been there—carefully crafting a query that works flawlessly on a small dataset, only to watch it crawl when scaled up to a larger one.
The secret to overcoming this challenge lies in understanding how the database interprets and executes your SQL query.
SQL is a declarative language, meaning you specify what you want, not how to get it. While SQL statements follow a specific order in writing, their execution order is different.
The most common query structure looks as follows:
Even though a person would read — and code — following the previous structure, its execution order changes quite a bit.
For instance, though written first, the SELECT
clause isn’t evaluated until much later.
So, if we try to understand SQL’s query execution order visually, it would be something like follows:
Now let’s go command by command.
SQL execution order — command by command
#1. FROM
The FROM
clause is the starting point as it identifies the data source. Even though simple queries might reference a single table, the information needed often exists across multiple tables.
This is why the JOIN
command is usually combined with FROM
to unite the source tables. They are always executed first to determine the data of interest.
#2. WHERE
Next, the WHERE
clause comes in. Its main function is to filter the base table or joined output, returning only those rows that meet a given condition.
⚠️ One important thing to remember about the WHERE
clause is that it cannot be used to filter aggregated columns.
Want to know why?
Just keep reading… 👀
#3. GROUP BY
The GROUP BY
clause is executed to group the data based on the values in one or more columns. It allows us to aggregate or summarize quantities.
In simple terms, it collapses the variable or variables and returns a single value for each distinct element or combination of elements.
As you’ve probably recognized, this clause allows us to aggregate, or summarise, a quantity and is used in combination with functions such as COUNT()
, SUM()
,MIN()
, MAX()
and the like.
#4. HAVING
The HAVING
clause is executed to remove the created grouped records that don’t meet the constraints.
The HAVING
clause allows us to filter results using grouped and aggregated data because it is executed after the GROUP BY
clause.
The database is now aware of the aggregations meaning they can be used in all statements that follow.
And answering the previous question about why the WHERE
command cannot filter aggregated variables, this is because it is used before the GROUP BY
.
#5. SELECT
The SELECT
clause is where we define the columns we want in our table, along with any grouped and aggregated fields.
Here, we can apply column aliases using the AS
operator.
The SELECT
command is usually used together with the DISTINCT
, which allows us to discard any row with duplicated values in all columns marked as DISTINCT
.
#6. ORDER BY
With most of the heavy lifting done, the ORDER BY
clause is executed to sort the derived values in ascending (ASC
) or descending order (DESC
)
We have now arrived at the final stage.
We have obtained data from our sourced tables, applied filters, performed grouping and aggregation on selected fields, and specified the columns that we wish to include in our ultimate table.
#7. LIMIT
Finally, the LIMIT
clause helps define the number of rows we want back. It is particularly useful when dealing with large tables, especially during development and testing stages.
When dealing with large tables it’s often not optimal to have the query return all rows, particularly when you’re only developing and testing.
Simple Example of SQL’s Execution Order
Let’s imagine we have two different tables regarding a phone company. One with the clients information — CLIENTS — and the other one with their phone usage — USAGE.
Using both tables, we are going to find out what clients have overpassed their monthly usage limit. In this case, it is as easy as using the following query, where you can observe we have used all commands:
SELECT DISTINCT
usage.id,
clients.owner_name,
SUM(usage.usage_minutes) AS total_browsing_time,
clients.browsing_minutes_limit
FROM usage
LEFT JOIN clients
ON clients.ID = usage.ID
WHERE usage.Date > "2023-05-01" AND usage.Date < "2023-06-01"
GROUP BY 1,2,4
HAVING SUM(usage.usage_minutes) > clients.browsing_minutes_limit
ORDER BY clients.browsing_minutes_limit DESC
With the corresponding output
As previously stated, a user might read the query starting from the SELECT
clause and ending with the LIMIT
clause, but SQL doesn’t read clauses from top to bottom.
Instead, it executes it in the following order:
1. The FROM
and JOIN
clauses merge the Clients and Usage tables to obtain all required data.
2. The WHERE
clause removes any usage record that does not correspond to May.
3. The GROUP BY
clause aggregates all daily usage by id, owner-name and browsing_minutes_limit.
4. The HAVING
clause removes all clients that have spent less than their monthly usage limit.
5. The SELECT
clause derives all the data that we want.
6. The ORDER BY
clause sorts the results based on the usage limit.
7. The LIMIT
clause shows only the first 5 rows.
Easy… right?
Next week, we’ll explore how to write clean, reusable SQL queries—stay tuned! 💥
Are you still here? 🧐
👉🏻 I want this newsletter to be useful, so please let me know your feedback!
Additionally, you can let me know any preference for future content or any idea you think might add more value to you!
My latest articles 📝
Optimizing Your LLM for Performance and Scalability in KDnuggets.
How ChatGPT is Changing the Face of Programming in KDnuggets.
The Transformers Architecture - What’s the magic behind LLMs? in AIgents.
Nice articles (my weekly favs)! ♥
How to Think and Act Like a Data Scientist to Land Your First Job by
and .Day in the Life of a Data Scientist by
LangChain's text processing methods I by
My Favorite Resources! 👨🏻💻
I strongly recommend following AIgents roadmaps if you want to become a full Data Scientist, ML Engineer, or Data Engineer,
Understand ML models following my MLBasics series!
You can follow the Data Engineer Zoomcap on GitHub to become a fully proficient Data Engineer.
Want to learn GCP? You can follow The Cloud Girl and learn using her intuitive illustrations!
Want to get more of my content? 🙋🏻♂️
Reach me on:
LinkedIn, X (Twitter), or Threads to get daily posts about Data Science.
My Medium Blog to learn more about Data Science, Machine Learning, and AI.
Just email me at rfeers@gmail.com for any inquiries or to ask for help! 🤓
Remember now that DataBites has an official X (Twitter) account and LinkedIn page. Follow us there to stay updated and help spread the word! 🙌🏻
Great guide Josep!
And I’m glad to hear you are enjoying your traveling ✈️
thanks for the share