Hey everyone!
It's Josep, back with you for another week! 👋🏻
I'm writing from Amsterdam, down in the Zuid area, where I'm helping my brother get his new place ready to call home!
Today, stay with me for 6 minutes; trust me, it'll be worth your time!
The most important news of the week?
This weekend, I'm finally moving into my new apartment in Rotterdam! I'll be back to my regular routine of sending one update (or maybe even more) each week. Get ready!
Now that we're caught up, let's get into the important stuff! 👨🏻‍💻
After emphasizing the importance of learning SQL, and recently covering why you should get started with Python, today I want to share a quick tip on a library that brings the best of both worlds: pandasql.
This might be just what some of you need!
Before we dive in, I recommend checking out my previous posts if you haven't already. They'll give you a solid foundation!
Combining both SQL and Python
SQL, or Structured Query Language, has long been the go-to for data management, yet sometimes it falls short and you need the power and flexibility of Python.
Python, on the other hand, shines when it comes to accessing, extracting, wrangling, and exploring data from relational databases. One of Python's most popular open-source libraries, Pandas, is specifically built for data manipulation and analysis.
In this brief issue, we'll dive into how and when SQL functionality can be seamlessly integrated within the Pandas framework, and where it has its limitations.
So… the main question you might be wondering right now is…
WHY USE BOTH?
The reason lies in readability and familiarity. In certain cases, especially within complex workflows, SQL queries can be much clearer and easier to read than the equivalent Pandas code. This is especially true for those who started their data journey with SQL before transitioning to Pandas.
Furthermore, since most data originates from databases, SQL, being the native language of these systems, has a natural advantage. This is why many data professionals, especially data scientists, choose to integrate both SQL and Python (specifically, Pandas) within their data pipelines to leverage the strengths of each.
To see this readability in action, let's work with a Pokémon Gen 1 Pokédex CSV file as an example.
Imagine we want to sort the DataFrame by the "Total" column in ascending order and display the top 5 entries. Let's compare how to perform this action in both Pandas and SQL.
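Throughout the examples, data is the DataFrame holding that CSV. A minimal setup sketch could look like this (the filename is an assumption, so adjust it to wherever your copy lives):
import pandas as pd

# Load the Gen 1 Pokédex CSV into the DataFrame used in the examples below;
# "pokemon_gen1.csv" is a hypothetical filename.
data = pd.read_csv("pokemon_gen1.csv")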
Using Python Pandas:
data[["#", "Name", "Total"]].sort_values(by="Total", ascending=True).head(5)
Using SQL:
SELECT
"#",
Name,
Total
FROM data
ORDER BY Total
LIMIT 5
You can see how different the two look, right?
But… how can we combine both languages within our Python working environment?
The solution is using pandasql!
USING PANDASQL
Pandas is a powerful open-source Python library for data analysis and manipulation, and pandasql lets you query Pandas DataFrames using SQL syntax.
For people new to Pandas, pandasql makes data manipulation and cleanup feel much more familiar.
So let's try to get started…
#1. SETTING UP PANDASQL
First, we need to install pandasql:
! pip install pandasql
Then, as always, we import the required package:
from pandasql import sqldf
Here, we directly imported the sqldf function from pandasql, which is essentially the library's core feature. As the name suggests, sqldf allows you to query DataFrames using SQL syntax.
sqldf(query_string, env=None)
In this context, query_string is a required parameter that accepts a SQL query in string format. The env parameter, optional and seldom used, can be set to either locals() or globals(), enabling sqldf to access variables from the specified scope in your Python environment.
Beyond this function, pandasql also includes two basic built-in datasets that can be loaded with the straightforward functions load_births() and load_meat(), so you have some dummy data to play with!
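For instance, here is a minimal sketch of querying one of those sample datasets; SELECT * avoids assuming specific column names, and passing locals() simply illustrates the optional env argument:
from pandasql import sqldf, load_meat

# Built-in sample dataset shipped with pandasql
meat = load_meat()

# Passing locals() lets sqldf see the `meat` DataFrame defined in this scope
print(sqldf("SELECT * FROM meat LIMIT 3", locals()))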
So now, if we want to execute the previous SQL query within our Python Jupyter Notebook, it would look like this:
from pandasql import sqldf
import pandas as pd

# `data` is the Gen 1 Pokédex DataFrame loaded earlier with pd.read_csv
sqldf('''
SELECT "#", Name, Total
FROM data
ORDER BY Total
LIMIT 5''')
The sqldf function returns the result of the query as a Pandas DataFrame.
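Because the result is an ordinary DataFrame, you can keep working with it in Pandas right away; a small sketch (the output filename is hypothetical):
# Capture the query result and keep working with it as a regular DataFrame
weakest_five = sqldf('''
SELECT "#", Name, Total
FROM data
ORDER BY Total
LIMIT 5''')

print(type(weakest_five))                              # <class 'pandas.core.frame.DataFrame'>
weakest_five.to_csv("weakest_gen1.csv", index=False)   # hypothetical output filename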
#2. WHEN SHOULD WE USE IT
The pandasql library enables data manipulation using SQL's Data Query Language (DQL), providing a familiar, SQL-based approach to interact with data in Pandas DataFrames.
With pandasql, you can execute queries directly on your dataset, allowing for efficient data retrieval, filtering, sorting, grouping, joining, and aggregation.
Additionally, it supports performing mathematical and logical operations, making it a powerful tool for SQL-savvy users working with data in Python.
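As a quick illustration, here is a hedged sketch of a grouped aggregation over the Pokédex DataFrame; the "Type 1" column name is an assumption about the CSV's schema:
from pandasql import sqldf

# Count Pokémon and find the highest Total per primary type
strongest_per_type = sqldf('''
SELECT "Type 1"    AS type,
       COUNT(*)    AS pokemon_count,
       MAX(Total)  AS best_total
FROM data
GROUP BY "Type 1"
ORDER BY best_total DESC
''')
print(strongest_per_type)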
#3. WHEN SHOULD WE NOT USE IT
pandasql is limited to SQL's Data Query Language (DQL) subset, meaning it does not support modifying tables or data: actions like UPDATE, INSERT, or DELETE are not available.
Additionally, since pandasql relies on SQLite syntax under the hood, it's essential to be mindful of SQLite-specific quirks that may affect query behavior.
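So when you actually need to change the data, do it with regular Pandas instead; a minimal sketch (the specific names and values are hypothetical):
# Updates and deletions stay on the Pandas side, since pandasql only reads data
data.loc[data["Name"] == "Pikachu", "Total"] = 320        # the "UPDATE" equivalent
data = data[data["Total"] >= 200].reset_index(drop=True)  # the "DELETE" equivalent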
Final Conclusions
By comparing these approaches, you can see that pandasql is helpful for SQL-native users or scenarios with complex queries, while native Pandas code can be more Pythonic and integrated for those accustomed to working in Python.
You can check out some side-by-side comparisons of the two approaches in the following Jupyter Notebook.
Still with me? 🧐
As fellow data enthusiasts, I'm sure you'd be eager to help me shape some impactful KPIs and take this newsletter to the next level!
So hereβs how you can help:
👉🏻 I want this newsletter to be truly valuable for you, so please share your feedback!
Additionally, let me know your preferences for future content, or any ideas you think would add more value for you!
Before you go, tap the ❤️ button at the bottom of this email to show your support; it really helps and means a lot!
My latest articles 📝
Top 5 Free Machine Learning Courses to Level Up Your Skills in KDnuggets.
The Transformers Architecture - What's the magic behind LLMs? in AIgents.
Nice articles (my weekly favs)! ♥
My Favorite Resources! 👨🏻‍💻
I strongly recommend following AIgents roadmaps if you want to become a full-fledged Data Scientist, ML Engineer, or Data Engineer.
Want to enhance your coding workflow? You should check out Pieces for Developers.
Understand ML models by following my MLBasics series!
You can follow the Data Engineering Zoomcamp on GitHub to become a fully proficient Data Engineer.
Want to learn GCP? You can follow The Cloud Girl and learn using her intuitive illustrations!
Want to get more of my content? 🙋🏻‍♂️
Reach me on:
LinkedIn, X (Twitter), or Threads to get daily posts about Data Science.
My Medium Blog to learn more about Data Science, Machine Learning, and AI.
Just email me at rfeers@gmail.com for any inquiries or to ask for help! 🤝
Remember that DataBites has an official X (Twitter) account and LinkedIn page. Follow us there to stay updated and help spread the word! 🙏🏻