SQL, or Structured Query Language, is a programming language used to access, extract, wrangle, and explore data stored in relational databases. pandas is a Python open-source library specifically designed for data manipulation and analysis.
In this tutorial, we’re going to discuss when and how we can (and when we cannot) use SQL functionality in the framework of pandas. In addition, we’ll take a look at various examples of implementing this approach and compare the results with the equivalent code in pure pandas.
Why Use SQL in pandas?
Given the definitions in the introduction, why should one want to use SQL combined with pandas when the latter is an all-inclusive package for data analysis?
The answer is that on some occasions, especially for complex programs, SQL queries look much more straightforward and easy to read than the corresponding code in pandas. This is particularly true for those people who initially used SQL to work with data and then later learned pandas.
If you need more training on pandas, you can check out our Data Manipulation with pandas course and Pandas Tutorial: DataFrames in Python.
To see SQL readability in action, let’s suppose that we have a table (a dataframe) called penguins
containing various information on penguins (and we will work with such a table later in this tutorial). To extract all the unique species of penguins who are males and who have flippers longer than 210 mm, we would need the following code in pandas:
penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique()
Instead, to get the same information using SQL, we would run the following code:
SELECT DISTINCT species FROM penguins WHERE sex = 'Male' AND flipper_length_mm > 210
The second piece of code, written in SQL, looks almost like a natural English sentence and hence is much more intuitive. We can further increase its readability by spanning it over multiple lines:
SELECT DISTINCT species FROM penguins WHERE sex = 'Male' AND flipper_length_mm > 210
Now that we identified the advantages of using SQL for pandas, let’s see how we can technically combine them both.
How to Use pandasql
The pandasql Python library allows querying pandas dataframes by running SQL commands without having to connect to any SQL server. Under the hood, it uses SQLite syntax, automatically detects any pandas dataframe, and treats it as a regular SQL table.
Setting up your environment
First, we need to install pandasql:
pip install pandasql
Then, we import the required packages:
from pandasql import sqldf import pandas as pd
Above, we directly imported the sqldf
function from pandasql, which is virtually the only meaningful function of the library. As its name suggests, it’s applied to query dataframes using SQL syntax. Apart from this function, pandasql comes with two simple built-in datasets that can be loaded using the self-explanatory functions load_births()
and load_meat()
.
pandasql Syntax
The syntax of the sqldf
function is very simple:
sqldf(query, env=None)
Here, query
is a required parameter that takes in a SQL query as a string, and env
—an optional (and rarely useful) parameter that can be either locals()
or globals()
and allows sqldf
to access the corresponding set of variables in your Python environment.
The sqldf
function returns the result of a query as a pandas dataframe.
When we can use pandasql
The pandasql library allows working with data using the Data Query Language (DQL), which is one of the subsets of SQL. In other words, with pandasql, we can run queries on the data stored in a database to retrieve the necessary information from it. In particular, we can access, extract, filter, sort, group, join, aggregate the data, and perform mathematical or logical operations on it.
When we cannot use pandasql
pandasql doesn’t allow employing any other subsets of SQL apart from DQL. This means that we can’t apply pandasql to modify (update, truncate, insert, etc.) tables or change (update, delete, or insert) the data in a table.
In addition, since this library is based on SQL syntax, we should beware of the known quirks in SQLite.
Examples of using pandasql
Now, we’ll take a more granular look at how to run SQL queries on pandas dataframes using the sqldf
function of pandasql
. To have some data to practice on, let’s load one of the built-in datasets of the seaborn library—penguins
:
import seaborn as sns penguins = sns.load_dataset('penguins') print(penguins.head())
Output:
species island bill_length_mm bill_depth_mm flipper_length_mm \ 0 Adelie Torgersen 39.1 18.7 181.0 1 Adelie Torgersen 39.5 17.4 186.0 2 Adelie Torgersen 40.3 18.0 195.0 3 Adelie Torgersen NaN NaN NaN 4 Adelie Torgersen 36.7 19.3 193.0 body_mass_g sex 0 3750.0 Male 1 3800.0 Female 2 3250.0 Female 3 NaN NaN 4 3450.0 Female
If you need to refresh your SQL skills, our SQL Fundamentals skill track is a good reference point.
Extracting data with pandasql
print(sqldf('''SELECT species, island FROM penguins LIMIT 5'''))
Output:
species island 0 Adelie Torgersen 1 Adelie Torgersen 2 Adelie Torgersen 3 Adelie Torgersen 4 Adelie Torgersen
Above, we extracted information about the species and geography of the first five penguins from the penguins
dataframe. Note that running the sqldf
function returns a pandas dataframe:
print(type(sqldf('''SELECT species, island FROM penguins LIMIT 5''')))
Output:
<class 'pandas.core.frame.DataFrame'>
In pure pandas, it would be:
print(penguins[['species', 'island']].head())
Output:
species island 0 Adelie Torgersen 1 Adelie Torgersen 2 Adelie Torgersen 3 Adelie Torgersen 4 Adelie Torgersen
Another example is extracting unique values from a column:
print(sqldf('''SELECT DISTINCT species FROM penguins'''))
Output:
species 0 Adelie 1 Chinstrap 2 Gentoo
In pandas, it would be:
print(penguins['species'].unique())
Output:
['Adelie' 'Chinstrap' 'Gentoo']
Sorting data with pandasql
print(sqldf('''SELECT body_mass_g FROM penguins ORDER BY body_mass_g DESC LIMIT 5'''))
Output:
body_mass_g 0 6300.0 1 6050.0 2 6000.0 3 6000.0 4 5950.0
Above, we sorted our penguins by body mass in descending order and displayed the top five values of body mass.
In pandas, it would be:
print(penguins['body_mass_g'].sort_values(ascending=False, ignore_index=True).head())
Output:
0 6300.0 1 6050.0 2 6000.0 3 6000.0 4 5950.0 Name: body_mass_g, dtype: float64
Filtering data with pandasql
Let’s try the same example that we mentioned in the chapter Why use SQL in pandas: extracting the unique species of penguins who are males and who have flippers longer than 210 mm:
print(sqldf('''SELECT DISTINCT species FROM penguins WHERE sex = 'Male' AND flipper_length_mm > 210'''))
Output:
species 0 Chinstrap 1 Gentoo
Above, we filtered the data based on two conditions: sex = 'Male'
and flipper_length_mm > 210
.
The same code in pandas would look a bit more overwhelming:
print(penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique())
Output:
['Chinstrap' 'Gentoo']
Grouping and aggregating data with pandasql
Now, let’s apply data grouping and aggregation to find the longest bill for each species in the dataframe:
print(sqldf('''SELECT species, MAX(bill_length_mm) FROM penguins GROUP BY species'''))
Output:
species MAX(bill_length_mm) 0 Adelie 46.0 1 Chinstrap 58.0 2 Gentoo 59.6
The same code in pandas:
print(penguins[['species', 'bill_length_mm']].groupby('species', as_index=False).max())
Output:
species bill_length_mm 0 Adelie 46.0 1 Chinstrap 58.0 2 Gentoo 59.6
Performing mathematical operations with pandasql
With pandasql, we can easily perform mathematical or logical operations on the data. Let’s imagine that we want to calculate the bill length-to-depth ratio for each penguin and display the top five values of this measurement:
print(sqldf('''SELECT bill_length_mm / bill_depth_mm AS length_to_depth FROM penguins ORDER BY length_to_depth DESC LIMIT 5'''))
Output:
length_to_depth 0 3.612676 1 3.510490 2 3.505882 3 3.492424 4 3.458599
Note that this time, we used the alias length_to_depth
for the column with the ratio values. Otherwise, we would get a column with a monstrous name bill_length_mm / bill_depth_mm
.
In pandas, we would need first to create a new column with the ratio values:
penguins['length_to_depth'] = penguins['bill_length_mm'] / penguins['bill_depth_mm'] print(penguins['length_to_depth'].sort_values(ascending=False, ignore_index=True).head())
Output:
0 3.612676 1 3.510490 2 3.505882 3 3.492424 4 3.458599 Name: length_to_depth, dtype: float64
Conclusion
To wrap up, in this tutorial, we explored why and when we can combine the functionality of SQL for pandas to write better, more efficient code. We discussed how to set up and use the pandasql library for this purpose and what limitations this package has. Finally, we considered numerous popular examples of the practical application of pandasql and, in each case, compared the code with its pandas counterpart in each case.
Now you have everything you need to apply SQL for pandas in real-world projects. A great place for your practice is the DataLab, DataCamp’s AI-enabled data notebook with great SQL support.
Source:
https://www.datacamp.com/tutorial/how-to-use-sql-in-pandas-using-pandasql-queries