How to Use SQL in pandas Using pandasql Queries

Tutorials

Python SQL

SQL, or Structured Query Language, is a programming language used to access, extract, wrangle, and explore data stored in relational databases. pandas is a Python open-source library specifically designed for data manipulation and analysis.

In this tutorial, we’re going to discuss when and how we can (and when we cannot) use SQL functionality in the framework of pandas. In addition, we’ll take a look at various examples of implementing this approach and compare the results with the equivalent code in pure pandas.

Why Use SQL in pandas?

Given the definitions in the introduction, why should one want to use SQL combined with pandas when the latter is an all-inclusive package for data analysis?

The answer is that on some occasions, especially for complex programs, SQL queries look much more straightforward and easy to read than the corresponding code in pandas. This is particularly true for those people who initially used SQL to work with data and then later learned pandas.

If you need more training on pandas, you can check out our Data Manipulation with pandas course and Pandas Tutorial: DataFrames in Python.

To see SQL readability in action, let’s suppose that we have a table (a dataframe) called penguins containing various information on penguins (and we will work with such a table later in this tutorial). To extract all the unique species of penguins who are males and who have flippers longer than 210 mm, we would need the following code in pandas:

penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique()

Instead, to get the same information using SQL, we would run the following code:

SELECT DISTINCT species FROM penguins WHERE sex = 'Male' AND flipper_length_mm > 210

The second piece of code, written in SQL, looks almost like a natural English sentence and hence is much more intuitive. We can further increase its readability by spanning it over multiple lines:

SELECT DISTINCT species
  FROM penguins 
 WHERE sex = 'Male' 
   AND flipper_length_mm > 210

Now that we identified the advantages of using SQL for pandas, let’s see how we can technically combine them both.

How to Use pandasql

The pandasql Python library allows querying pandas dataframes by running SQL commands without having to connect to any SQL server. Under the hood, it uses SQLite syntax, automatically detects any pandas dataframe, and treats it as a regular SQL table.

Setting up your environment

First, we need to install pandasql:

pip install pandasql

Then, we import the required packages:

from pandasql import sqldf
import pandas as pd

Above, we directly imported the sqldf function from pandasql, which is virtually the only meaningful function of the library. As its name suggests, it’s applied to query dataframes using SQL syntax. Apart from this function, pandasql comes with two simple built-in datasets that can be loaded using the self-explanatory functions load_births() and load_meat().

pandasql Syntax

The syntax of the sqldf function is very simple:

sqldf(query, env=None)

Here, query is a required parameter that takes in a SQL query as a string, and env—an optional (and rarely useful) parameter that can be either locals() or globals() and allows sqldf to access the corresponding set of variables in your Python environment.

The sqldf function returns the result of a query as a pandas dataframe.

When we can use pandasql

The pandasql library allows working with data using the Data Query Language (DQL), which is one of the subsets of SQL. In other words, with pandasql, we can run queries on the data stored in a database to retrieve the necessary information from it. In particular, we can access, extract, filter, sort, group, join, aggregate the data, and perform mathematical or logical operations on it.

When we cannot use pandasql

pandasql doesn’t allow employing any other subsets of SQL apart from DQL. This means that we can’t apply pandasql to modify (update, truncate, insert, etc.) tables or change (update, delete, or insert) the data in a table.

In addition, since this library is based on SQL syntax, we should beware of the known quirks in SQLite.

Examples of using pandasql

Now, we’ll take a more granular look at how to run SQL queries on pandas dataframes using the sqldf function of pandasql. To have some data to practice on, let’s load one of the built-in datasets of the seaborn library—penguins:

import seaborn as sns
penguins = sns.load_dataset('penguins')
print(penguins.head())

Output:

   species island     bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female

If you need to refresh your SQL skills, our SQL Fundamentals skill track is a good reference point.

Extracting data with pandasql

print(sqldf('''SELECT species, island 
FROM penguins 
LIMIT 5'''))

Output:

  species  island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen
3  Adelie  Torgersen
4  Adelie  Torgersen

Above, we extracted information about the species and geography of the first five penguins from the penguins dataframe. Note that running the sqldf function returns a pandas dataframe:

print(type(sqldf('''SELECT species, island 
                      FROM penguins 
                     LIMIT 5''')))

Output:

<class 'pandas.core.frame.DataFrame'>

In pure pandas, it would be:

print(penguins[['species', 'island']].head())

Output:

   species island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen
3  Adelie  Torgersen
4  Adelie  Torgersen

Another example is extracting unique values from a column:

print(sqldf('''SELECT DISTINCT species 
                 FROM penguins'''))

Output:

     species
0     Adelie
1  Chinstrap
2     Gentoo

In pandas, it would be:

print(penguins['species'].unique())

Output:

['Adelie' 'Chinstrap' 'Gentoo']

Sorting data with pandasql

print(sqldf('''SELECT body_mass_g 
                 FROM penguins 
                ORDER BY body_mass_g DESC 
                LIMIT 5'''))

Output:

   body_mass_g
0       6300.0
1       6050.0
2       6000.0
3       6000.0
4       5950.0

Above, we sorted our penguins by body mass in descending order and displayed the top five values of body mass.

In pandas, it would be:

print(penguins['body_mass_g'].sort_values(ascending=False, 
ignore_index=True).head())

Output:

0    6300.0
1    6050.0
2    6000.0
3    6000.0
4    5950.0
Name: body_mass_g, dtype: float64

Filtering data with pandasql

Let’s try the same example that we mentioned in the chapter Why use SQL in pandas: extracting the unique species of penguins who are males and who have flippers longer than 210 mm:

print(sqldf('''SELECT DISTINCT species
                 FROM penguins 
                WHERE sex = 'Male' 
                  AND flipper_length_mm > 210'''))

Output:

     species
0  Chinstrap
1     Gentoo

Above, we filtered the data based on two conditions: sex = 'Male' and flipper_length_mm > 210.

The same code in pandas would look a bit more overwhelming:

print(penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique())

Output:

['Chinstrap' 'Gentoo']

Grouping and aggregating data with pandasql

Now, let’s apply data grouping and aggregation to find the longest bill for each species in the dataframe:

print(sqldf('''SELECT species, MAX(bill_length_mm)
                 FROM penguins 
                GROUP BY species'''))

Output:

     species  MAX(bill_length_mm)
0     Adelie                 46.0
1  Chinstrap                 58.0
2     Gentoo                 59.6

The same code in pandas:

print(penguins[['species', 'bill_length_mm']].groupby('species', as_index=False).max())

Output:

     species  bill_length_mm
0     Adelie            46.0
1  Chinstrap            58.0
2     Gentoo            59.6

Performing mathematical operations with pandasql

With pandasql, we can easily perform mathematical or logical operations on the data. Let’s imagine that we want to calculate the bill length-to-depth ratio for each penguin and display the top five values of this measurement:

print(sqldf('''SELECT bill_length_mm / bill_depth_mm AS length_to_depth
                 FROM penguins
                ORDER BY length_to_depth DESC
                LIMIT 5'''))

Output:

   length_to_depth
0         3.612676
1         3.510490
2         3.505882
3         3.492424
4         3.458599

Note that this time, we used the alias length_to_depth for the column with the ratio values. Otherwise, we would get a column with a monstrous name bill_length_mm / bill_depth_mm.

In pandas, we would need first to create a new column with the ratio values:

penguins['length_to_depth'] = penguins['bill_length_mm'] / penguins['bill_depth_mm']
print(penguins['length_to_depth'].sort_values(ascending=False, ignore_index=True).head())

Output:

0    3.612676
1    3.510490
2    3.505882
3    3.492424
4    3.458599
Name: length_to_depth, dtype: float64

Conclusion

To wrap up, in this tutorial, we explored why and when we can combine the functionality of SQL for pandas to write better, more efficient code. We discussed how to set up and use the pandasql library for this purpose and what limitations this package has. Finally, we considered numerous popular examples of the practical application of pandasql and, in each case, compared the code with its pandas counterpart in each case.

Now you have everything you need to apply SQL for pandas in real-world projects. A great place for your practice is the DataLab, DataCamp’s AI-enabled data notebook with great SQL support.

Source:
https://www.datacamp.com/tutorial/how-to-use-sql-in-pandas-using-pandasql-queries