How to Use SQL in pandas Using pandasql Queries

SQL, or Structured Query Language, is a programming language used to access, extract, wrangle, and explore data stored in relational databases. pandas is a Python open-source library specifically designed for data manipulation and analysis.

In this tutorial, we’re going to discuss when and how we can (and when we cannot) use SQL functionality in the framework of pandas. In addition, we’ll take a look at various examples of implementing this approach and compare the results with the equivalent code in pure pandas.

Why Use SQL in pandas?

Given the definitions in the introduction, why should one want to use SQL combined with pandas when the latter is an all-inclusive package for data analysis?

The answer is that on some occasions, especially for complex programs, SQL queries look much more straightforward and easy to read than the corresponding code in pandas. This is particularly true for those people who initially used SQL to work with data and then later learned pandas.

If you need more training on pandas, you can check out our Data Manipulation with pandas course and Pandas Tutorial: DataFrames in Python.

To see SQL readability in action, let’s suppose that we have a table (a dataframe) called penguins containing various information on penguins (and we will work with such a table later in this tutorial). To extract all the unique species of penguins who are males and who have flippers longer than 210 mm, we would need the following code in pandas:

penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique()

Instead, to get the same information using SQL, we would run the following code:

SELECT DISTINCT species FROM penguins WHERE sex = 'Male' AND flipper_length_mm > 210

The second piece of code, written in SQL, looks almost like a natural English sentence and hence is much more intuitive. We can further increase its readability by spanning it over multiple lines:

SELECT DISTINCT species
  FROM penguins
 WHERE sex = 'Male'
   AND flipper_length_mm > 210

Now that we’ve identified the advantages of using SQL with pandas, let’s see how to combine the two in practice.

How to Use pandasql

The pandasql Python library allows querying pandas dataframes by running SQL commands without having to connect to any SQL server. Under the hood, it uses SQLite syntax, automatically detects any pandas dataframe, and treats it as a regular SQL table.
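To make the "under the hood" idea more concrete, here is a minimal sketch of the general approach: copy a dataframe into an in-memory SQLite database, run the query there, and read the result back as a dataframe. It uses only pandas and the standard-library sqlite3 module. It illustrates the mechanism and is not pandasql's actual implementation. The helper name run_sql is hypothetical, and the sample rows are the first five rows of the penguins dataset printed later in this tutorial.

```python
import sqlite3

import pandas as pd


def run_sql(query, **tables):
    """Run a SQLite query against dataframes passed as keyword arguments.

    Hypothetical helper for illustration only: pandasql's sqldf finds the
    dataframes for you instead of taking them as arguments.
    """
    conn = sqlite3.connect(":memory:")  # temporary database, discarded on close
    try:
        for name, df in tables.items():
            # Copy each dataframe into a SQLite table with the same name.
            df.to_sql(name, conn, index=False)
        # Read the query result back as a pandas dataframe.
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


# First five rows of the penguins dataset (a subset of the columns).
penguins_sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "body_mass_g": [3750.0, 3800.0, 3250.0, None, 3450.0],
})

result = run_sql(
    "SELECT species, body_mass_g FROM penguins WHERE body_mass_g > 3500",
    penguins=penguins_sample,
)
print(result)
```

The point of the sketch is that the original dataframe is only copied, never changed, and that whatever SQLite accepts as a query is what we can run.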

Setting up your environment

First, we need to install pandasql:

pip install pandasql

Then, we import the required packages:

from pandasql import sqldf
import pandas as pd

Above, we directly imported the sqldf function from pandasql, which is virtually the only meaningful function of the library. As its name suggests, it’s applied to query dataframes using SQL syntax. Apart from this function, pandasql comes with two simple built-in datasets that can be loaded using the self-explanatory functions load_births() and load_meat().

pandasql Syntax

The syntax of the sqldf function is very simple:

sqldf(query, env=None)

Here, query is a required parameter that takes a SQL query as a string, and env is an optional (and rarely needed) parameter that can be either locals() or globals(). It lets sqldf access the corresponding set of variables in your Python environment.

The sqldf function returns the result of a query as a pandas dataframe.

When we can use pandasql

The pandasql library allows working with data using the Data Query Language (DQL), which is one of the subsets of SQL. In other words, with pandasql, we can run queries on the data stored in our dataframes to retrieve the necessary information from it. In particular, we can access, extract, filter, sort, group, join, and aggregate the data, and perform mathematical or logical operations on it.

When we cannot use pandasql

pandasql doesn’t allow employing any other subsets of SQL apart from DQL. This means that we can’t apply pandasql to modify (update, truncate, insert, etc.) tables or change (update, delete, or insert) the data in a table.

In addition, since this library is based on SQLite syntax, we should be aware of the known quirks in SQLite.
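As an illustration of what such quirks look like, the sketch below shows two documented SQLite behaviors using the standard-library sqlite3 module directly (no pandasql involved). The table t is made up for the example, and the penguins rows are taken from the first five rows of the dataset used later in this tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Quirk 1: flexible typing. A column declared INTEGER converts text that
# looks like a number, but still accepts text that does not.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [("123",), ("abc",)])
types = [row[0] for row in conn.execute("SELECT typeof(x) FROM t ORDER BY rowid")]
print(types)

# Quirk 2: a "bare" column (here, sex) may appear next to an aggregate
# without being listed in GROUP BY. With MAX(), SQLite takes its value
# from the row that holds the maximum.
conn.execute("CREATE TABLE penguins (species TEXT, sex TEXT, bill_length_mm REAL)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?, ?)",
    [
        ("Adelie", "Male", 39.1),
        ("Adelie", "Female", 39.5),
        ("Adelie", "Female", 40.3),
        ("Adelie", None, None),
        ("Adelie", "Female", 36.7),
    ],
)
row = conn.execute(
    "SELECT species, sex, MAX(bill_length_mm) FROM penguins GROUP BY species"
).fetchone()
print(row)

conn.close()
```

The first print shows ['integer', 'text'], and the second shows ('Adelie', 'Female', 40.3). Many other database engines would reject both statements, so queries written for pandasql may need adjusting before they are reused elsewhere.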

Examples of using pandasql

Now, we’ll take a more granular look at how to run SQL queries on pandas dataframes using the sqldf function of pandasql. To have some data to practice on, let’s load one of the built-in datasets of the seaborn library—penguins:

import seaborn as sns
penguins = sns.load_dataset('penguins')
print(penguins.head())

Output:

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0
1  Adelie  Torgersen            39.5           17.4              186.0
2  Adelie  Torgersen            40.3           18.0              195.0
3  Adelie  Torgersen             NaN            NaN                NaN
4  Adelie  Torgersen            36.7           19.3              193.0

   body_mass_g     sex
0       3750.0    Male
1       3800.0  Female
2       3250.0  Female
3          NaN     NaN
4       3450.0  Female

If you need to refresh your SQL skills, our SQL Fundamentals skill track is a good reference point.

Extracting data with pandasql

print(sqldf('''SELECT species, island FROM penguins LIMIT 5'''))

Output:

  species     island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen
3  Adelie  Torgersen
4  Adelie  Torgersen

Above, we extracted information about the species and geography of the first five penguins from the penguins dataframe. Note that running the sqldf function returns a pandas dataframe:

print(type(sqldf('''SELECT species, island FROM penguins LIMIT 5''')))

Output:

<class 'pandas.core.frame.DataFrame'>

In pure pandas, it would be:

print(penguins[['species', 'island']].head())

Output:

  species     island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen
3  Adelie  Torgersen
4  Adelie  Torgersen

Another example is extracting unique values from a column:

print(sqldf('''SELECT DISTINCT species FROM penguins'''))

Output:

     species
0     Adelie
1  Chinstrap
2     Gentoo

In pandas, it would be:

print(penguins['species'].unique())

Output:

['Adelie' 'Chinstrap' 'Gentoo']

Sorting data with pandasql

print(sqldf('''SELECT body_mass_g FROM penguins ORDER BY body_mass_g DESC LIMIT 5'''))

Output:

   body_mass_g
0       6300.0
1       6050.0
2       6000.0
3       6000.0
4       5950.0

Above, we sorted our penguins by body mass in descending order and displayed the top five values of body mass.

In pandas, it would be:

print(penguins['body_mass_g'].sort_values(ascending=False, ignore_index=True).head())

Output:

0    6300.0
1    6050.0
2    6000.0
3    6000.0
4    5950.0
Name: body_mass_g, dtype: float64
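The two approaches agree here because we sorted in descending order. In ascending order, they can differ when the column has missing values: SQLite treats NULL as smaller than any other value, so missing values come first, while pandas sort_values puts NaN last by default. The sketch below shows this with the body masses from the first five rows of the dataset. It queries SQLite directly through sqlite3 as a stand-in for pandasql, on the assumption that pandasql passes the query to SQLite as described earlier.

```python
import sqlite3

import pandas as pd

# Body masses from the first five rows of the penguins dataset.
penguins_sample = pd.DataFrame(
    {"body_mass_g": [3750.0, 3800.0, 3250.0, None, 3450.0]}
)

# SQL side: copy the dataframe into SQLite and sort in ascending order.
conn = sqlite3.connect(":memory:")
penguins_sample.to_sql("penguins", conn, index=False)
sql_sorted = pd.read_sql_query(
    "SELECT body_mass_g FROM penguins ORDER BY body_mass_g", conn
)
conn.close()

# pandas side: the same ascending sort.
pandas_sorted = penguins_sample["body_mass_g"].sort_values(ignore_index=True)

print(sql_sorted["body_mass_g"].tolist())  # missing value comes first
print(pandas_sorted.tolist())              # missing value comes last
```

To line the two up, we can either pass na_position='first' to sort_values or write ORDER BY body_mass_g IS NULL, body_mass_g in the query.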

Filtering data with pandasql

Let’s try the same example that we mentioned in the section Why Use SQL in pandas?: extracting the unique species of male penguins with flippers longer than 210 mm:

print(sqldf('''SELECT DISTINCT species
                 FROM penguins
                WHERE sex = 'Male'
                  AND flipper_length_mm > 210'''))

Output:

     species
0  Chinstrap
1     Gentoo

Above, we filtered the data based on two conditions: sex = 'Male' and flipper_length_mm > 210.

The same code in pandas would look a bit more overwhelming:

print(penguins[(penguins['sex'] == 'Male') & (penguins['flipper_length_mm'] > 210)]['species'].unique())

Output:

['Chinstrap' 'Gentoo']
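For comparison, pandas also offers the DataFrame.query method, which takes the filter as a string and may read closer to the SQL WHERE clause than the boolean-mask version. The sketch below checks that both give the same result. The four rows in penguins_demo are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Illustrative rows only (not taken from the real penguins dataset).
penguins_demo = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "sex": ["Male", "Male", "Female", "Male"],
    "flipper_length_mm": [190.0, 212.0, 215.0, 230.0],
})

# Boolean-mask version, as used in this tutorial.
mask_result = penguins_demo[
    (penguins_demo["sex"] == "Male") & (penguins_demo["flipper_length_mm"] > 210)
]["species"].unique()

# The same filter written with DataFrame.query.
query_result = penguins_demo.query(
    "sex == 'Male' and flipper_length_mm > 210"
)["species"].unique()

print(list(mask_result))
print(list(query_result))
```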

Grouping and aggregating data with pandasql

Now, let’s apply data grouping and aggregation to find the longest bill for each species in the dataframe:

print(sqldf('''SELECT species, MAX(bill_length_mm) FROM penguins GROUP BY species'''))

Output:

     species  MAX(bill_length_mm)
0     Adelie                 46.0
1  Chinstrap                 58.0
2     Gentoo                 59.6

The same code in pandas:

print(penguins[['species', 'bill_length_mm']].groupby('species', as_index=False).max())

Output:

     species  bill_length_mm
0     Adelie            46.0
1  Chinstrap            58.0
2     Gentoo            59.6
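Note that the two results name the aggregated column differently. If we want a descriptive name on both sides, we can add an alias in SQL (MAX(bill_length_mm) AS max_bill_length_mm) and use named aggregation in pandas. The sketch below shows the pandas side. The column name max_bill_length_mm is our own choice, and the rows are illustrative: only the three maximum values match the output above.

```python
import pandas as pd

# Illustrative rows; 46.0, 58.0, and 59.6 are the maxima shown above,
# the remaining values are made up.
penguins_demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 46.0, 58.0, 45.2, 59.6],
})

# Named aggregation: new_column_name=(source_column, aggregation).
longest_bills = penguins_demo.groupby("species", as_index=False).agg(
    max_bill_length_mm=("bill_length_mm", "max")
)
print(longest_bills)
```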

Performing mathematical operations with pandasql

With pandasql, we can easily perform mathematical or logical operations on the data. Let’s imagine that we want to calculate the bill length-to-depth ratio for each penguin and display the top five values of this measurement:

print(sqldf('''SELECT bill_length_mm / bill_depth_mm AS length_to_depth
                 FROM penguins
                ORDER BY length_to_depth DESC
                LIMIT 5'''))

Output:

   length_to_depth
0         3.612676
1         3.510490
2         3.505882
3         3.492424
4         3.458599

Note that this time, we used the alias length_to_depth for the column with the ratio values. Otherwise, we would get a column with a monstrous name bill_length_mm / bill_depth_mm.

In pandas, one way to do this is to first create a new column with the ratio values:

penguins['length_to_depth'] = penguins['bill_length_mm'] / penguins['bill_depth_mm']
print(penguins['length_to_depth'].sort_values(ascending=False, ignore_index=True).head())

Output:

0    3.612676
1    3.510490
2    3.505882
3    3.492424
4    3.458599
Name: length_to_depth, dtype: float64
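Unlike the SQL query, the pandas code above adds a column to the penguins dataframe. If we would rather leave the dataframe unchanged, we can compute the ratio as a standalone Series and name it with rename. The sketch below uses the bill measurements from the first five rows of the dataset, so its values differ from the full-dataset output above.

```python
import pandas as pd

# Bill measurements from the first five rows of the penguins dataset.
penguins_sample = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, None, 36.7],
    "bill_depth_mm": [18.7, 17.4, 18.0, None, 19.3],
})

# Compute, name, and sort the ratio without adding a column to the dataframe.
top_ratios = (
    (penguins_sample["bill_length_mm"] / penguins_sample["bill_depth_mm"])
    .rename("length_to_depth")
    .sort_values(ascending=False, ignore_index=True)
    .head()
)
print(top_ratios)
```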

Conclusion

To wrap up, in this tutorial we explored why and when it makes sense to combine SQL with pandas to write more readable code. We discussed how to set up and use the pandasql library for this purpose and what limitations this package has. Finally, we walked through several common examples of pandasql in practice and, in each case, compared the code with its pandas counterpart.

Now you have everything you need to apply SQL for pandas in real-world projects. A great place for your practice is the DataLab, DataCamp’s AI-enabled data notebook with great SQL support.

Source:
https://www.datacamp.com/tutorial/how-to-use-sql-in-pandas-using-pandasql-queries