Indexed View for Aggregating Metrics

Microsoft Azure SQL is a robust, fully managed database platform designed for high-performance querying, relational data storage, and analytics. For a typical web application with a backend, it is a good choice when we want to consider a managed database that can scale both vertically and horizontally.

An application software generates user metrics on a daily basis, which can be used for reports or analytics. Azure SQL is a great choice to consider for storing and querying this data under certain conditions:

  1. The analytical queries require joins with other tables (applying filters on UI)
  2. You want to combine historical and transactional data
  3. The data volume is not extremely large, and query performance can be managed by tuning

Let’s consider an example of a hotel booking site running Azure SQL in the backend. We want to see a UI dashboard for tracking user activity, such as clicks on the site, visits to the hotel description page, bookings made, etc.

Let’s assume all this telemetry data is dumped for each user on a daily basis in unstructured storage, and we are pulling this data into our database using background jobs, such as Apache Airflow.

Below is the schema for users table and a table to store daily metrics.

MS SQL

 

You can draw many insights from the above tables.

Let’s consider one particular example. We need to aggregate daily activity metrics grouped by date in descending order for customers between the ages of 30 and 40 located in New York City. Below is the query:

MS SQL

 

Now, we can analyze the daily trends from this group of users who are in New York and between the ages of 30 and 40. The table is mostly performant, and we are able to easily perform range queries spread across multiple months. Eventually, our requirements grow. We now want to track user behavior in a weekly or monthly range. But our table stores the data on a daily basis. We now have two options:

  • Query the table and group the date_created weekly or monthly, depending on the ask.
  •  Create a couple of views that aggregate the data on a weekly or monthly basis per user. See the query below:
MS SQL

 

However, one important thing to consider is that views just provide an abstraction to the underlying query which simply queries the underlying table. Materialized Views are the next thought that comes to mind. However, they need to be refreshed manually or on a schedule, due to which real-time data is not available.

To address these issues, Azure SQL Server offers a great feature known as Indexed View. An Indexed View is a physical representation of a view stored in the database with a unique clustered index. Changes to the underlying tables automatically update the indexed view to keep it in sync. It uses a clustered index that organizes the data in the view based on the order of the index keys.

The indexed view is ideal for scenarios where we need real-time data, and our query involves complex multi-table joins. It is also suitable for our use case where existing data is rarely updated but queried often, and we have range-based queries and want to do ordered retrieval.

There are some things to consider before deciding whether you want to go for indexed views. Indexed views cannot have non-deterministic functions. A nondeterministic function is a function that does not always return the same result for the same input, even when executed with identical arguments and under the same database conditions. Also, an indexed view is an actual structure that requires storage, similar to a Materialized View.

The syntax for creating an Indexed View is similar to the View creation query above. However, we cannot have non-deterministic functions while creating an indexed view. The line DATEADD(DAY, -(DATEPART(WEEKDAY, date) - 1), date) AS week_start in the view query above depends on the session-specific SET DATEFIRST setting, which determines the first day of the week. This is considered non-deterministic as it will produce different results for different conditions.

Keeping the above things in mind, we can proceed to eliminate the non-deterministic computation by making the column deterministic. We add a week_start column to the underlying table and precompute and fill the week_start value in the table for daily data pulls. So, the rows with dates D1 through D7 belong to W1, D8 to D14 belong to W2, and so on. 

Now, we can proceed to create an indexed view with the SQL below.

MS SQL

 

After this indexed view is created, we can query it as follows:

MS SQL

 

Conclusion

An indexed view will have a significantly faster query time than a basic view or querying the table directly for large datasets where data has grown up to a million rows or more. The end user will have a low latency experience, and queries to the database will be optimized. Since we aggregated a week’s worth of data in 1 row, we have cut the aggregation time taken to 1/7th.

Source:
https://dzone.com/articles/indexed-view-for-aggregating-metrics