Indexed View for Aggregating Metrics

Tutorials

Azure

Microsoft Azure SQL is a robust, fully managed database platform designed for high-performance querying, relational data storage, and analytics. For a typical web application with a backend, it is a good choice when we want to consider a managed database that can scale both vertically and horizontally.

An application software generates user metrics on a daily basis, which can be used for reports or analytics. Azure SQL is a great choice to consider for storing and querying this data under certain conditions:

The analytical queries require joins with other tables (applying filters on UI)
You want to combine historical and transactional data
The data volume is not extremely large, and query performance can be managed by tuning

Let’s consider an example of a hotel booking site running Azure SQL in the backend. We want to see a UI dashboard for tracking user activity, such as clicks on the site, visits to the hotel description page, bookings made, etc.

Let’s assume all this telemetry data is dumped for each user on a daily basis in unstructured storage, and we are pulling this data into our database using background jobs, such as Apache Airflow.

Below is the schema for users table and a table to store daily metrics.

MS SQL

CREATE TABLE users (

    id INT AUTO_INCREMENT PRIMARY KEY,

    first_name VARCHAR(100) NOT NULL,       

    last_name VARCHAR(100) NOT NULL,        

    email VARCHAR(255) UNIQUE NOT NULL,     

    password_hash VARCHAR(255) NOT NULL, 

    age INT ,               

    city VARCHAR(100),                      

    country VARCHAR(100),                  

    currency VARCHAR(10),   

    last_login VARCHAR,

    hotel_preference VARCHAR(100)

);

​

CREATE TABLE daily_user_metrics (

    id BIGINT IDENTITY PRIMARY KEY,       -- Unique identifier for the record

    user_id BIGINT NOT NULL,              -- Foreign key to the users table

    clicks INT DEFAULT 0,                 -- Number of total site clicks

    visits INT DEFAULT 0,                 -- Number of visits to the hotel page

    bookings INT DEFAULT 0,               -- Number of bookings

    reviews INT DEFAULT 0,            -- Number of reviews

    cancellations INT DEFAULT 0,          -- Number of cancellations

​

    date_created DATE,            -- Daily metrics for each user

);

​

You can draw many insights from the above tables.

Let’s consider one particular example. We need to aggregate daily activity metrics grouped by date in descending order for customers between the ages of 30 and 40 located in New York City. Below is the query:

MS SQL

SELECT

    date_created,

    SUM(clicks) AS total_clicks,

    SUM(visits) AS total_visits,

    SUM(bookings) AS total_bookings,

    SUM(reviews) AS total_reviews,

    SUM(cancellations) AS total_cancellations,

FROM

    daily_user_metrics m

INNER JOIN users u on m.user_id = u.id

WHERE

   u.age BETWEEN 30 and 40

   and u.city ='New York'   

   and m.date_created BETWEEN :startDate and :endDate

GROUP BY

    date_created 

ORDER BY

    date_created DESC

Now, we can analyze the daily trends from this group of users who are in New York and between the ages of 30 and 40. The table is mostly performant, and we are able to easily perform range queries spread across multiple months. Eventually, our requirements grow. We now want to track user behavior in a weekly or monthly range. But our table stores the data on a daily basis. We now have two options:

Query the table and group the date_created weekly or monthly, depending on the ask.
Create a couple of views that aggregate the data on a weekly or monthly basis per user. See the query below:

MS SQL

CREATE VIEW weekly_user_metrics AS

SELECT

    DATEADD(DAY, -(DATEPART(WEEKDAY, date) - 1), date) AS week_start, -- Start of the week (Sunday)

    SUM(clicks) AS total_clicks,

    SUM(visits) AS total_visits,

    SUM(bookings) AS total_bookings,

    SUM(reviews) AS total_reviews,

    SUM(cancellations) AS total_cancellations,

FROM

    daily_user_metrics m

INNER JOIN users u on m.user_id = u.id

WHERE

   u.age BETWEEN 30 and 40

   and u.city ='New York'   

   and m.date_created BETWEEN :startDate and :endDate    

GROUP BY

    DATEADD(DAY, -(DATEPART(WEEKDAY, date) - 1), date) -- Group by week start

ORDER BY

    DATEADD(DAY, -(DATEPART(WEEKDAY, date) - 1), date) DESC -- Sort by latest week

However, one important thing to consider is that views just provide an abstraction to the underlying query which simply queries the underlying table. Materialized Views are the next thought that comes to mind. However, they need to be refreshed manually or on a schedule, due to which real-time data is not available.

To address these issues, Azure SQL Server offers a great feature known as Indexed View. An Indexed View is a physical representation of a view stored in the database with a unique clustered index. Changes to the underlying tables automatically update the indexed view to keep it in sync. It uses a clustered index that organizes the data in the view based on the order of the index keys.

The indexed view is ideal for scenarios where we need real-time data, and our query involves complex multi-table joins. It is also suitable for our use case where existing data is rarely updated but queried often, and we have range-based queries and want to do ordered retrieval.

There are some things to consider before deciding whether you want to go for indexed views. Indexed views cannot have non-deterministic functions. A non–deterministic function is a function that does not always return the same result for the same input, even when executed with identical arguments and under the same database conditions. Also, an indexed view is an actual structure that requires storage, similar to a Materialized View.

The syntax for creating an Indexed View is similar to the View creation query above. However, we cannot have non-deterministic functions while creating an indexed view. The line DATEADD(DAY, -(DATEPART(WEEKDAY, date) - 1), date) AS week_start in the view query above depends on the session-specific SET DATEFIRST setting, which determines the first day of the week. This is considered non-deterministic as it will produce different results for different conditions.

Keeping the above things in mind, we can proceed to eliminate the non-deterministic computation by making the column deterministic. We add a week_start column to the underlying table and precompute and fill the week_start value in the table for daily data pulls. So, the rows with dates D1 through D7 belong to W1, D8 to D14 belong to W2, and so on.

Now, we can proceed to create an indexed view with the SQL below.

MS SQL

ALTER TABLE daily_user_metrics ADD week_start DATE;-- Populate this column with first day of the week going forward

​

CREATE VIEW dbo.weekly_user_metric_aggregations_view

WITH SCHEMABINDING

AS

SELECT

    user_id,

    week_start,

    SUM(clicks) AS total_clicks,

    SUM(visits) AS total_visits,

    SUM(bookings) AS total_bookings,

    SUM(reviews) AS total_reviews,

    SUM(cancellations) AS total_cancellations,

    COUNT_BIG(*) AS row_count --SQL Server requires COUNT_BIG(*) in indexed views to handle scenarios where the count exceeds the range of an INT data type.

FROM

    dbo.daily_user_metrics

GROUP BY

    user_id,

    week_start;

CREATE UNIQUE CLUSTERED INDEX IX_weekly_user_metric_aggregations_view

ON dbo.weekly_user_metric_aggregations_view (user_id, week_start);

After this indexed view is created, we can query it as follows:

MS SQL

SELECT

    week_start,

    SUM(total_clicks) AS total_clicks,

    SUM(total_visits) AS total_visits,

    SUM(total_bookings) AS total_bookings,

    SUM(total_reviews) AS total_reviews,

    SUM(total_cancellations) AS total_cancellations,

FROM

    weekly_user_metric_aggregations_view mv

INNER JOIN users u on mv.user_id = u.id

WHERE

   u.age BETWEEN 30 and 40

   and u.city ='New York'   

   and m.date_created BETWEEN :startDate and :endDate

GROUP BY

    week_created 

ORDER BY

    week_created DESC

Conclusion

An indexed view will have a significantly faster query time than a basic view or querying the table directly for large datasets where data has grown up to a million rows or more. The end user will have a low latency experience, and queries to the database will be optimized. Since we aggregated a week’s worth of data in 1 row, we have cut the aggregation time taken to 1/7th.

Source:
https://dzone.com/articles/indexed-view-for-aggregating-metrics