Saturday, January 27, 2024

What is Window Functions in SQL? Rank, Row_Number, and Dense_Rank? How to use it and Performance Analysis

T-SQL (Transact-SQL) window functions are a category of functions in Microsoft SQL Server that perform calculations across a specified range of rows related to the current row within a result set. These functions operate on a set of rows that are defined by an OVER clause in the query, allowing for more advanced and flexible analytical processing compared to traditional aggregate functions.

Window functions are powerful for analytical and reporting tasks where you need to perform calculations over a specific range of rows rather than the entire result set. The OVER clause defines the window, specifying the partition, ordering, and framing of the rows involved in the calculation. These functions are widely used in scenarios such as ranking, pagination, and trend analysis in SQL queries.

Here is a list of commonly used T-SQL window functions:

ROW_NUMBER(): Assigns a unique sequential integer to each row within a partition of the result set.

RANK(): Assigns a rank to each distinct row within a partition of the result set, with ties receiving the same rank and leaving gaps in the ranking sequence.

DENSE_RANK(): Similar to RANK(), but without gaps in the ranking sequence. Ties receive the same rank, and the next rank is not skipped.

NTILE(n): Divides the result set into "n" number of roughly equal parts, assigning each row a bucket number.

LEAD() and LAG(): Access data from subsequent or preceding rows within the result set, respectively. These functions are useful for comparing values across different rows.

SUM(), AVG(), MIN(), MAX() OVER(): These functions allow you to calculate cumulative or running aggregates over a specified window of rows.

FIRST_VALUE() and LAST_VALUE(): Retrieve the first or last value within a window of rows, respectively.

PERCENTILE_CONT() and PERCENTILE_DISC(): Calculate the percentile value within a specified range of rows.


T-SQL Window functions, first introduced in 2005 and improved in 2012, are fantastic improvements to the T-SQL language. We will outline everything you need to know to obtain good results while utilizing these functions in this post.

What is Window function in SQL? Rank, Row_Number, and Dense_Rank?



The OVER clause and sorting

PARTITION BY and ORDER BY are two choices in the OVER clause that might induce sorting. All window functions allow PARTITION BY. However, it's not required. For the majority of the functions, ORDER BY is necessary. 

The data will be sorted based on the OVER clause, which might be the query's speed bottleneck, depending on what you attempt to do.

The OVER clause's Sequence BY option is necessary so that the database engine can "align up" the rows so that the function may be applied in the right order. Let's imagine you wish to use the ROW NUMBER function in the order of SalesOrderID.

If you want the function to be applied in decreasing the order of TotalDue, the results will be different.

Let's see what the T-SQL query would look like in this scenario : 

USE AdventureWorks<version>; 
GO
SELECT SalesOrderID, 
	TotalDue, 
	ROW_NUMBER() OVER(ORDER BY SalesOrderID) AS RowNum
FROM Sales.SalesOrderHeader;
 
SELECT SalesOrderID, 
	TotalDue, 
	ROW_NUMBER() OVER(ORDER BY TotalDue DESC) AS RowNum
FROM Sales.SalesOrderHeader;


The results from the above queries would look like the below : 

What are T-SQL window functions? How to use it and Pefromance Analysis



how to use window function in SQL



No sorting is required because the first query uses the cluster key as the ORDER BY option.





The second query includes a time-consuming sort procedure.




The ORDER BY phrase in the OVER clause is unrelated to the ORDER BY clause in the overall query, which may be quite different.


The only method to avoid sorting's speed hit is constructing a dedicated index for the OVER clause. Itzik's book Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions explains how to use window functions in high-performance T-SQL. The POC index is recommended by Ben-Gan. (P)PARTITION BY, (O)ORDER BY, and (c)covering are the acronyms for (P)PARTITION BY, (O)ORDER BY, and (c)covering. 

He suggests placing any filtering columns in the key before the PARTITION BY and ORDER BY columns. Then, as included columns, add any extra columns required to construct a covering index. You'll need to test how such an index affects your query and total workload, just like everything else.

Of course, you won't be able to add an index to every query you create, but if the efficiency of a query that employs a window function is critical, you may follow this suggestion.

Here's a list of indexes that will help you enhance your previous query:

CREATE NONCLUSTERED INDEX test ON Sales.SalesOrderHeader
(CustomerID, SalesOrderID) 
INCLUDE (TotalDue);


When you restart the query, you'll see that the sort operation is no longer included in the execution plan:





FRAMING


Framing is, in my opinion, the hardest topic to grasp while studying T-SQL window methods. See the introduction to T-SQL window functions for further information on the syntax. The following items require framing:

ORDER BY window aggregates, such as FIRST VALUE LAST VALUE, are used for running totals or moving averages.

Fortunately, framing isn't necessary most of the time, but it's all too simple to skip the frame and go with the default. RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is always the default frame. Speed will suffer while you will obtain the right results as long as the ORDER BY option contains a unique column or collection of columns.

Here's an example of a comparison between the default and proper frames:

SET STATISTICS IO ON;
GO
SELECT CustomerID, 
	SalesOrderID, 
	TotalDue, 
	SUM(TotalDue) OVER(PARTITION BY CustomerID ORDER BY SalesOrderID)
          AS RunningTotal 
FROM Sales.SalesOrderHeader;
 
SELECT CustomerID, 
	SalesOrderID, 
	TotalDue, 
	SUM(TotalDue) OVER(PARTITION BY CustomerID ORDER BY SalesOrderID
	   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
           AS RunningTotal 
FROM Sales.SalesOrderHeader;


Do let me know the output of the above query. You guys can compare that in the comment section and let me know.

The outcomes are the same, but the execution is vastly different. Unfortunately, the implementation plan does not give you the truth in this scenario. Each query consumed 50% of the resources, according to the report.

The I/O operations on the first query will be in thousands, while in the second query, they would be 0! Yes, you heard that right!


Conclusion


T-SQL window functions have been touted as high-performance options. They make creating inquiries easier in my perspective, but you must understand them well to achieve high performance. Although indexing can help, you can't generate an index for every query you write. Framing is a difficult concept to grasp, but it is critical if you need to scale up to enormous tables.

No comments:

Post a Comment