T-SQL Tuesday Retrospective #015: Aggregations

In early 2011 Jes Borland invited us to write about aggregations:

I want to hear how you solved business problems with aggregate functions. I want to see your cool T-SQL tricks. How are aggregates used in SSRS, SSAS, or SSIS? What have you learned about aggregate functions?

Ten years of hindsight (and being able to read the wrap-up post with all the responses) gives me an advantage in this retrospective, I admit, but I didn’t find the thing I was going to write about anyway even though one or two people had a similar idea. And that, dear reader, means that I can write about one of my favourite performance secret weapons: the indexed view. It’s essentially a regular view with an index (or indexes) attached to it. Oracle calls them materialized views. Unlike a regular view which is simply a query definition, the indexed view persists the results, making it a lot more efficient to query that data:

Creating a unique clustered index on a view improves query performance because the view is stored in the database in the same way a table with a clustered index is stored. The query optimizer may use indexed views to speed up the query execution.

Querying the same data every time you want to write a report — especially against an OLTP (online transactional processing) workload — means the database ends up putting a lot of stuff into memory in order to run a SUM or AVG, and that’s a lot of effort if you’re just looking for the summary.

Indexed views aren’t free. Every time your underlying table (or tables) is modified, the indexed view also needs to be updated — yes, just like a regular index — and there is overhead attached to that. In other words, be selective (pun intended) in your indexed views. In the vast majority of cases, the performance overhead of maintaining indexes is offset by the reduced load on the server when you query that aggregated data.

Let’s say that you have a sales report that a lot of folks access every few minutes. You could write an aggregated table out to a reporting database once a day and query that, but as the day goes on the data becomes stale. Your colleagues want to have something up to the minute in order to do more accurate forecasting. Enter the indexed view.

In a contrived example, imagine you have ten million rows in your sales history table for the last five years, and these rows take up about 100,000 data pages in the buffer pool. That’s 100,000 x 8,192 bytes per page, or a total of 800 MB.

Now imagine you create an indexed view that aggregates that data across a few measures. Let’s say you roll up the data into a row for each day, with columns representing those measures per day. That means you only have 365 rows x 5 years (fewer than 2000 rows in total) which might possibly take up 10 or 20 data pages, or less than 200 KB. That’s an up-to-date real time aggregation that takes up only 1/400 of the resources of the original table.

To create an indexed view, you need to be sure you follow the requirements. SQL Server doesn’t let you index any old view, and it requires something called schema binding for the underlying table. In other words, for the view to persist data it must be bound to the table it relies on, which means that you can’t modify that table without dropping the indexed view first.

Non-deterministic functions and imprecise data types may affect your ability to create an indexed view. Imprecise data types include floating point values on certain CPU architecture. Non-deterministic functions would include those which return a different value each time when using the same input parameters, for example, GETDATE() or SYSDATETIME().

For the purposes of this post though — which is specifically about aggregations — this is where indexed views shine. When using a GROUP BY clause — which is implied by the idea of aggregation — you need a “count” column (COUNT_BIG(*)), which counts the rows affected in the aggregation.

Then you can go wild with SUM, AVG, GROUP BY, even the window function aggregations that were introduced in SQL Server 2012 and beyond, provided of course they are deterministic (i.e., the result is predictable based on the input parameters of those functions).

Once your view is created with schema binding enabled, you can then create a clustered index and zero or more non-clustered indexes on that view, and the database engine maintains the data inside those indexes for you whenever a value is modified in the underlying table. You now have a summarized subset of the big table which you can query.

There’s one small caveat if you use SQL Server Standard Edition, and that is you need to use the query hint WITH NOEXPAND. If you neglect this step and select from the indexed view without specifying the hint, the query optimizer may just execute the query definition and run the full query against the bigger table. You don’t want that, so always be sure you have the right hint.

Share your indexed view tips in the comments below.