The ongoing saga of a slow tempdb on a fast server

Last week, I wrote about a system that should have been fast but was performing very poorly, with the load weighing heavily on tempdb I/O.

Our original changes did not make much of a difference, but now we could rule out errant file growth. Our next step was to investigate how tempdb was being used by the application. There was one obvious culprit.

The warehouse frequently* runs a query to find out how much stock* it has on hand. This query has to be almost instantaneous. In the best-case scenario it runs in seven to ten seconds, but it should be sub-second.

Did you notice the asterisks? Those are red flags: a big query, running frequently, and spilling to disk (which means spilling to tempdb).

Our first rule of performance tuning is to keep everything in memory. Disk is two to three orders of magnitude slower than memory. We need to make this query a thousand times faster.

The query itself is based on a view. Nothing wrong with that in principle, assuming you keep your table joins to a minimum (a rule of thumb is fewer than ten). I found 25 tables, two of which were derived tables.

After rewriting the two derived tables as common table expressions (CTEs) at the top of the query, it was much easier to see what was happening.

Asking For Too Much

The first derived table was a partition query using ROW_NUMBER to find records with the latest date by product type, and then fetching the first row. It makes use of a sort operator, which is a blocking operation. In many cases, sorts will spill to tempdb.
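
A minimal sketch of that pattern, with hypothetical table and column names, looks something like this:

    -- Latest record per product type, via ROW_NUMBER (hypothetical names).
    -- The ORDER BY inside the OVER clause is what introduces the sort operator.
    ;WITH LatestMovement AS
    (
        SELECT sm.ProductID, sm.ProductType, sm.MovementDate,
               ROW_NUMBER() OVER (PARTITION BY sm.ProductType
                                  ORDER BY sm.MovementDate DESC) AS rn
        FROM dbo.StockMovement AS sm
    )
    SELECT ProductID, ProductType, MovementDate
    FROM LatestMovement
    WHERE rn = 1;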

The second derived table was the one doing the actual stock counts. It used a GROUP BY, which requires a sort or hash aggregate operation, either of which can spill to tempdb.
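
The stock-count derived table follows a similar shape, again with hypothetical names:

    -- Stock on hand per product, via GROUP BY (hypothetical names).
    -- The aggregate requires a sort or hash operation, either of which can spill to tempdb.
    SELECT s.ProductID, SUM(s.Quantity) AS QuantityOnHand
    FROM dbo.Stock AS s
    GROUP BY s.ProductID;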

Twenty-three other tables joined in one way or another (mostly LEFT JOINs, some INNER JOINs) to the derived tables. Each of these joins was represented in the query plan as a hash join, which can also spill to tempdb.

(Hash joins aren’t bad in and of themselves. They are an efficient way to join large rowsets, but they often force the data to spill to tempdb.)

Every time this query runs, the second derived table reads 8 million rows (around 360MB of data) before joining to the first derived table.

Predicate Pushdown

Because both derived tables are only looking for a single row of data, based on the product being counted (the predicate), what we want is for the predicate to be pushed down into both derived tables, right at the start of the query, so that any joins on those tables are performed on a single row of data.

This is called predicate pushdown. In our case, the query optimizer could not find the most efficient query plan, and was timing out, so the second derived table never saw the predicate until the end.

Limiting the derived tables to one row also changes the joins in the plan from hash joins to nested loop joins, avoiding tempdb altogether, because nested loops are far more efficient for small rowsets.

The Fix

There is an extremely simple way to get the predicate pushed down to the derived tables, and that is to physically put it there. Unfortunately we cannot specify a value for the product in the view definition.

A stored procedure with a single @ProductID parameter would allow us to add WHERE ProductID = @ProductID to both derived tables, which would make the query really fast. In my testing, it ran in under 100ms.

The problem is that this would require numerous code changes to the existing system. Many of the queries that rely on the view also use additional JOINs and WHERE conditions, meaning that a stored procedure is impractical.

Enter the table-valued user-defined function, or TVF.

I copied the exact SELECT statement from the view definition, making two small changes: I added WHERE ProductID = @ProductID to both derived tables.

Then, using the output columns of the view as the basis for the return table definition, I declared that table at the top of the table-valued function.

So now, whenever we query the TVF, it builds a table definition in memory, runs the original query with its 20-odd joins, but limits the data to a single row, which is then returned in the same way as if we had queried the view.
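
As a simplified sketch of the approach (with hypothetical object and column names; the real function returns every column the view exposes and keeps all of its joins):

    -- A sketch only: hypothetical names, and only two of the 20-odd joins shown.
    CREATE FUNCTION dbo.GetStockOnHand (@ProductID int)
    RETURNS @Result TABLE
    (
        ProductID      int,
        ProductType    int,
        LatestDate     datetime,
        QuantityOnHand decimal(18, 4)
    )
    AS
    BEGIN
        WITH LatestMovement AS
        (
            -- same ROW_NUMBER logic as the view, limited to one product
            SELECT sm.ProductID, sm.ProductType, sm.MovementDate,
                   ROW_NUMBER() OVER (PARTITION BY sm.ProductType
                                      ORDER BY sm.MovementDate DESC) AS rn
            FROM dbo.StockMovement AS sm
            WHERE sm.ProductID = @ProductID
        ),
        StockCount AS
        (
            -- same GROUP BY logic as the view, limited to the same product
            SELECT s.ProductID, SUM(s.Quantity) AS QuantityOnHand
            FROM dbo.Stock AS s
            WHERE s.ProductID = @ProductID
            GROUP BY s.ProductID
        )
        INSERT @Result (ProductID, ProductType, LatestDate, QuantityOnHand)
        SELECT lm.ProductID, lm.ProductType, lm.MovementDate, sc.QuantityOnHand
        FROM LatestMovement AS lm
        INNER JOIN StockCount AS sc ON sc.ProductID = lm.ProductID
        -- ... the remaining joins from the view go here ...
        WHERE lm.rn = 1;

        RETURN;
    END;

Existing queries can then reference dbo.GetStockOnHand in their FROM clause (directly, or via CROSS APPLY when the product comes from another table) and keep their additional JOINs and WHERE clauses intact.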

Outcome

  • Original query with a single ProductID: 46,636 logical reads, 10 seconds.
  • New query with a single ProductID: 35 logical reads, 109 milliseconds.
  • One thousand times faster: check.

98% of I/O is taken up by tempdb

During a recent engagement, the customer explained that the performance on their relatively small system was especially slow on Monday morning, but generally bad that entire day.

The SQL Server 2008 R2 instance, which runs in a virtual machine, has 30GB of RAM allocated, but the SQL Server memory usage never exceeds 16GB.

Digging deeper, we discovered that 98% of all I/O was consumed by tempdb, described by Brent Ozar as the public toilet of SQL Server.
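
One quick way to see this for yourself is to aggregate the virtual file stats by database; a sketch (the counters are cumulative since the last restart):

    -- Total I/O per database since the last restart, to show how much tempdb consumes.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           SUM(vfs.num_of_bytes_read + vfs.num_of_bytes_written) / 1048576 AS total_io_mb,
           SUM(vfs.io_stall) AS total_io_stall_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    GROUP BY vfs.database_id
    ORDER BY total_io_mb DESC;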

The system makes extensive use of temporary tables and table variables in its queries.

The tempdb data file hovers around 2.5GB, and the log file is around 1.5GB. Certainly not a huge database, which made it puzzling that this small company was experiencing performance issues.

It all fits in memory! Which is fast!

We noticed that the SQL Server instance had been restarted the previous Sunday, as part of a maintenance cycle, and that the file allocation for tempdb was set to the defaults:

  • 8MB for data file, 10% growth
  • 1MB for log file, 10% growth

In other words, no matter how big the tempdb database was before the restart, it would always shrink back down to that tiny size, and it would have to zero out the log file every time it grew again.

To get to the size it was today, tempdb had to grow the data file 60 times and the log file 77 times. Each growth event required 10% more space than the previous one, and therefore took longer every time.
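
Those growth counts line up with the current file sizes. A quick back-of-the-envelope check, assuming the 8MB/1MB starting sizes and 10% growth:

    -- Rough check of the growth arithmetic (assumed starting sizes and growth rate):
    SELECT CAST(8 * POWER(1.1e0, 60) AS int) AS data_file_mb,  -- roughly 2,435 MB after 60 growth events
           CAST(1 * POWER(1.1e0, 77) AS int) AS log_file_mb;   -- roughly 1,540 MB after 77 growth events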

Also of note: the instance did not have any global trace flags enabled, in particular TF 1118.

Every half hour, a virtual machine snapshot is taken, freezing the I/O on the server as well.

Our resolution was as follows:

  • Set the Windows power plan to High Performance
  • Increase the number of tempdb data files to match the number of CPU cores
  • Change the default tempdb file sizes to match current usage
  • Change tempdb growth increments to a fixed value appropriate for the I/O subsystem
  • Shrink and resize the tempdb log file to its current usage
  • Enable Trace Flags 1118 and 1117 to reduce allocation contention in the data files
  • Switch from VM snapshots to the Full Recovery Model with transaction log backups
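
For reference, the tempdb changes translate roughly into statements like these (a sketch only: the file sizes, growth increments, and path are assumed, and the actual values depend on current usage and the I/O subsystem):

    -- Resize the existing tempdb files and switch to fixed growth increments.
    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 2560MB, FILEGROWTH = 256MB);
    ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 1536MB, FILEGROWTH = 256MB);

    -- Add data files to match the number of CPU cores (one shown; the path is hypothetical).
    ALTER DATABASE tempdb ADD FILE
        (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf', SIZE = 2560MB, FILEGROWTH = 256MB);

    -- Trace flags 1117 and 1118 should be added as startup parameters (-T1117 -T1118)
    -- so they survive a restart; DBCC TRACEON only applies until the next restart.
    DBCC TRACEON (1117, 1118, -1);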

We will have to wait to find out what impact these changes had on the environment over the course of a week. I’ll report back next time.

Professionalism doesn’t mean a collared shirt and tie

Working from home, consulting with companies all over the world, has changed how I interact with customers. The last time I was physically on site was seven months ago.

We deal almost exclusively with each other via conference call and video using Skype, LogMeIn or GoToMeeting, juggling webcams, headphones, microphones, email, text messages, phone calls, instant messaging, and so on and so forth …

Scott Hanselman wrote on Twitter recently about spending more than 20 minutes of a one-hour meeting getting microphones working for all meeting attendees, and this is in 2016!

Being professional means treating your customers and colleagues with the respect you think you deserve in return.

Put another way, if you treat other people with contempt, you can’t expect to be taken seriously.

Missing meetings, not having your equipment set up correctly, not wearing camera-friendly clothing (or any clothing at all!), having an inappropriate backdrop, or having an inappropriate desktop background if you’re sharing your screen, all amount to contempt.

Take the time to set up your work space correctly by keeping the webcam-visible area behind you friendly to anyone watching you on video.

Learn how to use your webcam, microphone and headphones correctly. If you have to share your computer screen, make sure you have turned off notifications. Even better, keep your shared content on a separate virtual desktop, away from email, web browsers and social media.

Do you use a Mac? Did you know that there’s a way for you to set up your microphone to send clear and crisp audio through Skype or other tools? It’s called Loopback.

All that money you’re saving on gas? Buy a decent condenser microphone, over-ear headphones, and a high-definition webcam. Don’t rely on your laptop’s built-in speakers. You know what microphone feedback sounds like, and wearing headphones is a great way to avoid it.

Don’t pick your nose. Don’t get too close to the camera. Someone might have you on a giant television screen with lots of people in the room. Because you’re not physically in the room, perception is everything. Even I make some of these mistakes, which means I’m also guilty of behaving in an unprofessional manner.

This post is not only to let you know how to behave, but to remind me how I should behave. We’re in this together.

Max Server Memory Matrix

When we perform diagnostics on an existing SQL Server instance, or if we are setting up a new machine, it’s very important to make sure that we change the Max Server Memory setting to an appropriate level.

If we don’t, then there’s a good chance (especially on Enterprise Edition) that SQL Server will use all the available memory on the server. Unfortunately, this means the operating system can be starved of RAM, which creates a performance bottleneck.

Additionally, other components (including other applications, services, and hardware on the machine) may have specific memory requirements, which need to be taken into account.

To get a baseline, I use a formula developed by Jonathan Kehayias (blog | twitter) which he wrote in his book, Troubleshooting SQL Server: A Guide for the Accidental DBA, and repeated in his blog post, “How much memory does my SQL Server actually need?”.

After many months of confusing myself with mental arithmetic, I decided to create a straightforward list based on Jonathan’s algorithm, in 4 GB increments, for my own reference. I would like to share this with you today.
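
As a rough sketch of that baseline (my paraphrase of the guideline: reserve 1 GB for the operating system, 1 GB for every 4 GB of RAM between 4 GB and 16 GB, then 1 GB for every 8 GB above 16 GB; verify against Jonathan's post before relying on it):

    -- Baseline Max Server Memory calculation (paraphrased guideline; verify before use).
    DECLARE @TotalGB int = 64;  -- physical RAM in the server

    DECLARE @ReservedGB int = 1
        + (CASE WHEN @TotalGB > 16 THEN 12
                WHEN @TotalGB > 4  THEN @TotalGB - 4
                ELSE 0 END) / 4
        + (CASE WHEN @TotalGB > 16 THEN @TotalGB - 16 ELSE 0 END) / 8;

    SELECT @TotalGB - @ReservedGB          AS MaxServerMemoryGB,  -- 54 for a 64 GB server
           (@TotalGB - @ReservedGB) * 1024 AS MaxServerMemoryMB;  -- 55,296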

You can bask in the glory that is my Max Server Memory Matrix at https://bornsql.ca/memory/.

Next time you (or I) need to check if the Max Server Memory setting is too high, simply open that link and check the matrix. Over time I will make it look prettier, but for now it is already helping one of my clients.
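
To check or change the setting itself, sp_configure does the job:

    -- Show the current value (in MB); 'show advanced options' must be on to see it.
    EXEC sys.sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sys.sp_configure 'max server memory (MB)';

    -- Example only: cap SQL Server at 54 GB (55,296 MB) on a 64 GB server.
    EXEC sys.sp_configure 'max server memory (MB)', 55296;
    RECONFIGURE;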

Let me know what you think on Twitter at @bornsql.