Art of the DBA


Query Plan Analysis with #PowerShell

A couple years ago I wrote a blog post on shredding query plans with XML. At the time, it was just an experiment: I investigated how to parse a query plan’s XML without any real goal or purpose other than “what can I do with this?” Then I left it alone and didn’t come back to it.

At least not until recently. Fast forward a few years to when I was trying to tune a large SQL batch process that had a loop. It wasn’t a very pretty process, and while it worked (mostly), it could definitely benefit from some performance love. The trick was quantifying the changes I made within it. A common way to do this is to check logical reads: the lower your logical reads, the less “work” is typically done by your query. The challenge was capturing all the executions within the loop and summing all the logical reads across a single batch execution.

At first I went to Plan Explorer. This is a great tool with a lot of information, but what it was missing was the ability to sum up all my logical reads across the entire batch execution. I could look at each individual query, but adding these values up by hand was going to be tedious and painful. Two things I hate.

At this point, I figured why not give PowerShell a shot? After all, I knew that the query plan was an XML doc and I could easily traverse that using the XML functionality built into the language. That combined with a little XQuery (which I’m terrible at, by the way) should solve my problem.

Armed with this knowledge, I charged ahead. Everything worked more or less as expected, but the one piece I missed from my previous blog post was using the XML namespace. See, you need the namespace so the XML pieces in PowerShell know what to query. I floundered with this for a bit until I found Jonathan Kehayias’ (@SQLPoolBoy) post on sanitizing query plans.

Once you have the namespace set, the rest becomes easy. To go with my previous example, the following statements allowed me to sum and compare logical reads across all statements executed in the batch:
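The original snippets were embedded separately, but a minimal sketch of the approach looks something like this. The file path is illustrative, and the ActualLogicalReads attribute is only recorded in actual execution plans on more recent versions of SQL Server:

```powershell
# Load an actual execution plan captured from the batch (path is illustrative)
[xml]$plan = Get-Content 'C:\Temp\BatchPlan.sqlplan'

# Register the showplan namespace so our XPath queries resolve
$ns = New-Object System.Xml.XmlNamespaceManager($plan.NameTable)
$ns.AddNamespace('sm', 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')

# Sum logical reads across every statement executed in the batch
$plan.SelectNodes('//sm:RunTimeCountersPerThread', $ns) |
    Measure-Object -Property ActualLogicalReads -Sum
```

Run this against a plan captured before and after your changes and you can compare the totals directly.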

Once the pattern is down, the use is pretty straightforward. There are also more options available to you. If we just look at the RunTimeCountersPerThread node, we can compare other values such as Rows, Scans, and CPU time. We could really get crazy and extract all the different statements within the batch. There are numerous possibilities for analysis and review.

I’m not here to tell you that you should start using PowerShell to automate query tuning. Query performance is an art form and requires a lot of case-by-case analysis. However, like any great carpenter, it’s good to know the capabilities of your tool set. Understanding the options available to you not only helps you be more effective, but can also provide answers you may not have had access to.

A Heap of Trouble

A couple weeks ago I came across an interesting flaw in SQL Server partitioning that was causing one of my tables to balloon in size. Essentially, we found that this large table in one of our databases had a lot of empty pages in it, pages that would usually get cleaned up by the SQL Server engine. Unfortunately, these pages were not getting removed, and this 2 billion row table was about triple the size it should have been.

I won’t focus on the table too much, but there were two elements that created this problem: the table was a heap, and the partition key was volatile (it got updated frequently). I’m not yet sure if it’s a case of forward pointers or some other mechanic within SQL Server, but the result is that when we start moving data in a heap across partitions by changing the partition key, the pages the data occupied were left behind, empty, in the old partition.

To test it, I built a basic heap table partitioned on an integer, then stuffed it with 1000 rows.  I then queried some of the DMVs to see what the layout of the data was:
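The original test scripts were embedded separately; a hedged reconstruction (all the names and boundary values here are my own, not the original script) looks something like this:

```sql
-- Partition function and scheme on an integer key (boundaries are illustrative)
CREATE PARTITION FUNCTION pf_heap_test (int)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4);
CREATE PARTITION SCHEME ps_heap_test
    AS PARTITION pf_heap_test ALL TO ([PRIMARY]);

-- A heap (no clustered index) built on the partition scheme
CREATE TABLE dbo.heap_test (
    pkey   int NOT NULL,
    filler char(1000) NOT NULL DEFAULT 'x'
) ON ps_heap_test (pkey);

-- 1000 rows, all landing in partition 2 (pkey = 1)
INSERT INTO dbo.heap_test (pkey)
SELECT TOP (1000) 1
FROM sys.objects a CROSS JOIN sys.objects b;

-- Page usage by partition
SELECT partition_number, record_count, page_count, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.heap_test'),
                                    NULL, NULL, 'DETAILED');
```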

Partition 2 has 1000 rows in it, occupying 150 pages, exactly as we would expect. Average page use is ~83.74%, so fairly full. All in all, nothing surprising. What happens, though, if we change the partition key value on those 1000 rows and move the data to partition 3?
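The move itself is just an update of the partition key. In terms of a hypothetical heap like dbo.heap_test partitioned on pkey (names are illustrative):

```sql
-- Shift every row's partition key; under RANGE RIGHT boundaries (1, 2, 3, 4),
-- pkey = 2 lands the data in partition 3
UPDATE dbo.heap_test
SET pkey = 2;
```

Re-running the sys.dm_db_index_physical_stats query afterward shows where the pages ended up.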

The 150 pages that the data used in partition 2 are still there! And we are now also using 143 pages in partition 3! The result of all this is that even though we’re not duplicating data, we’re duplicating space. Normally, SQL Server will detect these empty pages and remove them, but in this case they are not identified as empty.

What happens if, instead of updating our data, we simply delete the rows?
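Starting over from a freshly loaded copy of the hypothetical heap above, the delete test is just as simple. (It is documented behavior that deletes on a heap may not deallocate empty pages unless a table lock is taken, which may be part of what’s at play here.)

```sql
-- Remove every row from the freshly loaded heap
DELETE FROM dbo.heap_test;

-- The partition still reports its pages as allocated
SELECT partition_number, record_count, page_count, avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.heap_test'),
                                    NULL, NULL, 'DETAILED');
```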

Even though we removed all our data from partition 2, those pages are still out there occupying space (note the avg_page_space_used_in_percent field). Obviously, this can quickly spiral out of control and become a major issue.

I did go through all these tests with a clustered index on the table and the results were much more positive.  When I moved data from one partition to the next or deleted rows, the pages the data occupied were removed as expected, keeping the total size of the table down.

There are a couple lessons learned from this:

  1. Heaps are trouble!  Based on the evidence we have here, it’s fairly evident that using a heap table causes all sorts of problems with partitioning.  There are reasons the best practice of using clustered indexes in SQL Server exists, this being one of them.
  2. Do not use a volatile partition key!  Keep in mind that when we partition, we’re managing the physical structure under our table.  In general, we want to avoid moving data around on our disk.  If we choose a partition key that changes a lot, we’re creating a situation where our data moves around unnecessarily, which can cause us other problems even with a clustered index in place.
  3. Design is vital!  The problems we had were created initially many years ago.  At the time, things looked fine and there weren’t any noticeable issues.  It wasn’t until much later, when the table had grown to several billion rows, that we had problems.  You need to take care in designing your structures, because fixing very large tables is weeks (if not months) of coordinated work and can require significant downtime to implement.

So there it is.  I’m not sure if this is intended behavior or just an oversight.  Most of the folks I talked to regarding this had not encountered it before, which is not surprising, since the original table design violates several best practices.  However, I do want to make people aware of this, and I plan on filing a Connect item regarding it once I’ve had a chance to bounce it off a few more SQL brains.

For those interested, my testing can be reproduced by building a partitioned heap as described above, loading it with rows, then updating the partition key (or deleting the rows) and checking sys.dm_db_index_physical_stats after each step.

Process of Elimination

I never expected to talk as much about partitioning as I have so far this year. The tally is currently at six posts, with at least two more in the works. It’s a hot topic because partitioning is still a mystery to a lot of folks, and I’m glad I’ve been able to pull back the curtain a little bit. However, there are still some things that aren’t understood very well.

One of these is partition elimination. Heck, up until I really started studying partitioning, I didn’t understand it very well either. Just like many others out there, I thought that once you partitioned a table, all your queries would suddenly become faster because the engine would only look at the needed partitions. While SQL Server tries to do this, the reality of when partition elimination actually occurs is not quite as cool.

To figure out how partition elimination works, we only have to look at our query plans. Let’s crack open the demo I use in my partitioning presentation and take a look. As a quick recap, I use a basic orders table that is partitioned on an incrementing integer primary key:
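The demo table itself was embedded separately; a hedged sketch (names and boundary values are my own, chosen so that order_id values 11000–15000 land in partition 3 of eight) might be:

```sql
-- Seven boundary points with RANGE RIGHT yield eight partitions;
-- partition 3 covers order_id 10000 through 19999
CREATE PARTITION FUNCTION pf_orders (int)
    AS RANGE RIGHT FOR VALUES (0, 10000, 20000, 30000, 40000, 50000, 60000);
CREATE PARTITION SCHEME ps_orders
    AS PARTITION pf_orders ALL TO ([PRIMARY]);

CREATE TABLE dbo.orders (
    order_id     int IDENTITY(1,1) NOT NULL,
    order_date   datetime NOT NULL,
    order_amount money NOT NULL,
    CONSTRAINT pk_orders PRIMARY KEY CLUSTERED (order_id)
) ON ps_orders (order_id);
```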

Now let’s run a simple SELECT to query data out and capture the query plan that results:

select * from orders
where order_id between 11000 and 15000

Nothing complex here and nothing complex in the query plan. You’ll see the expected clustered index seek. What’s key is to look at the properties (F4) of your operator to see how your partitions are being used:

These two property values give us all the answers when it comes to partition elimination. First, Partition Count tells us that the engine only retrieved a single partition for our query. Actual Partitions gives us the partitions (in this case, partition 3) that were actually accessed. Great! In this case, by querying order_id, we were able to take advantage of our partitioning.

What if we use a different field; will SQL Server still try to take advantage of the partitioning? It depends on whether or not the partitioning key is used in any part of the query. We’ll go ahead and add an index on order_date (which will be partition aligned) and query to get the total order amount for a date range:

select sum(order_amount)
from orders
where order_date between '1/1/2012' and '2/1/2012'
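The supporting index referenced above wasn’t shown inline; a hedged version (the index name is my own) that keeps it partition aligned might be:

```sql
-- Creating the index on the partition scheme keeps it aligned with the table;
-- order_id, the partitioning column, is added to the index automatically
CREATE NONCLUSTERED INDEX ix_orders_order_date
    ON dbo.orders (order_date)
    ON ps_orders (order_id);
```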

Now, this plan is a little more interesting, so let’s take a look at it, focusing on the two places where we hit the table:

You’ll note that the plan has to use a key lookup to complete the query. This is because we have an index on order_date, but it doesn’t include the order amount, so the engine has to go back to the table via the clustered index to find our data. Let’s now look at the properties for each of these operators:

Where are the partition properties for the index seek? Since the query can’t make use of partitioning for this part of the query, the properties aren’t listed. However, the key lookup (which makes use of the clustered index and, therefore, our partitioning scheme) does show that we accessed only partition number 3.

What if we go a step further and alter our index on order_date to include order_amount, removing the need to do a key lookup in the query? Let’s take a look:
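A hedged version of that index change (again, names and the DROP_EXISTING rebuild are my own sketch):

```sql
-- Rebuild the hypothetical order_date index so it covers the query
CREATE NONCLUSTERED INDEX ix_orders_order_date
    ON dbo.orders (order_date)
    INCLUDE (order_amount)
    WITH (DROP_EXISTING = ON)
    ON ps_orders (order_id);
```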

The properties now show that the query accessed all eight partitions (1..8). The engine tried to use the partitioning, but since the partitioning key wasn’t directly involved, it ended up seeking across the entire index anyway.

To wrap it all up, let’s look at the performance of our order_amount queries:

  • Query cost on index without include: 2.63403
  • Query cost on index with include: 0.0276018

So even without using our partitioning, including the order_amount is still much faster than having to do a key lookup.

Overall, partition elimination can be a helpful tool in the engine’s toolbox. As your data sets get larger and larger, it can improve your performance, but only if you understand how the engine uses partitions. Sometimes, however, it’s better to take a different approach.  Don’t go and rewrite all your queries simply to get your partitioning key involved, because it may not help. Always do a careful analysis of what’s going on under the covers before making any tuning decisions.

Excuse me sir, are you using that partition?

So last week we had to puzzle out a little weirdness at work. One of our development teams is working to implement partitioning, and they came to us about their scheme: they had applied it, but it wasn’t being used by their queries. The scheme was pretty simple, but the oddity was that they had partitioned on an nvarchar(10) field. Best practice is to use an integer, though I’ve seen date fields used as well. I knew that you could partition on a character field, but I usually avoided it.

Now, using an nvarchar field shouldn’t have altered the query process that much, right? This is where it got odd, because what we were seeing was that a query on that partition scheme was still querying across all partitions. Here’s the query and what we were seeing in the query plan:
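The original query was embedded separately; a hedged reconstruction (the table name is hypothetical, with CUSTOMERID taken from the discussion below as the nvarchar(10) partition key) is:

```sql
-- The literal N'2' looks like a direct match for the nvarchar partition key
SELECT *
FROM dbo.Orders
WHERE CUSTOMERID = N'2';
```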

One of the advantages of partitioning is improved query performance on range scans, where the engine only has to query the relevant partitions. However, even though our query was specifically using the partition key, the engine was still querying across all partitions for its result. And we were specifically using an nvarchar value for the query criteria, so it shouldn’t be doing any sort of conversion, right? Well, that’s where you (and we, as well) would be wrong. We tried the query again, this time with an explicit conversion to nvarchar(10):
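Under the same hypothetical names as before, the reworked query would be:

```sql
-- Explicitly matching the partition key's declared datatype
SELECT *
FROM dbo.Orders
WHERE CUSTOMERID = CONVERT(nvarchar(10), N'2');
```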

Bang! By converting it to the same datatype as the partition key, the query engine can now properly take advantage of our partitioning. The results are further reinforced by looking at the query costs of our two queries:

  • Query 1: 0.0131291
  • Query 2: 0.0032831

What just happened?

What’s going on behind the scenes is that in query 1 our predicate datatype does not match the datatype of our partition key. The SQL Server engine sees the query predicate of CUSTOMERID = N'2' as an nvarchar(4000), not as an nvarchar(10). Take away the “N” in front of the string value, so that it’s declared as a varchar character string, and you force an implicit conversion instead. If you then look at the query operator, you’ll see SQL Server do a CONVERT_IMPLICIT to nvarchar(4000), not nvarchar(10).

I also went through this exercise using varchar as my datatype and got the same results. If I simply declared the predicate as CUSTOMERID = '2', my query would seek across all partitions. If I used CUSTOMERID = CONVERT(varchar(10), '2'), then I would only have the one active partition.

I’m not sure if this is a bug or working as intended, but there are definitely a couple of lessons we can take from this:

  1. To best take advantage of querying across partitions, you need to make sure your query predicate is explicitly the same datatype as your partition key. Otherwise, the engine can’t make use of your partitioning.
  2. Obviously character strings have some weirdness about them for partitioning, so it’s best to avoid them for partition keys.
  3. Mike is weird for geeking out about this stuff. Seriously. My co-workers gave me some odd looks as we were puzzling this out.


Standing on the Shoulders of Giants

When I went to SQL Rally last week, I fully expected to be learning from many of the presenters there. The excitement about being immersed in all that awesomesauce is infectious. What I didn’t expect is that I would take my turn at teaching, handing out information like candy to twelve-year-olds. It was one of those times where I was talking shop with some other folks while we were all taking a session off. The group started chatting about performance tuning (something I geek out about), and I shared some of the fundamental links, blogs, and tools I use when I get the inevitable “My query is slow!” call.

Now, just to be clear, none of the information I shared was “mine”. The beauty of the SQL Server community is that we have so much great talent and everyone is willing to help educate. What follows here is a list of my core tools and information links, resources that I use on almost a daily basis. I also see them as the start of a journey, because none of this is the last word on a particular subject. If you’re like me you’ll continue to browse the Google ‘verse with what you pick up here.

Brent Ozar:

Full disclosure, I’m a Brent Ozar fanboy. I started reading his blog about a year or so ago and he’s made me laugh and taught me a whole heck of a lot. It’s because of him that I started blogging and getting involved in the community.

Glenn Berry:

Glenn Berry is a local Denver MVP and I see him from time to time at the Denver SQL User’s Group meetings. Not only has he provided these great DMV queries, but he really understands hardware and has some great posts on the nitty-gritty of your machines.

General Community Resources

These two sites are invaluable resources for articles and tutorials on getting stuff done. When I was figuring out how to implement partitioning, I did a lot of reading on both of these sites.

Documents

Both of these docs live on my laptop and are constantly getting referenced. My only complaint about the waits and queues white paper is that it’s a little stale and I hope Microsoft does an updated one for Denali when it’s released.

If you don’t have a cheat sheet with your go-to resources, hopefully this list gives you a good start. Or you can add it to your existing list. I have found each one of these links invaluable to my day-to-day work, not only enriching my general knowledge but giving me the tools to troubleshoot many of the performance problems that come my way.

Meme Monday – No, it’s not really the disk

Tom LaRock asked for 9 of your biggest non-disk problems for Meme Monday, so here we go:

1) Index Fragmentation
2) Memory over-consumption (stop running your apps on the SQL Server!!!!)
3) Runaway queries
4) Network communication
5) ODBC/OLEDB drivers (Oracle, I’m looking at you)
6) Over-indexing (yeah, that table with 18 columns and 12 indexes….)
7) Lack of partitioning
8) Stale statistics
9) Locking (did you really have to use that exclusive table lock hint?????)

5 of these I’ve had to resolve in the last week. All of them within the last month. The good news is that most of these are pretty easy to identify and solve, but the bad news is most of these issues are the result of bad coding/querying practices and can be prevented with a little forethought. Which means I have more work to do with my developers…

Thanks for joining me on Meme Monday!

————————————

Edit: After a brief discussion with Jeff Smith (b|t), I thought I should clarify the “Lack of Partitioning” item.  I know most people use partitioning to spread the I/O load across data files, but I found it a HUGE boost for performance because it allowed me to segment my table into manageable parts for maintaining indexes and improving select performance.  Take a look at my partitioning post for more info.