Art of the DBA Rotating Header Image


#TSQL2SDAY: You’re Gonna Cluster that GUID

370801205_efe0fa8d7d_zWelcome to the continuing blog party that is T-SQL Tuesday. This month’s host is Mickey Stuewe (@SQLMickey) with a challenge to blog about data modelling mistakes we’ve seen. Unfortunately, it’s an easy topic to blog about because proper data modelling often falls by the wayside in order to rapidly deliver applications. I’m not saying it should, but the fact of the matter is many developers are so focused on delivering functionality that they do not think about the long term implications of their design decisions.

One such mistake is the selection of Globally Unique Identifiers (GUIDs) for a data type. Developers love this data type for a lot of reasons, some of them very good. The fact that GUIDs are globally unique provides a way to keep data unique when migrating along with a nicely obfuscated surrogate key that can help protect user data. All in all, I can not really blame people for wanting to use this data type, but the problem is that SQL Server does not manage that data type very well behind the scenes.

The problems with GUIDs in SQL Server are fairly well documented. As a quick review for context, the issue is that since GUIDs are random values, it is hard to efficiently index them and these will rapidly fragment. This means slower inserts and more disk space taken up by the index. I created the following two tables (one with a GUID, one with an INT) and inserted 2000 rows into each:

create table GUIDTest(
    orgid uniqueidentifier default NEWID()
    ,orgname varchar(20)

create table NonGUIDTest(
    orgid int default NEXT VALUE FOR NonGuidSeq
    ,orgname varchar(20)

The results from sys.dm_db_index_physical stats tell the story from a fragmentation and size perspective:

11-7-2015 11-37-41 AM

So GUIDs are bad. With as much as has been written on this topic, it feels a bit like shooting fish in a barrell. However, GUIDs will find their way into databases because of their usefulness to the application layer. What I want to talk about is a common misconception around “solving” the GUID problem: clustering on a sequence ID but keeping the GUID in the table.

Let’s start by creating a new table to have a GUID and cluster on a sequence:

create table SeqGUIDTest(
    seqid int default NEXT VALUE FOR GuidSeq
    ,orgid uniqueidentifier default NEWID()
    ,orgname varchar(20)

11-7-2015 11-48-28 AMAs expected, less fragmentation and size. This is good, right? It can be, but here’s the problem: the sequence is completely meaningless to our data and our queries will likely not use it (unless we build in additional surrogate abstraction to relate sequence to our GUID). Let’s compare query plans for our GUIDTest and SeqGuidTest tables where we query each for a specific orgid value:

11-7-2015 11-54-20 AM

The query where the GUID is a clustered index is far more efficient than the one against the table where we cluster on a sequence. This is because it can leverage the index, meaning we will get a seek instead of a scan. While clustering on a sequence field saves us on space and fragmentation, it ends up hurting us when trying to do data retrieval.

If we were tuning the query against SeqGuidTest, the next logical step for tuning would be to create a non-clustered index on orgid. This would improve the query, but in order to make it useful (and avoid key lookups), we would need to include all the columns of the table. With that, we have completely negated any benefit we got from clustering on a sequence column, because the situation is now:

  • A clustered index that is the table, thus using all that disk space.
  • A non-clustered index that also uses as much space as the table (it has all the columns in it as well).
  • The non-clustered index now has the same fragmentation problem we were trying to avoid on the clustered index.

So while trying to avoid a problem, we have made the problem worse.

There are two lessons here.  The first is the classic “it depends” case. There are no hard and fast rules to how we implement our databases. You need to be flexible in creating your design and understand the complete impact of your choices. Clustering on a GUID is not great, but in the above case it is far better than the alternative.

The second lesson is to understand your data and how it is going to be used. When we build tables and indexes, we need to be conscious of how our data will be queried and design appropriately. If we use abstractions like surrogate keys, then we need to make sure the design is built around that abstraction. In the above examples, we could cluster on a sequence key, but only if that sequence key has some sort of meaning in our larger database design. Building a better database is about understanding our relationships and appropriately modeling around them.

Thanks to Mickey for the great T-SQL Tuesday topic! Please check her blog for other great posts throughout the day.

Database Oil Changes – Part 2

She’s comin’ apart at the seams!

The other big maintenance task that should be regularly done on your databases is managing your index fragmentation. For those wondering, index fragmentation is how the page splits in your indexes will cause the data pages that make up your indexes to become logically disorganized, so that when the SQL Server engine goes to use the indexes, it will have to make more calls to the physical disk to get the appropriate index pages, thus requiring more work (ultimately meaning your queries will perform slower).

For the full skinny on what index fragmentation is, take an hour and watch Paul Randal’s(b|t) MCM training video.

In order to address index fragmentation, you have a two options:

  • REORGANIZE – This will reorganize the data pages in your index on the leaf level. While it will compact and organize data on that leaf level, header records won’t be adjusted, meaning you might still have index fragmentation.
  • REBUILD – This is exactly what it says it is, a complete rebuild of the index. This is pretty time and resource intensive (especially if your table is large), but once it’s done you will have a clean and shiny new index with little or no fragmentation.

The general rule of thumb on these two options is to REORGANIZE when your fragmentation is low, REBUILD when it gets higher. The numbers typically used are to REORGANIZE when your fragmentation is between 10% and 30%, REBUILD when you exceed 30%, but your mileage may vary.  As with many aspects of SQL Server, there are no real hard and fast rules, so make sure you evaluate your environment when deciding between these two options.

A Level 3 Diagnostic

So how do we find fragmentation? Fortunately, there’s a dynamic management object just for this purpose: sys.dm_db_index_physical_stats.   CAUTION: This is one of the few super-costly DMOs out there, as it will scan all your index pages to find out the exact fragmentation. I recommend you read up on the different options before using it and do not run it on production servers during high activity times.

To get you started with this view, here’s a basic query to get your fragmented indexes in the current database:

SELECT AS databaseName
    , ps.OBJECT_ID AS objectID
    , object_name(ps.OBJECT_ID,db.database_id) objectName
    , ps.index_id indexID
    , ps.partition_number partitionNumber
    , ps.avg_fragmentation_in_percent fragmentation
    , ps.page_count/128.0 index_size
FROM sys.databases db
  JOIN sys.dm_db_index_physical_stats (db_id(),NULL, NULL , NULL, N'Limited') ps
      ON db.database_id = ps.database_id
  left join sys.indexes idx on (ps.object_id = idx.object_id and ps.index_id = idx.index_id)
WHERE ps.index_id > 0 --Skip heap tables
	and ps.avg_fragmentation_in_percent >= 30 --Fragmentation >= 30%
order by fragmentation desc
OPTION (MaxDop 1);

Make it so

So armed with our tools, now we need to set up a regular job to find and defrag our indexes. The frequency with which you do this is completely dependent on your environment, so it might be once a day, once a week, or even once a month. The key, as with the statistics maintenance, is to do it in relation to how frequently your data changes. I try to rebuild my indexes daily (well, only the ones that need it) in my OLTP systems. In a data warehouse, I’ll set up my job to run whenever a major ETL load completes, as that will be just after my data has changed significantly.

And, because the SQL Server community is so awesome, you don’t even have to write your own script! Go take a look at Michelle Ufford’s(b|t) Index Defrag script. It covers many of the possible scenarios for index rebuilding and, with all its parameters, it’s very configurable. You can easily take this stored procedure, implement it into a SQL Agent job, and *BAM*, instant maintenance!

What this will do, as with statistics maintenance, is keep your data in the best shape possible for querying. By reducing the fragmentation in your indexes, you reduce the I/O hit you’ll take whenever you query against those indexes and have one less thing to worry about when troubleshooting performance problems.

One more post to go! We’ll wrap up with me talking(well, writing, really) a little bit about transaction log backups and then your overall general maintenance plan.