Art of the DBA

The Hierarchy of Monitoring Needs

For those not aware, I've been selected as a presenter at this year's PASS Summit in Charlotte, NC.  I'm deeply honored by the selection and look forward to the experience.  The topic is something I've been working on for some time now:  How should administrators approach monitoring?  This seems like a simple question, but go Bingle "SQL Server Monitoring" and see what you get.  Go ahead, I'll wait.

The results are either tools or metrics.  While these are useful, they highlight a significant gap: a methodology.  The closest you'll find is Adam Machanic's (@adammachanic) Ten Commandments Of SQL Server Monitoring (which applies to all monitoring, really, but I digress).  The goal with my PASS presentation is to provide a "how to" plan for creating your overall monitoring strategy.

Hierarchy of Monitoring Needs

How does this methodology work?  Several years ago I was introduced to Maslow's Hierarchy of Needs.  It's a structured view of what humans need, built around tiers where a person addresses each level before they are concerned with the next set of needs.  The template carries over well, such that Brent Ozar(@brento) even wrote about his own Hierarchy of Database Needs a while back.

So what’s my approach?  To get the detail, you’ll need to come to my PASS session, but I’ll let the overview whet your appetite:

Survival – There are some factors database administrators must be concerned with.  As guardians of the data, we have to ensure three things about our systems before we can worry about anything else:

    • Reliable backups – You hear this ad nauseum, but it's true.  Without backups, we're crossing the high wire without a net.  Note, also, that these are "reliable" backups.  You need to know more than that your backups are happening: can you find them and are they usable?
    • Service Availability – Is SQL Server up?  Is it responding to queries?  While this seems like a "duh" moment, if our users can't access their data, then we're in almost as much trouble as we would be if we couldn't recover from a backup.
    • Space – This is not the long-range capacity planning that we will worry about later down the road, but a validation that we're not "red lined" for space, because if we run out of space, we affect the availability and use of our databases.

Knowledge – In order to identify what to monitor, we need to know what we're monitoring for; otherwise we're wandering in the mountains without a map.  We establish this by working with the business to create our Service Level Agreements.  I wrote about SLAs a year ago and my approach is still the same:

    • Availability – Users want to know their data is queryable, protected, and secure.  We need to define our customer’s expectations for high availability, disaster protection, maintenance windows, and security.
    • Performance – Users want their data fast, but we need to put definition around this.  What defines a query as too slow?  How long can things be blocked?  What sort of resource usage is allowed within our environments?
    • Support – While this only partially affects our monitoring, our users want to know whose "throat to choke" if there's an issue.  For monitoring, that defines escalation patterns and critical/non-critical alerts.

Direction – Metrics and whether they're too high or too low are commonly what DBAs think of when they discuss monitoring.  Are my batches/second too high?  What's my I/O throughput?  Page Life Expectancy?  Wait types?  The list goes on and on.  However, like many things, DBAs want to put the cart before the horse and just start looking at stats.  By dealing with SLAs first, we'll be better equipped to monitor what's important to the business rather than what's important to us.

Actualization – After we know what to look for, it's time to decide how we look for it, and this leads us into the discussion of tools.  The problem is, there are loads of tools out there you can use.  Do you go with a specialized third party product?  Maybe some larger, more generic system that integrates with other parts of the enterprise?  Do you write your own, using what's freely available and paying the cost in time to build?  Whatever the case, having a list of metric needs equips us with the knowledge necessary to identify what tool set best suits our situation.

Understanding – One of the trickiest parts of monitoring is knowing what’s a problem and what is just normal behavior.  Once our metrics are defined and our tools are set up, we need to collect baselines and map trends to understand the health of our enterprise.  While our SLAs can give us a starting point, we need to study our environments over time.

Refinement – Once everything is in place, we can start the cycle of fine tuning our monitoring.  Systems and processes change over time.  New storage will need to be added and new applications brought on line.  With refinement, we are able to periodically review our trends and anticipate future needs.  Managing our metrics based on these changes and adapting our monitoring is an ongoing process and is never set in stone.

Just so you know, this is just the 30,000-foot view that gives administrators the basic map.  At Summit I'll discuss the territory in a little more detail, along with some scripts, templates, and additional information to help you with your own monitoring needs.  I'm very excited to be sharing this at Summit and hope that you'll all join me for it.

Server Inventories

Ok, I’ve been promising this to people for a while, time to get at it.  Brace yourself for a long one.

As enterprise DBAs, we usually have many servers to manage.  Whether it's around ten, a hundred, or a thousand (or more!), tracking this manually is a bear.  However, there are still many shops that maintain convoluted spreadsheets and other manual documentation to track their inventory.  Manual sucks.  As John Sansom(@SQLBrit) says, the best DBAs automate everything, including their server inventories.

 There’s a lot of approaches and third party tools you can use for your own environment.  In my mind, I’m totally cool with spending the money on a third party tool because of the depth they can provide in my tracking and monitoring.  A lot of times, though, we might not have access to those tools or we may need more information than those tools provide.  It’s at this point we sit down at our keyboard, crack our fingers, and start banging out some code.  Of course, the code I started banging out was Powershell (shocking, I know!).

 I didn’t start from scratch, though.  When putting my own script together (which we’ll get to later), I found two great resources that got me about 80% of the way:

 Allen White(@SQLRunr) – Let PowerShell do an Inventory of your Servers
Colleen Morrow(@ClevelandDBA) – Building A SQL Server Inventory

These are both great resources that will get you started, but I needed a little more.  The biggest challenge I had was a need to poll cluster information and collect SQL instance information separate from my machine information.  I also wanted to do some consolidation of my data sets.  So I borrowed heavily from both Allen and Colleen, but then molded it to my own purposes.

Process

Before we dig into the code, I first want to touch on the collection process and how I handle it.  It’s not complex, but code doesn’t make sense until you understand the “why” of the different components.

What I’m collecting

  • SQL Instance information – Whether it’s an instance living on a cluster node, a physical stand alone, or a VM host, I want the SQL Instance without really caring about the machine it lives on.  At least not meshed in directly with the machine, though I DO collect the physical host name so I can identify which machine information I need.

  • Physical Machine information – Since I’m not tying the machine info directly with the SQL Instance collection, I want the physical information for the servers my instances live on.  This is where I get everything that isn’t tied directly to the SQL Instance.

How I’m collecting it

  • I’ve got 4 tables, 2 live tables and 2 stage tables, that I use for the collection.  Nothing out of the ordinary, but there’s an instance table with a stage table and a machine table with a stage table.

  • I use a Powershell script to perform the collection itself.  It uses the instance table as its primary driver: the idea is that I provide the initial population of instance names to track in the table and the inventory process will fully populate the rest of the information.

  • When the Powershell script completes, it executes a stored procedure to load data from the stage tables into the live tables.  The load is simple: the instance table is updated with data from the stage and the machine information is deleted/replaced.

  • The Powershell script is run by a SQL Agent job running under a specific monitoring Active Directory account created as a credential in SQL Server.  The gotcha here is that the monitoring account needs domain access to the clusters and machines it’s going to be querying.

The Code

I won’t list the full scripts here, but you can download and review them here:

SQL Objects
Powershell script

 Let’s first talk about the dependencies:

  • The SMO, so you're going to need to have either your script or your profile load the 2008 snap-ins or import the 2012 sqlps module (see the sketch after this list).

  • Chad Miller’s Out-DataTable and Write-DataTable (I combined them into one file, DataTables.ps1).

  • The FailOverClusters module, part of the Remote Admin pack if you’re not running the code on a machine with the Windows Server OS, part of the Fail Over Cluster components if you are.
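
For reference, here's a minimal sketch of what that loading might look like at the top of your script or profile. Use either the 2008 snap-in lines or the 2012 module import (not both), and the DataTables.ps1 path is just an example:

#SQL Server 2008/R2: load the SQL Server provider and cmdlet snap-ins
Add-PSSnapin SqlServerProviderSnapin100
Add-PSSnapin SqlServerCmdletSnapin100

#SQL Server 2012: import the sqlps module instead
Import-Module sqlps -DisableNameChecking

#Chad Miller's Out-DataTable/Write-DataTable, combined into a single file
. C:\Scripts\DataTables.ps1

#Cluster cmdlets (Remote Server Administration Tools or the Failover Clustering feature)
Import-Module FailoverClusters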

There are two key functions I use, Get-Instance to gather the SQL instance information and Get-Machine to gather the machine information:

#Collects and returns SQL Instance information
function Get-Instance([string]$instcoll,[int]$id,[string]$name)
{
try
{
	$smo = new-object ('Microsoft.SqlServer.Management.Smo.Server') $name
	$sname = $smo.NetName
	$iname = $smo.InstanceName
	if($iname.Length -eq 0 -or $iname -eq $null) { $iname = "MSSQLSERVER" }

	$managedcomp = new-object ('Microsoft.SqlServer.Management.Smo.WMI.ManagedComputer') $sname
	$output = New-Object System.Object

	$port = $managedcomp.ServerInstances[$iname].ServerProtocols["Tcp"].IPAddresses["IPAll"].IPAddressProperties["TcpPort"].Value
	$ip = (Test-Connection $sname -count 1).IPV4Address.ToString()

	$output | Add-Member -type NoteProperty -name InstanceId -value $id
	$output | Add-Member -type NoteProperty -name SQLVersion -value $smo.VersionString
	$output | Add-Member -type NoteProperty -name SQLVersionDesc -value $smo.ProductLevel
	$output | Add-Member -type NoteProperty -name SQLEdition -value $smo.Edition
	$output | Add-Member -type NoteProperty -name MemoryMinMB -value $smo.Configuration.MinServerMemory.RunValue
	$output | Add-Member -type NoteProperty -name MemoryMaxMB -value $smo.Configuration.MaxServerMemory.RunValue
	$output | Add-Member -type NoteProperty -name MAXDOPVal -value $smo.Configuration.MaxDegreeOfParallelism.RunValue
	$output | Add-Member -type NoteProperty -name IP -value $ip
	$output | Add-Member -type NoteProperty -name Port -value $port
	$output | Add-Member -type NoteProperty -name PhysicalHost -value $smo.ComputerNamePhysicalNetBIOS
	return $output
}
catch
{
	write-host "Error collecting $name"
	return $null
}
}#Get-Instance

#Get host machine information via WMI
function Get-Machine([string]$name,[string]$clst)
{
 try
{
	$comp = gwmi Win32_ComputerSystem -Computer $name | select Model,Manufacturer,TotalPhysicalMemory
	$proc = gwmi Win32_Processor -Computer $name | select NumberOfLogicalProcessors,MaxClockSpeed
	$os = gwmi Win32_OperatingSystem -Computer $name | select OSArchitecture,Name,Version,ServicePackMajorVersion,ServicePackMinorVersion

	$output = New-Object System.Object

	$output | Add-Member -type NoteProperty -name MachineName -value $name
	$output | Add-Member -type NoteProperty -name Model -value $comp.Model
	$output | Add-Member -type NoteProperty -name Manufacturer -value $comp.Manufacturer
	$output | Add-Member -type NoteProperty -name Architechture -value $os.OSArchitecture
	$output | Add-Member -type NoteProperty -name PhysicalCPUs -value $(if(!$proc.Length){"1"}else{$proc.Length})
	$output | Add-Member -type NoteProperty -name LogicalCPUs -value ($proc | Measure-Object NumberOfLogicalProcessors -sum).Sum
	$output | Add-Member -type NoteProperty -name CPUSpeed -value ($proc | Measure-Object MaxClockSpeed -max).Maximum
	$output | Add-Member -type NoteProperty -name MaxMemory -value ($comp.TotalPhysicalMemory/1MB)
	$output | Add-Member -type NoteProperty -name OSName -value $os.name.split("|")[0]
	$output | Add-Member -type NoteProperty -name OsVersion -value $os.Version
	$SPMaj = $os.ServicePackMajorVersion
	$SPMin = $os.ServicePackMinorVersion
	$output | Add-Member -type NoteProperty -name SPVersion -value "$SPMaj.$SPMin"
	$output | Add-Member -type NoteProperty -name Cluster -value $clst

	return $output
}
catch
{
	write-host "Error collecting $name"
	return $null
}
}#Get-Machine

Both leverage the SMO and WMI to get the relevant info.  I'll let you sort through the individual elements I'm pulling; it's fairly standard config info.  Because I think in terms of tables, data gathering is a collection of objects that are essentially rows: I append each new object, with all the appropriate properties, to a larger collection, then use Write-DataTable to push that data directly into the related stage table.
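
As a rough outline of that pattern (the server, database, table, and procedure names below are placeholders rather than the actual objects in the downloadable script, and it assumes the dependencies listed above are loaded):

#Sketch of the collection pattern - object names are placeholders
$instances = Invoke-Sqlcmd -ServerInstance "INVENTORYSERVER" -Database "ServerInventory" `
	-Query "SELECT InstanceId, InstanceName FROM dbo.Instances"

$instcoll = @()
foreach($inst in $instances)
{
	$row = Get-Instance -id $inst.InstanceId -name $inst.InstanceName
	if($row -ne $null) { $instcoll += $row }
}

#Convert the collection to a DataTable and bulk load it into the stage table
$dtable = $instcoll | Out-DataTable
Write-DataTable -ServerInstance "INVENTORYSERVER" -Database "ServerInventory" -TableName "InstanceStage" -Data $dtable

#Move the staged rows into the live tables
Invoke-Sqlcmd -ServerInstance "INVENTORYSERVER" -Database "ServerInventory" -Query "EXEC dbo.LoadInventory"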

Look at the entire script for my full process.  I've been pretty happy with it and it's been chugging away in my environments now for about 8-9 months without a whole lot of interaction from me.  If we stand up a new server, then I just pop that name in the instance table, run the job, and everything gets updated.  What made me most proud was that we had a recent meeting with a new VP and pulled up the SSRS report built off of this information to give him an idea of what our environments look like.  His comment was "This is the most documentation I've seen since I got here."  The beauty is that it was provided without a whole lot of manual maintenance on my part.

This is very much an evolving work on my part, but hopefully gives you some insight into how I manage my environments.  I’ve still got lots of ideas on improvements, including automatic population (or other use) of Central Management Server.  If you have any ideas on how this can be improved, I’d love to hear them.

Power(shell)ing your CMS

If you’re looking at Powershell scripts for SQL Server management, you’ll find a lot of them consume a list of servers in order to perform an action. For example, I put a script out a couple of weeks ago that can monitor the service state for your SQL Servers, emailing out a notice if services are down. This is handy, but maintaining a server list in a file is hardly elegant. For something more dynamic, many shops(including my own) will maintain a list of their servers in their own tracking database, giving you a resource you can query. However, there’s an easier way to handle this if you’re using Central Management Server.

The nice thing about Central Management Server is that, just like SQL Server instances, it can be accessed through the SMO with Powershell. This gives us a handy way to browse our server listings as a folder structure. Just open up your Powershell window (assuming you've loaded your SQLPS module/snapin) and change your directory to:

CD SQLSERVER:\SQLRegistration\Central Management Server Group

Once you’re there, get a listing of the contents with your favorite command (dir/ls/gci) and you’ll see your registered Central Management Server repositories. You can browse from there into the appropriate repository and see a listing of all your registered instances. Now, keep in mind, you’ll still have to browse the directory structure if you’ve created folders to organize your servers into, but now we have access to them as if they were files in a folder.

Let’s take it a step further now and build out a server listing using some pretty basic commands within Powershell. We only have to keep in mind two gotchas:

  • The SMO recognizes both the folders and the instances as “containers”, so we can’t filter by PSIsContainer being false (the object property indicating container-ship).
  • The instance object has both Name and ServerName properties. The Name is the descriptive name, while ServerName is the specific name of the instance we can connect to.

Keeping this in mind, we can build a list by running the following command in the repository:

ls -recurse | where {$_.ServerName -ne $NULL}

Breaking this down, it's a standard looking command for Powershell. By using "-recurse", we tell Powershell to recursively browse through all the folders and subfolders in our directory for the lookup. Filtering on "$_.ServerName -ne $NULL" means we exclude the directories (since they don't have a ServerName). Now we have an array of all of our servers that we can use for processing loops:

$srvs = ls -recurse | where {$_.ServerName -ne $NULL}

ForEach ($srv in $srvs)
{
  $inst=$srv.ServerName
  #do some stuff
}

Another handy element in that object is the ConnectionString property. This provides you with a ready-made connection string to use if you need to connect to SQL Server using OLEDB or ODBC calls. While I haven't had a need to use this yet, I can definitely see the benefits. There are also other properties that can be explored and used, depending on your scenario.
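
For instance, while sitting in the repository path, you can eyeball those properties (including the connection string) straight from the listing:

#Peek at the properties available on each registered instance
ls -recurse | where {$_.ServerName -ne $NULL} |
	select Name, ServerName, ConnectionString |
	format-table -autosize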

All of this makes for a great extension of Central Management Server, allowing DBAs to leverage the repository for more than just centralizing lookups in Management Studio. This can be a handy tool if you need something with a little less administrative overhead than a server inventory database or some sort of file listing. This means that the Central Management Server, which most of us see as something we only really use within Management Studio, can be leveraged to be so much more.

The Tools to Survive

Last week, I blogged about monitoring backups, service status, and disk space usage. These are the keys to survival for you and your databases. There are several ways that these can be monitored and there are many third party tools that you can use for these tasks. However, third party tools are often not an option.  Fortunately, you can also make use of SQL and Powershell to handle basic monitoring. Here’s a couple scripts that you can use in your environments.

Overdue Backups

The backupset table makes it easy to know when your backups are being taken. By making use of a couple queries, you can find out if your databases are actually being backed up and take appropriate action. This particular script is designed to check for any databases that haven’t had a full backup (type=’D’) in the last three days and email out an alert if any are found.

with db_fulls(dbname,last_backup)
as
(select
	database_name,
	max(backup_start_date)
from
	msdb.dbo.backupset
where
	type='D'
group by
	database_name)
select
	d.name,
	d.recovery_model_desc,
	df.last_backup
into #results
from
	sys.databases d
	left join db_fulls df on (d.name = df.dbname)
where
	isnull(df.last_backup,'1/1/1900') < dateadd(dd,-3,getdate()) 

if (select count(*) from #results) > 1
begin
	DECLARE @tableHTML  VARCHAR(MAX)
	,@emailrecip varchar(200)

	--SET YOUR EMAIL ADDRESS HERE
	set @emailrecip = 'you@yourcompany.com'

	SET @tableHTML =
		N'<h1>Overdue Backup Report</h1>' +
		N'<h2>No full backups in the last 3 days</h2>' +
		N'<table width="1250" border="1">
<tr>
<th align="left" width="40%">Database</th>
<th align="left" width="30%">Recovery Model</th>
<th align="left" width="30%">Last Backup</th>
</tr>' +
		CAST ( ( SELECT td = rtrim(name), '',
				td = rtrim(recovery_model_desc), '',
				td = isnull(convert(varchar(100),rtrim(last_backup),100),'NO BACKUP'), ''
			from #results order by name
			FOR XML PATH('tr'), TYPE ) AS VARCHAR(MAX) ) +
		N'</table>' ;

	exec msdb.dbo.sp_send_dbmail
		@recipients = @emailrecip,
		@subject = 'Overdue Backup Report',
		@body = @tableHTML,
		@body_format = 'HTML' ;
end

drop table #results

The basics here can also be modified for your particular needs. In my own environment, I use this and a second script that monitors for log backups (type='L'). I schedule the full backup report to run every morning. For the log backups, I run the report every 4 hours and alert if a database hasn't had a log backup in the last 12 hours.

Stopped Services

It’s hard to monitor SQL Services from SQL Server itself, so for this task I turned to Powershell. By making use of the Get-Service commandlet, you can collect all service related information from a computer. Add some functionality for email, and you can build a handy little service monitoring script.

$srvlist = Get-Content $args[0]
$report=@()

foreach ($server in $srvlist)
{
	try
	{
          $svcs=Get-Service -ComputerName $server | where {$_.name -like "*SQL*"}
		foreach ($svc in $svcs)
		{
			$output = New-Object System.Object
			$output | Add-Member -type NoteProperty -name Instance -value $server
			$output | Add-Member -type NoteProperty -name SvcName -value $svc.Name
			$output | Add-Member -type NoteProperty -name DisplayName -value $svc.DisplayName
			$output | Add-Member -type NoteProperty -name Status -value $svc.Status
			$report+=$output
		}
	}
	catch
	{
		$output = New-Object System.Object
		$output | Add-Member -type NoteProperty -name Instance -value $server
		$output | Add-Member -type NoteProperty -name SvcName -value "No_Service_Collected"
		$output | Add-Member -type NoteProperty -name DisplayName -value "No Service Collected - COLLECTION ERROR"
		$output | Add-Member -type NoteProperty -name Status -value "ERROR"
		$report+=$output
	}
}

#Set these for your environment
$smtp="yourmail.server.com"
$from="SvcAlert@yourserver.com"
$to="You@yourcompany.com"

if(@($report | where {$_.Status -ne "Running"}).Count -gt 0)
{
	[string]$body=$report|where{$_.Status -ne "Running"}| ConvertTo-HTML
	Send-MailMessage -To $to -from $from -subject "Service Monitor Alert!" -smtpserver $smtp -body $body -BodyAsHtml
}

To call this script, you'll want a text file that contains a list of your servers (passed as an argument to the script).  Then, run the script as a Windows Scheduled Task.  I recommend setting it up to check the status every few minutes, running from a different computer than your SQL Server; this guards against your monitoring process failing because your SQL Server shut down. You will also need to ensure that whatever account the task runs under has remote access to your SQL Servers.
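
As a hypothetical example (the paths are placeholders), the scheduled task's action would be a command line along these lines:

#Scheduled task action: run the script and pass the server list file as $args[0]
powershell.exe -ExecutionPolicy Bypass -File "C:\Scripts\Check-SqlServices.ps1" "C:\Scripts\serverlist.txt"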

Free Space

There is actually quite a bit written about this particular topic, and there are two ways you can go about it, depending on whether you want to monitor your space from within SQL Server or externally. One option is to use xp_fixeddrives for querying data from directly within SQL Server. If you've upgraded to SQL 2008 R2 SP1, you can also make use of sys.dm_os_volume_stats. Personally, I lean more towards making use of Powershell (again) because of the flexibility it gives me. For this, there are several options, but a good place to start is this script off of Simple Talk. Any of these choices will give you the information you need.
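
If you want a quick Powershell starting point before digging into those resources, here's a minimal sketch (the server name is a placeholder) that reports capacity and free space for a server's local volumes over WMI:

#Report capacity and free space for all local volumes on a server
$server = "MYSQLSERVER"
Get-WmiObject Win32_Volume -ComputerName $server -Filter "DriveType=3" |
	select @{n="Server";e={$server}},
		Name,
		@{n="CapacityGB";e={[math]::Round($_.Capacity/1GB,2)}},
		@{n="FreeGB";e={[math]::Round($_.FreeSpace/1GB,2)}},
		@{n="PercentFree";e={[math]::Round(($_.FreeSpace/$_.Capacity)*100,2)}} |
	format-table -autosize

Swap the format-table for Export-Csv or Write-DataTable and you've got the start of a space history you can trend against.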

With these basic tools, you can start monitoring your systems for continued operation as well as being alerted when critical issues arise.  From here, you can then move to monitor other aspects of your server and database health.

Survival Monitoring

The IT world is a jungle. Countless threats lurk like predators, ready to devour us if we're not careful. Seemingly benign events can quickly turn into raging panic fests and danger lives everywhere that we're not looking. To survive you need to be prepared and proactive. Unfortunately, many of us are thrown into this with little more than a pat on the back and a smile, as if we just got dropped out of a helicopter into the African jungle with little more than a pack of chewing gum and a pocket knife. Yet, we need to survive not just the next day, but the weeks and months ahead of us.

Because we need to survive, there’s some basic stuff we need to focus on in order to ensure our survival. It’s not everything we need to live happy, contented careers, but the minimum elements we need to watch in our environments to make sure we live to see the next day. If we were trapped in the wilderness, we’d first focus on shelter, fire, and food. In database terms, we need to first keep an eye on backups, services, and disk space if we want to make it to the next day.

Shelter from the storm

The most important item in a DBA’s life is backups. We can have screaming disk, tons of CPU, and all sorts of clustering, but that means nothing if our files get corrupted or the building burns down. Just like shelter in the wilderness is a place where we can always find protection, our database backups will always give us something to recover to.

Keep an eye on three things when it comes to your backups. First, make sure they’re actually occurring. Look to the backupset table in msdb for this, because it will tell you exactly when your backups are occurring, whether they’re log backups, fulls, or differentials. Next, where are your backups located? Backups won’t do much for you if they’re stored to the local computer and then that computer’s hard drive burns up or gets corrupted. Make sure that your backup files get to another location. Finally, make sure your backups work. Just because you take a backup doesn’t always mean that backup is reliable. Perform restores when it’s not an emergency to validate your backups, so you’ll know things will work when it’s an emergency.

Backups are your safety net. No matter what else happens, you should always have them to fall back on. It may not be pretty, but you’ll be glad they’re there when you need them.

Give me fuel, give me fire

Fire gives us the energy to get things done, whether it’s keeping us warm or being used to cook food. This is the same with your SQL Server services. If these aren’t running, your databases are down and your company is losing money. We can’t always prevent the interruption, but we need to be ready to respond when that interruption occurs. As DBAs we need to be proactive and watch our services.

Also, we can’t limit this just to the SQL Server database service. How many of you run SQL Agent jobs to perform your backups and maintenance? I know I do. If the Agent service is down, the databases will be working fine, but none of that other work is getting done. To boot, you probably won’t be getting any notifications about these jobs not running, so this will be one big blind spot.

We can't take it on faith that services will start automatically. Sometimes they stop for completely legitimate reasons. It's our job to make sure they're up and running, and very few things are worse than that surprise call about something not running because a SQL-related service is down (one thing that is worse: not having a backup, see above). Watch your services and you'll sleep better and warmer.

How can you have any pudding if you don’t eat your meat?!

People got to eat. Once we have a place to sleep and fire to keep us warm, this is the next thing that we need to keep us going. For databases, this food is disk space. We could expand it out to CPU and RAM, but I’ve seen many a server that will limp along when these are consumed and stop stone cold when a file can’t grow anymore because the server ran out of space. If we want a happy database, we need to keep our database fed.

Primarily, watch the free space on your drives (wherever your files are stored), but also keep an eye on the free space within your files. You need to know when your files are going to grow and consume your space. The immediate survival goal is to make sure your server has enough disk to keep running, but you also need to monitor how quickly that disk is being consumed so that you can be ready to add more as necessary.
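
I'll get to fuller solutions in another post, but as a quick, hedged sketch of the in-file part (the server and database names here are placeholders), FILEPROPERTY can report the free space inside each file:

#Free space inside the data and log files of a single database
Import-Module sqlps -DisableNameChecking

$query = @"
SELECT name,
	size/128.0 AS size_mb,
	size/128.0 - CAST(FILEPROPERTY(name,'SpaceUsed') AS int)/128.0 AS free_mb
FROM sys.database_files;
"@

Invoke-Sqlcmd -ServerInstance "MYSQLSERVER" -Database "MyDatabase" -Query $query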

Getting by

Please note, doing all of the above doesn’t guarantee that your server will hum along happily. This isn’t happiness, this is survival. This is the bare minimum you want to do to ensure your company’s service and data. That’s the thing about monitoring: there are hundreds of counters and statistics you can watch, it’s up to you to figure out which of those are important. That’s why you want to start with the fundamentals first, or you could be putting your data, your job, and your company at risk.

I wanted to start with the overview of this strategy. Stay tuned for my next post (might be next week or next Thursday, depending on my schedule), where I will cover some technical solutions for this monitoring: some SQL and PowerShell scripts you can use to keep an eye on all of the above. If you want to get a head start, take a look at my post on backupset or look at Brent Ozar's(b|t) sp_blitz.

 

Professionals have standards

A couple weeks ago I wrote about establishing Service Level Agreements (SLAs) before you actually start building out a monitoring solution for your environment. These SLAs help you define what you need to be monitoring and give you some control when dealing with the non-DBAs in the company. Now that these expectations have been set on service, your next step is to define the other side of the equation: what you should expect from the databases you must manage. After all, we can promise a level of service, but it's difficult to meet that promise if there are underlying issues with a database. So now we need to define database standards.

Remember that SLAs are agreements, where both sides have come to an understanding about service expectations. Once we have these, we have to tell developers and third party solutions what is required of them so that we can meet these expectations. Do you want to be saddled with another poorly designed database or not have the hardware to support your transaction load? With a standards document, we can give our non-DBAs clear and certain documentation of what our systems should look like.

Feelings?

Just like many other things we do as DBAs, we know what a good database looks like. The problem is no one else does. Many developers don’t understand the impact of not having a clustered index or the value of defining foreign keys. Executives get confused by disk requirements or backup strategies. Often, because they don’t know what we know, decisions will be made without these considerations. By providing documented standards, the folks who don’t “get it” don’t have to, they can refer to the standards document. If they don’t, we have that document to show them why we can’t support an application the way they want.

Unfortunately, standards are a little harder to define because there are just so many things that go into what makes a database. This is why we want to start with an SLA framework, so we have some targets to use for building these standards out. We also need an understanding of our environments and what sort of systems our business can and will support (after all, we can't all get PDWs). The key is to start with what you know and, as you start to lay out these requirements, your standards will start to fall into place. Consider these examples as bread crumbs to get you headed in the right direction:

  • Database file layouts: Require separation of your database files, so that you can have documentation to back up disk requests for servers.
  • Naming standards: Keep your developers from creating those pesky “sp_” stored procedures. Also, help keep things organized across your environments.
  • Archiving: While you may not be able to explicitly define an archiving strategy, by putting requirements on what kind of data retention you can support, you can proactively manage your disk usage and performance.
  • Indexing strategy: Keep those heap tables from cluttering your database as well as provide guidelines to your developers so that they don’t over-index.

Keep in mind this is not anywhere close to an exhaustive list of what you can build standards on. It will probably take you a fair amount of work to construct your standards document, but it is effort well spent. Remember that you’re not only trying to educate the non-DBAs in your company, but providing a safety net as well for yourself.

The difference is…

We all hate documentation. It is tedious work that keeps us away from all the fun stuff. A lot of times, it feels like wasted work because who reads it, right? The thing is, people will read your documentation, if you make it available. You read a lot of blogs? That’s just us, documenting. So if you create your SLAs and standards and publish them internally, people will pay attention. And if they don’t? Well, you will have a starting point when you say “no”. “Sorry, that design is not supported by our published standards,” has a lot more weight when you actually have published standards.

The developers, the executives…they just want things to work, but they won’t give you what you need unless they know what you need. It’s your job to be proactive and create these policies, because no one else will do it for you. You hear it a lot, how you should be a proactive DBA, not the one that’s always fighting fires, and to do that you need a plan. SLAs and standards, these are your plan, and I challenge you to start on them today so you can take back control of SQL Server and your DBA life.

P.S.  Thematic content for the post headings can be found here (PG-13ish).

Monitoring – Where to start?

It's real easy in the tech industry to jump in and start slinging code or writing configs.  Most of us learn this way, throwing technological spaghetti up against the wall and seeing what sticks.  It's not what we want to do, though, when we're looking to put something long lasting in place.  We need a plan for understanding and measuring the health of our SQL Servers, something we can take to the check writers to show them how we're doing and give them measurable indicators of success.  This is where Service Level Agreements (SLAs) come in.

Ah, you were afraid I was going to say that?  I know for some people, "SLAs" is as bad as any other four letter word.  I hear it all the time: "I've been doing this for a long time, I know what I need to do."  And it's true, much of our work comes naturally: how much space do we have on disk, what's our index fragmentation, is SQL Server up or down, etc.  The SLA, however, is what we can use to not only communicate to our users what we're watching, but also help control things so we're not waking up in the middle of the night because a developer dropped a table.

SQL Server high (level)

We need to keep in mind that most of our management doesn’t care about DBCC checks and backup strategies, they just want to know that the database is up and working.  We’ll take care of the details.  Our SLAs provide a framework that helps our users define operational standards without getting lost in the weeds.  My approach is to outline this by breaking my SLA structure into 3 general areas:

Availability: This is the easy one, are your databases or servers up and available?  When we define this SLA, we’re telling our users that their data will be available when they need it.  Some of the elements that fall under this are:

  • Is a database available 24/7 or only during business hours?
  • Do we need high availability?  What sort of service interruption can the database support?
  • In the event of a disaster, how much data will we lose (Recovery Point Objective) and how long will we be down (Recovery Time Objective)?
  • What sort of resource overhead, such as the free space on disk, is necessary to keep the database up and running?
  • When will the database be down for maintenance and how long can that maintenance be?

Performance: Now we get into a little grey area, because we're telling our users how well the system will perform.  It's tough to put any parameters around "the database is slow", but we can put some hard numbers around this, primarily by looking at the server health metrics.  Some things to consider are:

  • Acceptable query response times for standardized processes.
  • Thresholds for CPU, memory, and disk I/O.
  • Blocking process monitoring and resolution.
  • Index and Statistics maintenance.
  • Batch requests per second.

Support: And now it gets even fuzzier.  There are always going to be drive-bys and break/fix requests.  Whether you have a service ticket system or some other method for tracking database team requests, you'll want to define your response and resolution times as well as work with your customers to determine what's acceptable from a customer service perspective.  Also, if you have custom code or application procedures, you'll probably want to document what level of response items like failed SQL jobs, ETL loads, and similar processes will get.

Not being all things

The next thing to define is your tiers of service.  Just like you shouldn’t be all things to all users, you won’t be able to support all your applications in the same way.  By defining your different application types, you can set reasonable expectations for your support.  This area is fairly open, because you might have only two types of applications, or you might have four or five.  A sample breakdown might look like this:

  • Critical 24X7 applications – Applications in this category must be up 24 hours a day, 7 days a week and be able to handle continuous transaction load.  Interruption or degradation of service is considered a priority and requires immediate response.
  • High demand business applications – Applications in this category need to be available 24 hours a day during the business week, midnight Monday to midnight Friday.  Interruption of service requires quick response.
  • Standard business applications – Applications in this category should be available during standard business hours, Monday through Friday.  Interruption of service should be limited during business hours.
  • Lower environments – All non-productions environments should be maintained and supported to allow development resources to be able to complete projects.  Downtime should be kept to a reasonable minimum during working hours.

Your tiers will likely be driven by availability needs first.  The key is being able to categorize your application needs so that you can respond appropriately.  This is also to protect you, because if you’re burned out trying to get development boxes taken care of, you won’t be able to handle things if that webserver handling your company’s sales goes down.

Agreeing on something

The key with creating these frameworks is to remember it’s an agreement between you and your users or customers.  You want to work with them to create reasonable expectations so that if something does go haywire, both sides will know how the issue will be handled and what sort of time frames are in place for resolution.

The other piece of this being an agreement is your users or customers will need to meet you part way.  If you have SLAs defined for a highly available 24X7 application, you want your developers to make sure that the application can meet those needs.  It’s key that your users understand their stake in providing good service, such as calling to report errors or ensuring that an application can handle a database failing underneath it.

It’s a start

I’ll be honest, I’m still working on fleshing much of the above out as I roll out an SLA framework at my company.  What you see here will undergo some tweaking and fine tuning over the next few months as I get some more definition around it, but the intent is to follow the basic structure I’ve laid out here.  There will be more specifics under each of the three areas that will be shared as they develop.  If you have anything to contribute or are curious about, please feel free to leave your input or questions in the comments and I will respond as I can.

Preparing for 2012 (part 3)

If you missed the earlier installments, check out part 1 and/or part 2.

Getting Proactive

SQL Server is a large product and it’s getting bigger every day. Within it we have engine performance, query tuning, encryption, disaster recovery, report writing, data analysis, ETL processing…the list is a lot like the Energizer Bunny in how long it is. Trying to know all of that is pretty much impossible, so the database administrators who truly excel in the field become experts on a certain area of SQL Server and rely on their SQL family to help with the areas they may not know so well.

It's hard to do, because in our jobs we're usually only one of a couple folks who do what we do and we are expected to handle whatever gets thrown at us. I typically have to touch 4-5 distinctly different areas of our discipline just to fight the fires of the day. Like all fires, they need to be put out quickly. I can't spend a lot of focused time learning about those different areas, because the system can't be down or running slowly.

This, of course, ends up being a very reactive approach to learning. We only learn what we need in order to solve the problem at hand, then move on to the next one. Even then, we usually don't learn much; we just copy a script or grab a job, verify that it won't break anything, then implement and get going. It's continually preached (rightfully so!) that managing our databases in a reactive fashion is a recipe for disasters, all-nighters, and RGEs, so why do we think it would be any different with our education?

The challenge to myself is to stop being reactive in my learning. While it’s cool to be like a sponge in a sink, soaking up whatever is available, I’m only able to scratch the surface of whatever happens to be the topic du jour. To do that I need to take a discipline within SQL Server, focus on it, and make it my own.

Focus, Daniel-san!

Unfortunately, there’s no crane kick or leg sweep that is the magic key to my DBA success. Any number of areas would be valuable, I just have to pick one. My decision, which really only came to me in the last month, is to focus on SQL Server monitoring. The reasons are simple:

  1. In order to have the most stable environment, you need to know what’s going on in your environment. To do this, you need established baselines and systems to monitor your environment.
  2. To stop problems before they become serious, you need to have systems in place to watch for issues and deviance in your operations.
  3. You can’t start corrective action, such as adding hardware resources or query tuning, without first understanding the problem. Proper monitoring will give you the signposts to indicate what your problem is.
  4. If you want to impress the boss, the easiest way is to create reports and display the data on your environment and how it is changing. While I tend to think that doing your job well is the best way to impress the boss, your boss may not always understand how good of a job you're doing unless you show him the data.

Over the next year, much of this blog will be about researching and implementing monitoring. It will include planning elements such as service level agreements (SLAs), third party monitoring tools, and reporting queries, along with other items as I come across them. Hopefully it will become more than just my brain dump area, but it might not. Either way, you should be able to follow along with the solutions and methodologies I develop and learn along with me.

Look out 2012, it’s gonna be a wild ride!

Brushing up on your (backup) history

Backups are huge, right? As administrators, they should never be far from our minds. Backups are one of the cornerstones of disaster recovery and safeguarding our data. It only makes sense that we want to keep an eye on when backups happen and if they’re successful. Recently, I’ve been making a lot of use of a simple little query that shows me how my backups are doing:

SELECT
  database_name,
  type,
  max(backup_start_date) last_backup
FROM
  msdb.dbo.backupset
GROUP BY
  database_name,
  type

Yes, this is a VERY straightforward query, no hidden magic here. But with this one little SELECT statement, I have been able to quickly establish if my backups are getting run, what specific databases may have problems, and where I need to focus my troubleshooting. By breaking it down by database and type, I am also able to determine whether or not my log backups are running (a key factor in out-of-control log growth), if I'm properly maintaining my recovery chain, and what backups are going to be necessary to restore my database.

Unlocking the secrets

The key is the backupset table in msdb. This table stores the history of all successful backup operations on the server (emphasis on successful). It's not just native SQL backups, but those taken by third party tools as well, so we can have insight into any backups (or lack of them) that are happening. I first came across this table working with Brent Ozar's(b|t) Blitz script. Brent introduces this query as part of his server take-over process to establish whether or not your backups are being taken, which is the primary reason I check it. Backing up our data is vital and not doing it would probably be an RGE. If nothing else, this table says "Hey dummy, I need you to look at something!"

If we dig in a little more, though, we can find other useful information. Take the backup_start_date and backup_finish_date fields. By using another simple query, we can then see how our backups perform over time:

SELECT
  database_name,
  backup_start_date,
  datediff(mi,backup_start_date,backup_finish_date) backup_duration,
  backup_size/(datediff(mi,backup_start_date,backup_finish_date)+1) bytes_per_minute
FROM
  msdb.dbo.backupset

Backup duration is a handy metric for base-lining our systems. After all, if our backups start taking longer and longer, it could be an indication of resource contention. If we’re backing up to local disk, we could be choking our throughput. If we’re backing up to a network share, we could be seeing increased bandwidth usage. Also, our databases could just be getting bigger and might require more resources to support them. By including the backup_size field, we can see if longer backups are the result of simply backing up more data or if there might be something else to it.

If we use the backup_size field and its sibling, the compressed_backup_size field, we can gain some additional insights. Since backups will grow in relation to how much the database grows, this gives us a quick and dirty way to report on those growth patterns. Ideally, you'd want to track the actual database size, but if you don't have something to capture that historically, this will work in a pinch. Additionally, by combining it with the compressed backup size, you can get an idea of how much space you're saving if you use the WITH COMPRESSION option in SQL 2008. A quick gotcha here: this will not show the benefits of compression using third party backup tools. Both fields will show the fully compressed backup size if you're using something like Red Gate SQL Backup, Quest LiteSpeed, or Idera's SQL Safe.
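
As a rough sketch of that kind of check (the server name is a placeholder, and this assumes the sqlps module for Invoke-Sqlcmd, though the query runs just as well straight from Management Studio):

#Compare native backup size to compressed backup size per database (SQL 2008+)
Import-Module sqlps -DisableNameChecking

$query = @"
SELECT database_name,
	max(backup_finish_date) AS last_backup,
	sum(backup_size)/1048576.0 AS total_backup_mb,
	sum(compressed_backup_size)/1048576.0 AS total_compressed_mb
FROM msdb.dbo.backupset
WHERE type = 'D'
GROUP BY database_name;
"@

Invoke-Sqlcmd -ServerInstance "MYSQLSERVER" -Query $query | format-table -autosize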

Cleaning up after yourself

Now not everything with this table is sunshine and rainbows. Maintaining this history is one of the more commonly overlooked areas of SQL Server administration. Ever wonder why your msdb database keeps growing and growing over time? Bingo, it’s because your backupset table keeps increasing with all the additional history stored inside it. What you want to do is schedule a regular run of sp_delete_backuphistory to keep your history to a relevant period. What that time frame is depends on your database size and needs, though I personally go with 90 days as my default.
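
As a hedged example (the server name is a placeholder, and this could just as easily be a plain T-SQL Agent job step), a scheduled cleanup might look like this:

#Prune msdb backup history older than 90 days
Import-Module sqlps -DisableNameChecking

$cutoff = (Get-Date).AddDays(-90).ToString("yyyy-MM-dd")
Invoke-Sqlcmd -ServerInstance "MYSQLSERVER" -Database "msdb" `
	-Query "EXEC dbo.sp_delete_backuphistory @oldest_date = '$cutoff';"

Hang that off a weekly Agent job and msdb stays at a predictable size.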

Those who don’t learn from history…

Successful database administrators are the ones who make monitoring a part of their practice. With it, we can be proactive with our systems and solve problems before they occur. The great thing about the backupset table is that it’s a very easy way to start checking up on your instances, leveraging information that is already a part of the SQL Server core functionality. You don’t need any fancy tools or complex systems, just a few basic queries to get started understanding how your servers, databases, and systems perform on a day to day basis.