Art of the DBA

Monitoring – Where to start?

It’s real easy in the tech industry to jump in and start slinging code or writing configs.  Most of us learn this way, throwing technological spaghetti up against the wall and seeing what sticks.  It’s not what we want to do, though, when we’re looking to put something long lasting in place.  We need a plan for understanding and measuring the health of our SQL Servers, something we can take to the check writers to show them how we’re doing and give them measurable indicators of success.  This is where Service Level Agreements (SLAs) come in.

Ah, you were afraid I was going to say that?  I know for some people, “SLAs” is as bad as any other four-letter word.  I hear it all the time: “I’ve been doing this for a long time, I know what I need to do.”  And it’s true, much of our work comes naturally: how much space do we have on disk, what’s our index fragmentation, is SQL Server up or down, etc.  The SLA, however, is what we can use not only to communicate to our users what we’re watching, but also to help control things so we’re not waking up in the middle of the night because a developer dropped a table.

SQL Server high (level)

We need to keep in mind that most of our management doesn’t care about DBCC checks and backup strategies; they just want to know that the database is up and working.  We’ll take care of the details.  Our SLAs provide a framework that helps our users define operational standards without getting lost in the weeds.  My approach is to outline this by breaking my SLA structure into three general areas:

Availability: This is the easy one.  Are your databases or servers up and available?  When we define this SLA, we’re telling our users that their data will be available when they need it.  Some of the elements that fall under this are:

  • Is a database available 24/7 or only during business hours?
  • Do we need high availability?  What sort of service interruption can the database support?
  • In the event of a disaster, how much data will we lose (Recovery Point Objective) and how long will we be down (Recovery Time Objective)?
  • What sort of resource overhead, such as free space on disk, is necessary to keep the database up and running?
  • When will the database be down for maintenance and how long can that maintenance be?
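
To make the availability questions above concrete, here is a minimal sketch of how an uptime target turns into a downtime budget, and how a Recovery Point Objective can be checked against a backup schedule.  The function names and the 99.9% and 15-minute figures are assumptions for illustration, not numbers from any particular SLA.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed over `period` for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return period * ((100 - availability_pct) / 100)


def meets_rpo(log_backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is roughly one backup interval, so it must fit in the RPO."""
    return log_backup_interval <= rpo


# Hypothetical targets: 99.9% availability over a 30-day month, 15-minute RPO.
budget = downtime_budget(99.9, timedelta(days=30))
print(f"Allowed downtime per 30 days: {budget}")
print("RPO met:", meets_rpo(timedelta(minutes=10), timedelta(minutes=15)))
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is a useful number to put in front of users when discussing maintenance windows.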

Performance: Now we get into a little grey area, because we’re telling our users how well the system will perform.  It’s tough to put any parameters around “the database is slow”, but we can put some hard numbers around this, primarily by looking at the server health metrics.  Some things to consider are:

  • Acceptable query response times for standardized processes.
  • Thresholds for CPU, memory, and disk I/O.
  • Blocking process monitoring and resolution.
  • Index and Statistics maintenance.
  • Batch requests per second.
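
One way to put hard numbers around performance is to write the thresholds down as data and evaluate each metric sample against them.  The sketch below assumes hypothetical metric names and warning/critical values; real thresholds would come out of the agreement with your users, and the samples from whatever collector you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    warning: float
    critical: float


# Hypothetical values for illustration only.
THRESHOLDS = [
    Threshold("cpu_pct", warning=75.0, critical=90.0),
    Threshold("memory_pct", warning=80.0, critical=95.0),
    Threshold("disk_io_latency_ms", warning=20.0, critical=50.0),
]


def evaluate(sample: dict[str, float], thresholds: list[Threshold]) -> dict[str, str]:
    """Classify each monitored metric as ok, warning, critical, or missing."""
    results = {}
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            results[t.metric] = "missing"  # no reading was collected
        elif value >= t.critical:
            results[t.metric] = "critical"
        elif value >= t.warning:
            results[t.metric] = "warning"
        else:
            results[t.metric] = "ok"
    return results


# A made-up sample: CPU is elevated and no disk reading arrived.
print(evaluate({"cpu_pct": 82.0, "memory_pct": 60.0}, THRESHOLDS))
```

Treating a missing reading as its own state is a deliberate choice here: a metric that stops reporting is often as interesting as one that breaches its threshold.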

Support: And now it gets even fuzzier.  There are always going to be drive-bys and break-fix requests.  Whether or not you have a service ticket system or some other method for tracking database team requests, you’ll want to define your response and resolution times as well as work with your customers to determine what’s acceptable from a customer service perspective.  Also, if you have custom code or application procedures, you’ll probably want to document what level of response items like failed SQL jobs, ETL loads, and similar processes will get.

Not being all things

The next thing to define is your tiers of service.  Just like you shouldn’t be all things to all users, you won’t be able to support all your applications in the same way.  By defining your different application types, you can set reasonable expectations for your support.  This area is fairly open, because you might have only two types of applications, or you might have four or five.  A sample breakdown might look like this:

  • Critical 24X7 applications – Applications in this category must be up 24 hours a day, 7 days a week and be able to handle continuous transaction load.  Interruption or degradation of service is considered a priority and requires immediate response.
  • High demand business applications – Applications in this category need to be available 24 hours a day during the business week, midnight Monday to midnight Friday.  Interruption of service requires quick response.
  • Standard business applications – Applications in this category should be available during standard business hours, Monday through Friday.  Interruption of service should be limited during business hours.
  • Lower environments – All non-production environments should be maintained and supported to allow development resources to be able to complete projects.  Downtime should be kept to a reasonable minimum during working hours.
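
The sample tiers above can also be captured as a small lookup table, which makes it easy to sort incoming incidents by how much they matter.  This is only a sketch: the tier names mirror the list above, but the `response_minutes` values and the incident names are hypothetical placeholders, since actual response times are something to negotiate with your customers.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Lower value means higher priority."""
    CRITICAL_24X7 = 1
    HIGH_DEMAND = 2
    STANDARD = 3
    LOWER_ENV = 4


@dataclass(frozen=True)
class TierPolicy:
    coverage: str
    response_minutes: int  # hypothetical target, not from the article


POLICIES = {
    Tier.CRITICAL_24X7: TierPolicy("24 hours a day, 7 days a week", 15),
    Tier.HIGH_DEMAND: TierPolicy("24 hours a day during the business week", 30),
    Tier.STANDARD: TierPolicy("standard business hours, Monday through Friday", 120),
    Tier.LOWER_ENV: TierPolicy("working hours, best effort", 480),
}


def triage(incidents: list[tuple[str, Tier]]) -> list[str]:
    """Order incident names by tier priority, keeping arrival order within a tier."""
    return [name for name, _ in sorted(incidents, key=lambda item: item[1])]


queue = [
    ("dev box out of disk", Tier.LOWER_ENV),
    ("sales web database down", Tier.CRITICAL_24X7),
    ("reporting job failed", Tier.STANDARD),
]
for name in triage(queue):
    print(name)
```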

Your tiers will likely be driven by availability needs first.  The key is being able to categorize your application needs so that you can respond appropriately.  This is also to protect you, because if you’re burned out trying to get development boxes taken care of, you won’t be able to handle things if that webserver handling your company’s sales goes down.

Agreeing on something

The key with creating these frameworks is to remember it’s an agreement between you and your users or customers.  You want to work with them to create reasonable expectations so that if something does go haywire, both sides will know how the issue will be handled and what sort of time frames are in place for resolution.

The other piece of this being an agreement is that your users or customers will need to meet you partway.  If you have SLAs defined for a highly available 24X7 application, you want your developers to make sure that the application can meet those needs.  It’s key that your users understand their stake in providing good service, such as calling to report errors or ensuring that an application can handle a database failing underneath it.

It’s a start

I’ll be honest, I’m still working on fleshing much of the above out as I roll out an SLA framework at my company.  What you see here will undergo some tweaking and fine tuning over the next few months as I get some more definition around it, but the intent is to follow the basic structure I’ve laid out here.  There will be more specifics under each of the three areas that will be shared as they develop.  If you have anything to contribute or are curious about, please feel free to leave your input or questions in the comments and I will respond as I can.

One Comment

  1. […] Monitoring – Where to start? – Taking it to the next level, Service Level that is, it’s Mike Fal (Blog|Twitter) showing how you can lay the ground work for professional quality monitoring for your customers. […]
