Art of the DBA Rotating Header Image

SLAs

The Hierarchy of Monitoring Needs

For those not aware, I’ve been selected as a presenter at this year’s PASS Summit in Charlotte, NC.  I’m deeply honored by the selection and look forward to the experience.  The topic is something I’ve been working on for sometime now:  How should administrators approach monitoring?  This seems like a simple question, but go Bingle “SQL Server Monitoring” and see what you get.  Go ahead, I’ll wait.

The results are either tools or metrics.  While these are useful, it highlights a significant gap out there: a methodology.  The closest you’ll find out there is Adam Machanic’s (@adammachanic) Ten Commandments Of SQL Server Monitoring (which applies to all monitoring, really, but I digress).  The goal with my PASS presentation is to provide a “how to” plan for creating your overall monitoring strategy.

Hierarchy of Monitoring Needs

How does this methodology work?  Several years ago I was introduced to Maslow’s Hierarchy of Needs.  It’s a structured view of what humans need, built around a tiered view where a person addresses each level before they are concerned with the next set of needs.  The template carries well, such that Brent Ozar(@brento) even wrote about his own Hierarchy of Database Needs a while back.

So what’s my approach?  To get the detail, you’ll need to come to my PASS session, but I’ll let the overview whet your appetite:

Survival – There are some factors database administrators must be concerned with.  As guardians of the data, we have to ensure three things about our systems before we can worry about anything else:

    • Reliable backups – You hear this ad nauseum, but it’s true.  Without backups, we’re crossing the high wire without a net.  Note, also, that these are “reliable”.  You need to know more than your backups are happening, but can you find them and are they usable.
    • Service Availability – Is SQL Server up?  Is it responding to queries?  While this seems like a “duh” moment, if our users can’t access their data, then we’re in almost as much trouble as we were if we can’t recover from a backup.
    • Space – This is not long range capacity planning that we will worry about later down the road, but a validation that we’re not “red lined” for space.  Because if we run out of space, we affect availability and use of our databases.

Knowledge  – In order to identify what to monitor, we need to know what to be monitoring for, otherwise we’re wandering in the mountains without a map.  We establish this by working with the business to create our Service Level Agreements.  I wrote about SLAs a year ago and my approach is still the same:

    • Availability – Users want to know their data is queryable, protected, and secure.  We need to define our customer’s expectations for high availability, disaster protection, maintenance windows, and security.
    • Performance – Users want their data fast, but we need to put definition around this.  What defines a query as to slow?  How long can things be blocked? What sort of resource usage is allowed within our environments?
    • Support – While this only partially affects our monitoring, our users want to know who’s “throat to choke” if there’s an issue.  For monitoring, that defines escalation patterns and critical/non-critical alerts.

Direction – Metrics and whether they’re to high or to low are commonly what DBAs think of when they discuss monitoring.  Are my batches/second to high?  What’s my I/O throughput?  Page Life Expectancy? Wait types? The list goes on and on.  However, like many things, DBAs want to put the cart before the horse and just start looking at stats.  By dealing with SLAs first, we’ll be better equipped to monitor what’s important to the business rather than what’s important to us.

Actualization – After we know what to look for, it’s time to decide how we look for it, and this leads us into the discussion of tools.  The problem is, there are a load of tools out there you can use.  Do you go with a specialized third product?  Maybe some larger, more generic system that integrates with other parts of the enterprise?  Do you write your own, using what’s freely available and paying the cost in time to build?  Whatever the case, having a list of metric needs equips us with the knowledge necessary to identify what tool set best suits our situation.

Understanding – One of the trickiest parts of monitoring is knowing what’s a problem and what is just normal behavior.  Once our metrics are defined and our tools are set up, we need to collect baselines and map trends to understand the health of our enterprise.  While our SLAs can give us a starting point, we need to study our environments over time.

Refinement – Once everything is in place, we can start the cycle of fine tuning our monitoring.  Systems and processes change over time.  New storage will need to be added and new applications brought on line.  With we are able to periodically review our trends and anticipate future needs.  Managing our metrics based on these changes and adapting our monitoring is an ongoing process and is never set in stone.

Just so you know, this is just the 30,000 foot view that gives administrators the basic map.   At Summit I’ll discuss the territory in a little more detail as well as some scripts, templates, and additional information to help you with your own monitoring needs.  I’m very excited to be sharing this at Summit and hope that you’ll all join me for it.

Professionals have standards

A couple weeks ago I wrote about establishing Service Level Agreements (SLAs) before you actually start building out a monitoring solution for your environment. These SLAs help you define what you need to be monitoring and give yourself some control when dealing with the non-DBAs in the company. Now that these expectations have been set on service, your next step is to define the other side of the equation: What you should expect from the databases you must manage. After all, we can promise a level of service, but it’s difficult to meet that promise if there are underlying issues with a database. So now we need to define database standards.

Remember that SLAs are agreements, where both sides have come to an understanding about service expectations. Once we have these, we have to tell developers and third party solutions what is required of them so that we can meet these expectations. Do you want to be saddled with another poorly designed database or not have the hardware to support your transaction load? With a standards document, we can give our non-DBAs clear and certain documentation of what our systems should look like.

Feelings?

Just like many other things we do as DBAs, we know what a good database looks like. The problem is no one else does. Many developers don’t understand the impact of not having a clustered index or the value of defining foreign keys. Executives get confused by disk requirements or backup strategies. Often, because they don’t know what we know, decisions will be made without these considerations. By providing documented standards, the folks who don’t “get it” don’t have to, they can refer to the standards document. If they don’t, we have that document to show them why we can’t support an application the way they want.

Unfortunately, standards are a little harder to define because there are just so many things to go in to what makes a database. This is why we want to start with an SLA framework, so we have some targets to use for building these standards out. We also need an understanding of our environments and what sort of systems our business can and will support (after all, we can’t all get PDWs). The key is to start with what you know and, as you start to lay out these requirements, your standards will start to fall in to place. Consider these examples as bread crumbs to get you headed in the right direction:

  • Database file layouts: Require separation of your database files, so that you can have documentation to back up disk requests for servers.
  • Naming standards: Keep your developers from creating those pesky “sp_” stored procedures. Also, help keep things organized across your environments.
  • Archiving: While you may not be able to explicitly define an archiving strategy, by putting requirements on what kind of data retention you can support , you can proactively manage your disk usage and performance.
  • Indexing strategy: Keep those heap tables from cluttering your database as well as provide guidelines to your developers so that they don’t over-index.

Keep in mind this is not anywhere close to an exhaustive list of what you can build standards on. It will probably take you a fair amount of work to construct your standards document, but it is effort well spent. Remember that you’re not only trying to educate the non-DBAs in your company, but providing a safety net as well for yourself.

The difference is…

We all hate documentation. It is tedious work that keeps us away from all the fun stuff. A lot of times, it feels like wasted work because who reads it, right? The thing is, people will read your documentation, if you make it available. You read a lot of blogs? That’s just us, documenting. So if you create your SLAs and standards and publish them internally, people will pay attention. And if they don’t? Well, you will have a starting point when you say “no”. “Sorry, that design is not supported by our published standards,” has a lot more weight when you actually have published standards.

The developers, the executives…they just want things to work, but they won’t give you what you need unless they know what you need. It’s your job to be proactive and create these policies, because no one else will do it for you. You hear it a lot, how you should be a proactive DBA, not the one that’s always fighting fires, and to do that you need a plan. SLAs and standards, these are your plan, and I challenge you to start on them today so you can take back control of SQL Server and your DBA life.

P.S.  Thematic content for the post headings can be found here (PG-13ish).

Monitoring – Where to start?

It’s real easy in the tech industry to jump in and start slinging code or writing configs.  Most of us learn this way, throwing technological spaghetti up against the wall and seeing what stick.  It’s not what we want to do, though, when we’re looking to put something long lasting in place.  We need a plan for understanding and measuring the health of our SQL Servers, something we can take to the check writers to show them how we’re doing and give them measurable indicators of success.  This is where Service Level Agreements(SLAs) come in.

Ah, you were afraid I was going to say that?  I know for some people, “SLAs” is as bad as any other four letter word.  I hear it all the time “I’ve been doing for a long time, I know what I need to do.”  And it’s true, much of our work comes naturally: how much space do we have on disk, what’s our index fragmentation, is SQL Server up or down, etc.  The SLA, however, is what we can use to not only communicate to our users what we’re watching, but also help control things so we’re not waking up in the middle of the night because a developer dropped a table.

SQL Server high (level)

We need to keep in mind that most of our management doesn’t care about DBCC checks and backup strategies, they just want to know that the database is up and working.  We’ll take care of the details.  Our SLAs provide a framework that helps our users define operational standards without getting lost in the weeds.  My approach is to outline this by breaking my SLA structure into 3 general areas:

Availability: This is the easy one, are your databases or servers up and available?  When we define this SLA, we’re telling our users that their data will be available when they need it.  Some of the elements that fall under this are:

  • Is a database available 24/7 or only during business hours?
  • Do we need high availability?  What sort of service interruption can the database support?
  • In the event of a disaster, how much data will we lose (Recovery Point Objective) and how long will we be down (Recovery Time Objective)?
  • What sort of resource overhead, such as the free space on disk necessary keep the database up and running?
  • When will the database be down for maintenance and how long can that maintenance be?

Performance: Now we get into a little grey area, because we’re tell our users how well the system will perform.  It’s tough to put any parameters around “the database is slow”, but we can put some hard numbers around this, primarily by looking at the server health metrics.  Some things to consider are:

  • Acceptable query response times for standardized processes.
  • Thresholds for CPU, memory, and disk I/O.
  • Blocking process monitoring and resolution.
  • Index and Statistics maintenance.
  • Batch requests per second.

Support: And now it gets even fuzzier.  There’s always going to be drive-bys and break fix requests.  Whether or not you have a service ticket system or some other method for tracking database team requests, you’ll want to define your response and resolution times as well as work with your customers to determine what’s acceptable from a customer service perspective.  Also, if you have custom code or application procedures, you’ll probably want to have documented what level of response items like failed SQL job, ETL loads, and similar processes will get.

Not being all things

The next thing to define is your tiers of service.  Just like you shouldn’t be all things to all users, you won’t be able to support all your applications in the same way.  By defining your different application types, you can set reasonable expectations for your support.  This area is fairly open, because you might have only two types of applications, or you might have four or five.  A sample breakdown might look like this:

  • Critical 24X7 applications – Applications in this category must be up 24 hours a day, 7 days a week and be able to handle continuous transaction load.  Interruption or degradation of service is considered a priority and requires immediate response.
  • High demand business applications – Applications in this category need to be available 24 hours a day during the business week, midnight Monday to midnight Friday.  Interruption of service requires quick response.
  • Standard business applications – Applications in this category should be available during standard business hours, Monday through Friday.  Interruption of service should be limited during business hours.
  • Lower environments – All non-productions environments should be maintained and supported to allow development resources to be able to complete projects.  Downtime should be kept to a reasonable minimum during working hours.

Your tiers will likely be driven by availability needs first.  The key is being able to categorize your application needs so that you can respond appropriately.  This is also to protect you, because if you’re burned out trying to get development boxes taken care of, you won’t be able to handle things if that webserver handling your company’s sales goes down.

Agreeing on something

The key with creating these frameworks is to remember it’s an agreement between you and your users or customers.  You want to work with them to create reasonable expectations so that if something does go haywire, both sides will know how the issue will be handled and what sort of time frames are in place for resolution.

The other piece of this being an agreement is your users or customers will need to meet you part way.  If you have SLAs defined for a highly available 24X7 application, you want your developers to make sure that the application can meet those needs.  It’s key that your users understand their stake in providing good service, such as calling to report errors or ensuring that an application can handle a database failing underneath it.

It’s a start

I’ll be honest, I’m still working on fleshing much of the above out as I roll out an SLA framework at my company.  What you see here will undergo some tweaking and fine tuning over the next few months as I get some more definition around it, but the intent is to follow the basic structure I’ve laid out here.  There will be more specifics under each of the three areas that will be shared as they develop.  If you have anything to contribute or are curious about, please feel free to leave your input or questions in the comments and I will respond as I can.