Art of the DBA

TIL: Broken Log Chains

Something that bites me in the butt from time to time is restoring a database in full recovery model and forgetting to take a full backup right away.  Then I find out shortly after, when my log backups start failing and spamming me with error messages.  I’m fairly certain this has happened to you as well.  It’s an easy thing to forget, especially when you’re dealing with non-critical databases.

Humans are typically bad at remembering things.  That’s why we make task lists, have calendar reminders, document processes, and script for automation.  We either have to remind ourselves to do something or (my preference) build something that’s smart enough to correct our oversight.  So when I found a way to do this with broken log chains, I was pretty happy.

The key was a new dynamic management view (DMV) I discovered.  Most people who know me know I love the DMVs and all the information they can provide.  My new discovery is the sys.database_recovery_status view, which provides recovery info for all the databases.  The key field is the last_log_backup_lsn field, which will be NULL if the log chain has somehow been broken.

What’s cool about this is now we have a programmatic way we can use to validate if a full backup needs to be performed.  By adding a quick check to a log backup script, we can make sure we never have a broken log chain.  A quick and dirty way would be:

if (select last_log_backup_lsn from sys.database_recovery_status where database_id = db_id('foo')) is NULL
begin
    backup database [foo] to disk=<<path to full backup>>
end

backup log [foo] to disk=<<path to log backup>>

This is obviously the most simple approach you can use, but the pattern is the important piece.  By combining this with your standard backup scripts (please tell me you have standard backup scripts), you can always ensure that you never get an error because you forgot a full backup.

Because your script will remember for you.
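The same guard generalizes to a whole server's worth of databases.  Here's an illustrative Python sketch of just the decision logic (the function name and data shape are my own invention; in practice you'd feed it the results of a query against sys.database_recovery_status):

```python
def plan_backups(last_log_backup_lsns):
    """Given a dict of database name -> last_log_backup_lsn (None when the
    log chain is broken), return the ordered list of backup actions."""
    actions = []
    for db, lsn in last_log_backup_lsns.items():
        if lsn is None:
            # broken log chain: take a full backup first to re-establish it
            actions.append(("FULL", db))
        # the log backup always runs
        actions.append(("LOG", db))
    return actions
```

Whatever language your maintenance jobs live in, the point is the same: check the LSN, repair the chain if needed, then take the log backup.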

 

#tsql2sday 52: Stop depending on “it depends”

It seems to be a standard practice for data professionals to say “It depends” when asked for suggestions, recommendations, or outright solutions.  This isn’t surprising, because in the “magic” world of technology the actual solution for a problem usually requires some very specific information, and any number of solutions could work.  Add to this that the nature of our world usually puts you at the whip end when things break, so we are naturally hesitant to draw a line in the sand.

Stop it.

Let’s face it, much of what we do is fairy dust and moonbeams to those outside our industry.  They don’t understand what we do, just that we make it happen.  And that’s what managers, directors, veeps, and c-fillintheblank-os want when they ask us questions: how can we make it happen and/or how do we make the broken thing go away?  The problem with saying “It depends”, especially in situations like that, is that it sounds more like “I don’t know”, and that’s not what you’re getting paid for.  You’re the technical expert.  Whether you’re a consultant or a full-time employee, a seasoned veteran or a fresh-faced n00b, you’re being paid to understand the technology and be able to implement solutions.

Many data professionals, however, are afraid to take this sort of stand.  A lot of this stems from the tremendous amount of risk involved, particularly when tied in to the heavy responsibility we tend to bear.  Our job is to protect the company’s data assets and if we screw up it’s not just our head, but it could be seriously damaging to our company.  So we like to hedge our bets.  Unfortunately, a lot of people in our profession will use the phrase “It depends” as a dodge because they’re afraid of completely taking on that responsibility.

Ultimately, they’re afraid of failure.

It’s a common mantra in life coaching that we can’t be afraid of failure.  Failure is what we learn from and how we grow.  Make a mistake, analyze it, and grow from it.  We’re only human, however, and screwing up is scary.  We don’t want to look bad and we don’t want to get in trouble.  That doesn’t help the people who are looking to us for help.  This is when saying “It depends” as a shield turns into a roadblock, hindering both you and your organization from getting anywhere.

So what do we do about it?  Next time someone asks you for a technical opinion and you don’t have enough detail, ask for that detail.  What’s the RTO/RPO? How quickly does this need to perform?  What’s our budget?  Questions like that to refine the answer.  Maybe outline a solution, but caveat it with qualifiers, such as “I’d probably put tempdb on SSDs, but that assumes we can afford that sort of hardware.”  Maybe there’s a better way to do it, maybe you’re just wrong.  But it’s ok to make a mistake, as long as you’re not making the same mistake.

Most of all, I’d suggest to simply remove “It depends” from your vocabulary.  There’s a dozen ways to get the point across that a solution will require some thought and planning, but I found that when I was forced to use something other than this two-word quip, I had to work harder on my response to really explain the dependencies for a solution.  And the non-technical folks around you are ok with that.  Sure, they don’t want you talking above their head, but they also need to know why things need to be done a certain way.

Some folks might call this leadership.  Others, sticking your neck out.  Still others might call this being unnecessarily risky.  I call it doing your job.  Like I said, companies pay us to know our field; it’s time we act like it.  Data is our world, very few people live in it and understand it the way we do, so own that knowledge and champion it.  And stop hiding behind that phrase, “It depends”.

(This month’s T-SQL Tuesday, a grand tradition started by Adam Machanic, is being hosted by Michael J. Swart (@MJSwart).  Great topic choice, sir!)

A PoSH way to skin the cat

Today my friend John Morehouse (@SqlrUs) posted a handy T-SQL script to configure your SQL Server memory based on some rules provided by Jonathan Kehayias (@SQLPoolBoy).  This is a great script and helps automate a lot of the work a DBA needs to do when setting up a new instance.  Hopefully, you know by now about my passion for automation and standardization.  However, when reading John’s post, I had the following internal conversation:

“Wow, self, John’s got a neat little script here.”

“Yeah, but T-SQL is so passé.  Everyone does it.  Plus, not sure how easily we could add that into a server deployment script if we wanted total hands-off-ness.”

“Self, I know what you mean.  Nothing wrong with it, it just doesn’t fit my own particular….oh…particular….”

Idiom?

“Yes!  Idiom!  Hrmmm…how about a Powershell script?”

“A Powershell script?  Brilliant!”

And so I sat down and cranked out the following script, which essentially does what John’s script does, just using Powershell.

<#
.SYNOPSIS
 Configures a SQL Server instance per Jonathan Kehayias' guidelines.
.DESCRIPTION
 This script will configure your SQL Server instance per the guidelines
 found in Jonathan Kehayias' blog post: http://www.sqlskills.com/blogs/jonathan/how-much-memory-does-my-sql-server-actually-need/
 The rules are:
 - 1GB for initial OS reserve
 - +1GB per 4GB server RAM up to 16GB
 - +1GB per 8GB server RAM above 16
.PARAMETER instance
 SQL instance name, i.e. localhost\SQL2012, DBASERVER01
.PARAMETER apply
 Switch parameter; call it if you want to actually apply the changes. Otherwise, a report will be produced.
.EXAMPLE
 Configure-SQLMemory -instance DBASERVER01 -apply
#>

param([parameter(Mandatory=$true)][string] $instance
    , [Switch] $apply
    )

#load SMO
[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.SMO') | out-null

#strip the instance name down to the host name for the WMI call
if($instance.Contains("\")){
    $sqlhost = $instance.Split("\") | Select -First 1
}
else{
    $sqlhost = $instance
}

#set memory variables
$totalmem = (gwmi Win32_ComputerSystem -computername $sqlhost).TotalPhysicalMemory/1GB
$sqlmem = [math]::floor($totalmem)

#calculate memory, subtracting the OS reserve from each RAM range
while($totalmem -gt 0){
    if($totalmem -gt 16){
        $sqlmem -= [math]::floor(($totalmem-16)/8)
        $totalmem = 16
    }
    elseif($totalmem -gt 4){
        $sqlmem -= [math]::floor(($totalmem)/4)
        $totalmem = 4
    }
    else{
        $sqlmem -= 1
        $totalmem = 0
    }
}

#report current and new values; if -apply was passed, alter the configuration
$srv = New-Object -TypeName Microsoft.SqlServer.Management.Smo.Server $instance
"Instance: " + $instance
"Max Memory: " + $srv.Configuration.MaxServerMemory.ConfigValue/1024 + " -> " + $sqlmem
"Min Memory: " + $srv.Configuration.MinServerMemory.ConfigValue/1024 + " -> " + $sqlmem/2
if($apply){
    $srv.Configuration.MaxServerMemory.ConfigValue = $sqlmem * 1024
    $srv.Configuration.MinServerMemory.ConfigValue = $sqlmem/2 * 1024
    $srv.Configuration.Alter()
    "Configuration Complete!"
}
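For those who'd rather check the arithmetic than trace the loop, the calculation can be restated in a few lines.  This is my own Python paraphrase of the while loop above (illustrative only, not part of the script):

```python
import math

def sql_max_memory(total_gb):
    """Mirror of the script's loop: start from total physical RAM and
    subtract the OS reserve (1GB base, +1GB per 4GB up to 16GB,
    +1GB per 8GB above 16GB)."""
    sqlmem = math.floor(total_gb)
    totalmem = total_gb
    while totalmem > 0:
        if totalmem > 16:
            # 1GB reserved per 8GB of RAM above 16GB
            sqlmem -= math.floor((totalmem - 16) / 8)
            totalmem = 16
        elif totalmem > 4:
            # 1GB reserved per 4GB of RAM up to 16GB
            sqlmem -= math.floor(totalmem / 4)
            totalmem = 4
        else:
            # the initial 1GB OS reserve
            sqlmem -= 1
            totalmem = 0
    return sqlmem
```

So a 32GB box works out to a 25GB max server memory target, with min memory set to half that.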

Now it should be noted that I’m not going to tell you which of these scripts is “better”.  Neither is, really; they’re just different ways of approaching the problem.  And that’s the fun of it.  When working in technology, cats can be skinned in multiple ways, we just have to pick one.  Whether it’s T-SQL, Powershell, C#, or some other thing, the point is that you want scripts to automate your work.

What was also cool about this was it gave me another exercise to practice Powershell.  After all, practice makes perfect and when learning something, any excuse to make me use it is good.  John and I actually had a conversation shortly after I wrote the script about these sort of opportunities, and we’ll continue to look for ways to challenge each other for learning.

Edit:  Well, that was embarrassing.  Amazing how sometimes you can get the math a little screwy.  I’ve updated this script as of 9:44 PM MST to correct me having the math backwards.  If you grabbed/reviewed this script beforehand, please make sure you grab the updated version.

Why I Work With SQL Server

Hot on the heels of my NoSQL posts, I wanted to add a counterpoint to the discussion.  After all, even though I see the value of non-relational technologies, I think it’s important not to lose sight of the value relational databases offer.  In the tech world, it’s too easy to chase those squirrels of new tech (though it’s also easy to get stuck in our old patterns as well).  It always helps to take a step back and see the forest for the trees so we can choose the right path for our enterprise.

It is an understood fact that the tech world gets pretty dogmatic:  Oracle vs. SQL Server, Windows vs. Linux, Java vs. C#, etc.  People will dig their heels in about their choices and why those choices are considered “right” when, at the end of the day, each platform is simply a different approach to various higher concepts.  I tend to view most of these debates as Ford vs. Chevrolet and the only real question to answer is what tool is best for the job.

And believe me when I say that I know that this isn’t a groundbreaking opinion, but it is mine.  :)

That being said, we all have good reasons for selecting the platforms we work with.  For relational databases, it’s fairly evident that my choice is SQL Server.  Before I get into that, let’s first talk about why I lean towards relational over non-relational.  Don’t get me wrong, non-relational is an effective tool, but it’s still a very young technology.  The platforms for it are still growing and maturing, where they still are missing a lot of the reliability we’ve come to expect from our relational platforms.

Couple that with the nature of relational databases:  Joins, keys, and constraints do more for us than simply organize data, they provide functionality to implement and control business logic.  Data integrity is extremely important for many applications and a proper database design will provide you with all the rules to keep your data clean and ordered.  Just as with choosing non-relational stores, it’s a matter of choosing the appropriate tool for the job.  Sometimes that job requires tight control over your data, something that you just can’t get in a NoSQL database.

As for SQL Server as my relational platform of choice, there’s a lot of reasons why I favor it over other platforms.  It isn’t just because it’s what I’ve worked with (for the record, I’ve put some serious time in Oracle as well).  There are really three main reasons why I will promote SQL Server as the database I think people should work with.

Maturity

Let’s face it, SQL Server has been around for a while and Microsoft has had a lot of time to refine it.  Over the past 15 years I’ve worked with it, I’ve seen the addition of lots of neat features that enhance the core RDBMS offering.  At the same time, SQL Server is still a solid relational database, giving users a reliable platform for storing their data.  It’s not perfect and I’ll be the last person to tell you it is, but it certainly is on par with Oracle and PostgreSQL.

Adaptability

Microsoft has seen the writing on the wall.  Some of it is from their own hand, some of it is how the data world is evolving.  However, “the cloud”, in-memory structures, and big data are ubiquitous in today’s tech landscape.  Looking at the recent versions of SQL Server, it’s apparent that Microsoft is trying to mold the product to live in this new realm.  Consider Hekaton, the ability to span databases between Azure and on-premises, and improvements to columnstore (along with other updates).  Microsoft is making investments to keep pace with the changes we’re seeing in the larger technology world and I appreciate the vision they have for the product.

Accessibility

This is the big one for me.  The other two basically tell me that, in going with SQL Server, I’m going to have an established RDBMS platform I can rely on along with Microsoft continuing to improve things to keep pace with other products.  What sets SQL Server apart is the fact that it’s so much easier to work with, for both new folks and seasoned professionals.

First, let’s look at the fact that it’s Windows.  Now while we all lament SQL Server’s default settings, the fact is that almost anyone with minimal experience can get their own SQL Server instance up and running in short order.  This means the door is open a little wider for people wanting to get into the database world, even those who don’t have the supporting skill sets for Linux or registry hacking.  SQL Server ships with wizards and graphical tools to get folks going.  Just make sure you talk to a professional before getting too far.  :)

And that’s the second thing to talk about.  Maybe I’m biased because I’ve been involved in the SQL Server community for so long, but I’m continually amazed by the amount of free material for training and best practices provided by this community, from blogs to Twitter to a user group likely nearby where you can ask questions of people using SQL Server in your area.  It’s so easy to get started with SQL Server.

Yeah, I know I sound like a fanboy at this point (squee!).  Just so we’re on the level, I am well aware of SQL Server’s flaws.  There’s a lot of things that aren’t perfect or were added a couple versions ago but were never finished up (*cough* Management Data Warehouse).  And let’s not get into what’s in Standard Edition versus Enterprise.  Trust me, I get it.  Even with that, though, I feel that SQL Server is the preferred offering at this point for companies that are looking for a solid relational platform. 

2013 in review, 2014 in anticipation

As is my custom, I want to blog about what I did last year and my goals for the upcoming year.  Since I have more “regular” stuff to blog about, I figured I’d sneak this in on a Friday just to get it up here.  :)  The primary reason I do this is that which gets measured, gets done (hat tip Buck Woody).  2013 was a pretty eventful year, but I didn’t quite get everything I had wanted to.  So let’s look back at…

2013

Speaking – Did pretty well here.  I met all my goals as well as actually speaking at the 2013 Summit.  I love presenting and now it’s a matter of building new presentations.

Blogging – Not quite an epic fail, but this one was not a passing grade.  I basically blogged regularly through to about March, then fell back to my sporadic pace.  I didn’t even blog for about 3 months over the summer.  I’m very disappointed with myself and can do better here.

Certification – This was a success and was a lot easier than I had anticipated.  I sat down to take four of the five exams cold to evaluate where I stood and ended up passing them.  The one I had to study for (and I knew this going in) was the Data Warehousing exam (70-463).  Pretty happy here, but now it’s a matter of finding other options for certification.

I know a lot of folks have a dim view of certs.  I agree with the premise, that the exams and certifications (mostly) are no substitute for real world experience.  The fact that they can be crammed for removes some of the validity.  But at the same time, I think that they can be a valuable barometer for someone’s knowledge.  I fall in Glenn Berry’s camp here, that certifications are like extra credit.  If I see a cert on a resume, I won’t assume knowledge of the subject matter, but it at least shows me the candidate was willing to make an extra effort to improve themselves.

Secret Project – Ok, this more or less flopped.  Basically, I wanted to either write a book or build a pre-con for database monitoring because I feel there’s a real gap out there.  These are still on my slate of things I want to do, but I did not get that done in 2013.  Boo me.

And that was 2013.  I’d give myself a B for the year, mostly because while I whiffed on some of my goals, I also managed to go beyond on a few things.  In addition, I’ve made some career choices and moved things in that direction.  Which now brings me to….

2014

Several of the goals will be maintenance or the same.  I want to maintain on the presenting, so we’re looking at at least 4 SQL Saturdays and a Summit submission.  Blogging needs to be once a week.  This is a real stretch goal, because it’s harder than it looks, but I need to strive for it.  These two items are the base foundation my career grows from, so I need to keep that foundation healthy.

New stuff for 2014 will be:

  • I’m trying to move my career more towards being a data architect.  So for that goal, I will write more posts on general data concepts and higher level planning.  There will still be plenty of administrative stuff, but I need to broaden my blogging scope.

  • I also need to start reading more on cloud computing, modelling, and larger design concepts.  Right now I’m reading Domain Driven Design and my goal is to read 1 book a month on a related topic.

  • I will identify and start working towards a new certification path for design and architecture.  Not sure what this is, but I’ll have something identified by the end of March and start working on it the rest of the year.

2013 was a year of change in my life.  Not earth shattering change, not seismic shifts, but a definite redirection in my aims.  2014 will be solidifying my plan and starting down that path.  My biggest challenge will be sorting out the things that are new and uncomfortable from those that are the wrong direction.  The question I will continue to ask is “Does this move me in the direction I want to go?”.

 

Divide and Conquer

So far, I’ve talked about NoSQL’s use of key-value pairs for storage along with its partitioned nature.  These two concepts are useful, but you’ve probably noticed that they can make querying a lot more difficult, particularly for aggregating data.  That’s where the third and final concept that I want to discuss comes into play, the process that non-relational data stores use to analyze all the data they have stored across their nodes: MapReduce.

At the core, MapReduce isn’t some wild concept.  The idea is to marshal all of our compute resources to analyze and aggregate data.  Simply put, there are two phases that a non-relational system will use to aggregate data:

  • Map – On each individual node, collect and aggregate the data that matches our predicate.  Pass this to a higher node.

  • Reduce – Aggregate on a “master” node the values passed from the individual nodes.  Once the work from all the individual nodes has been aggregated, we have our answer.

As with most of this NoSQL stuff, it is not rocket science.  It does, however, make more sense if we visualize it.  Let’s say, for example, we wanted a count of all our sales orders within a sales territory and we had a 3-node NoSQL cluster (platform is irrelevant).  A basic MapReduce tree would look something like this:

[Diagram: a basic MapReduce tree for the three-node count]
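To make the two phases concrete without committing to any particular platform, here's a toy Python sketch of that same territory count.  The node data and function names are invented purely for illustration:

```python
# each "node" holds its own slice of the sales orders
nodes = [
    [{"order_id": 1, "territory": 3}, {"order_id": 2, "territory": 1}],
    [{"order_id": 3, "territory": 3}, {"order_id": 4, "territory": 3}],
    [{"order_id": 5, "territory": 2}],
]

def map_phase(node_data, territory):
    # each node counts only its own local rows that match the predicate
    return sum(1 for row in node_data if row["territory"] == territory)

def reduce_phase(partial_counts):
    # the "master" node aggregates the per-node partial results
    return sum(partial_counts)

partials = [map_phase(node, territory=3) for node in nodes]
total = reduce_phase(partials)   # 1 + 2 + 0 = 3
```

Each node does its own small piece of work in parallel, and only the tiny partial counts travel over the network to be reduced.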

This isn’t all that much different than how SQL Server (or another RDBMS, for that matter) would process it.  If we take the same concept and apply to a basic query from AdventureWorks, we can see the SQL approach to it:

select count(1) from sales.SalesOrderHeader
where TerritoryID = 3

[Diagram: the SQL Server query plan for this aggregate query]

You can see from this basic query plan that the only real difference here is that the mapping piece happens on a single node (your instance) rather than multiple nodes.  But otherwise, the process of gathering all the data we need to calculate and then calculating our total aggregate is the same.  It’s the idea of scaling this out over multiple compute nodes that is the major difference.

And scaling is huge.  3 nodes is a piece of cake, but what if we had 10?  Or 30?  Or 100?  Again, the idea is that non-relational data stores can handle terabytes and petabytes of data and you can scale out horizontally to meet those needs.  When we do this, our MapReduce tree can scale to our needs, so we might get something like this:

[Diagram: a MapReduce tree with multiple reduce phases]

Notice the multiple reduce steps.  Just as we can divide and conquer horizontally and collect data across that scale, we can also do vertical scaling and have multiple reduce phases to further spread the load.  It’s all part of the massively parallel processing model that NoSQL promotes.

There are disadvantages to this, but most of them boil down to the implementation.  Primarily, it’s the need in many NoSQL platforms to write your own MapReduce functions using APIs or embedded languages (like JavaScript or platform-specific functions).  Many platforms are working on SQL-like alternatives to ease this learning curve, but there’s still a lot of growth and maturity needed here and I’m not sure how standardized any of these options really are.

Of course, these MapReduce queries aren’t necessarily going to smoke an RDBMS in terms of performance.  It’s all a matter of design and how you apply the tool.  MapReduce processes are designed, primarily, for batch aggregation of massive amounts of data.  Quantities of data that may not necessarily make sense to store in an RDBMS structure.  As we always say in the database admin world, it depends.  The goal here is to understand how this query process works so we can properly apply it to our own implementations.

Wrapping Up

Hopefully this three-parter has been helpful and educational for you.  I’ve certainly learned a lot about non-relational databases over the past 6 months and my eyes have been opened to new ways to manage data.  That’s really what we’re all here for, anyway.  As the world has more and more data to manage, we as data professionals need to find ways to store it, manage it, and make it useful to those who need it.  NoSQL is just another tool in our toolbox, a hammer to go with our RDBMS screwdriver.  Both valuable, if we use them appropriately.

Before I move on, I did want to share some resources I’ve found helpful in getting my head around non-relational databases:

  • Seven Databases in Seven Weeks - I’m actually not done with this, but it’s a great book for getting your feet wet with several different types of databases.  Be warned, this is not for the faint of heart and there’s a LOT you’re going to have to figure out on your own, but it will guide you through some basic concepts and use cases.

  • Martin Fowler’s Introduction to NoSQL – This is a great hour long presentation that talks about a lot of what I’ve covered here from a different angle.  It’s got great information and explains things at a good level.

  • Ebay’s Tech Blog on Cassandra Data Modelling – Getting my head around how you model data in a Big Table structure was really, really hard.  This two parter will help you figure this out.

I also want to thank Buck Woody (@BuckWoody) for opening my eyes to the world of the cloud and big(ger) data.  It was partly from a conversation with him that I went down this path and the learning that has come out of it for me has been tremendous.

Sharing the Load

Businesses have always struggled with the high cost of relational database servers, both from a hardware and licensing perspective, because current RDBMS platforms take the approach that you will have one or two powerhouse servers handling your data workload.  While this has been the commonly accepted practice, the evolution of the cloud and better distributed computing architectures have given rise to new approaches for data storage and processing.  Now businesses are rethinking their approach to infrastructure with these new approaches, finding ways to provide better performance and higher availability all at a lower price point.

Non-relational data stores embrace this new approach with how they cluster.  Designed in response to the single, super-pricey database server, non-relational systems take the approach of clustering many inexpensive servers into one database.  This is sold as having many advantages to businesses, such as:

  • Horizontal scalability, allowing business to better adapt to growing workloads.  Need more power? Just add another commodity hardware server, you’re not trapped by the box you bought last year.

  • Distributed data across many machines, making the database more available and increasing the durability against failure.  Think of it as RAID for your data, so that if one box goes down, the system is still up and you just need to replace the one failed box.

  • Costs are lower and more manageable.  Well, at least more manageable, where you can buy only the computing power you need and grow incrementally over time.  However, there’s a lot of factors (from virtualization to overall hardware costs) that make the total price point versus a beefy relational server fairly comparable.

It’s certainly a departure from how we’ve dealt with relational platforms to date (though I’m super intrigued by Google’s F1 database).  There are some definite advantages, but most tech people look at this and say “Well, that sounds fancy, but does it blend?”  And by blend, we mean work.  The answer has actually been around for some time now and many large businesses already use it in their relational databases:  partitioning.

Partitioning is a concept we’ve had in the relational world for some time now (and feel free to look at my partitioning posts for more info).  Even before the major RDBMS platforms introduced their own partitioning functionality, DBAs would partition data out into different tables and bring them together as a view (commonly known as partitioned views).  What the non-relational folks do is take it a step further, saying that instead of just partitioning your data across different disks, why not distribute it across different computers?  By doing that, businesses not only get to spread the workload across all these computers, but their data availability is no longer dependent on a single OS/application instance.

Of course, what this means is that non-relational data stores still need some way to determine how data is partitioned.  Remember last week when I talked about all non-relational stores being key/value stores?  Well, those keys become even more important because they become the partitioning key (also called a shard key, by some), the value that non-relational platforms use to divvy up the data between its partitions.  Now, most of these datastores will analyze the key values and attempt to balance that data as evenly as possible, but it’s something you need to be aware of when modelling out your data.  A poorly chosen partition key could mean you’re negating your cluster advantage because your data will end up in one or two nodes.  We’ll come back to this in a bit.
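To see why key choice matters so much, here's a small hash-partitioning sketch in Python (illustrative only; real platforms use their own hashing and balancing).  A high-cardinality key like a unique order ID spreads rows evenly, while a low-cardinality key like a two-value status flag strands a node with nothing to do:

```python
from collections import Counter

NODES = 3

def assign_node(partition_key):
    # hash the partition key to pick a node (simple modulo placement)
    return hash(partition_key) % NODES

orders = range(1000)  # 1,000 hypothetical order IDs

# good key: the unique order ID spreads rows evenly across all nodes
by_id = Counter(assign_node(order_id) for order_id in orders)

# bad key: a two-value "status" column can only ever land on two nodes
by_status = Counter(assign_node(order_id % 2) for order_id in orders)
```

With the order ID, every node carries roughly a third of the rows; with the status flag, two nodes split the entire workload and the third sits idle.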

At this point we should talk a bit about the CAP Theorem.  This bit of computer science was introduced back in 2000 and conceptually defines the limitations of a distributed computing system.  The gist of it is that, while we’re trying to accomplish all these wonderful things, any clustered platform cannot guarantee:

  • Consistency (that all the nodes see the same data at the same time)

  • Availability (that every request to the system receives a response)

  • Partition tolerance (that the system will remain up despite losing communication between some of its nodes)

Commonly, you’ll hear people say that you can only have two out of these three things at any time.  What this means to non-relational platforms is that, since they’re all designed to be partition tolerant with their clustered nodes, we have to choose between being consistent or available.

This is where eventual consistency comes in to play.  See, the problem with our clusters is that we still have the laws of physics involved.  When data is written and that data is distributed across multiple nodes, then there is a matter of time where data has to be communicated across a network and written to disks in different computers.  Most non-relational systems take the availability path here, meaning that data will be offered up from the various nodes, even if it’s not all up to date.  The idea is that it will be eventually consistent, but you might have to wait a bit.
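A tiny simulation makes the trade-off visible.  In this hypothetical Python sketch (a toy model, not any vendor's replication protocol), a write is acknowledged by one node immediately but only reaches the replicas when replication catches up, so a read in between can return stale data:

```python
class EventuallyConsistentStore:
    """Toy model: writes land on the primary and queue for the replicas."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]
        self.pending = []  # replication backlog: (node_index, key, value)

    def write(self, key, value):
        self.nodes[0][key] = value  # the primary acknowledges right away
        for i in range(1, len(self.nodes)):
            self.pending.append((i, key, value))

    def read(self, key, node_index):
        # an "available" read: answer from whatever this node has right now
        return self.nodes[node_index].get(key)

    def replicate(self):
        # eventually, the rest of the cluster catches up
        for i, key, value in self.pending:
            self.nodes[i][key] = value
        self.pending = []
```

A read from node 2 right after a write returns nothing until replicate() runs; that window is the "eventual" in eventual consistency.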

Now our relational brains might kick back on us.  “Shouldn’t our data be 100% accurate all the time!?!?”  ACID compliance, right?  It depends!  Once again, we have to ask ourselves what data is being stored in our database.  Much of this detail only needs to be accurate at a point in time within an application; once it gets into the database, we have a little more time before we have to deal with it.  If not, you might consider using a relational database for your data.  It’s all about using the right tool for the job.

One big concern coming out of this distributed model is the question of backups.  Non-relational folks will tell you that backups for disaster recovery aren’t really necessary because you have it built in.  However, most DBAs I know (and I’m one of them) will tell you that disaster recovery and high availability are not the same thing.  And because of the distributed nature, backups become really complex.  So you have to ask yourself when designing or implementing one of these solutions, how will you handle this?  In some cases, it might be a simple matter of backing up the files, but my research so far has shown that this requires an operating system file lock in order to keep things consistent.  Which means you stop your application (i.e., no hot database backups).  There might be other alternatives, either available or in development (the vendors of these platforms recognize the gap), but be aware that it will be a question you have to address.

The key takeaway for this post is that the distributed nature of non-relational systems is a huge advantage for them, but it has some hurdles to overcome.  It’s simply not a magic bullet to all your data problems.  Again, the theme here is using the appropriate tools for the appropriate purposes.  Recognizing the strengths and weakness is key to knowing when you should use a tool and, because NoSQL is a hot trend, it’s vital to understand it in order to properly guide your company’s conversations around the technology.

Next up, I want to talk about how non-relational systems leverage this dispersed architecture for querying and analyzing data.  It’s a strong offering, but not always fully understood.  Hopefully these last two posts are getting you interested in this realm of data management, but remember I’m only touching the surface here.  NoSQL is as deep (though not nearly as mature) as the RDBMS world and there’s a lot more learning than I can provide here.

Bags and Shelves

NoSQL.

Yeah, a lot of people think it’s a dirty word.  Certainly, many of the DBAs I’ve talked to look down their noses at NoSQL and scoff at the idea of a non-relational database.  Full disclosure, I’ve counted myself amongst that number for a long while, at least until recently.

I’ve had heavy involvement at my current gig with non-relational data stores recently.  This has forced me to adapt and “break” the mold on how I think about data and data management.  It has been an interesting couple of months with lots of great learning (and kicking rust off of some of my lesser used skill sets) and, while I’m nowhere close to being an expert in non-relational data systems, I’ve definitely come around to their benefits and wanted to share my thoughts so as to hopefully help others bridge the gap.

The first thing to keep in mind is that we’re still managing data.  We’re storing information; it’s just an approach that is foreign to many of us trained with years of relational theory.  The bonus, though, is that this is a far simpler method than we’re used to.  Non-relational data stores center on storing key/value pairs: simple structured arrays that have some ID and some value attached to it.  Ask any developer about arrays and the concept is crystal clear, but for those of us used to talking about keys and constraints, it sounds messy and disorganized.  It’s a common disconnect between the worlds of code and database development, known as the infamous Object-Relational Impedance Mismatch problem.

What’s cool about a key/value pair approach is that we get a lot of flexibility around what we can store, depending on the system.  It could be columns with different data, but it could also be a looser, more flexible structure stored (typically) as a JSON document.  Other stores will allow your values to be PDFs, images, or other BLOB-type items.  For developers, these two characteristics make non-relational platforms very attractive, because they are intuitive to work with and don’t tie them down with a lot of rules about how they manage data.
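To make the contrast concrete, here is a minimal sketch of schemaless key/value storage modeled with a Python dict.  The `put`/`get` names and the record shapes are purely illustrative, not any particular NoSQL platform’s API.

```python
# A toy key/value store: each key maps to a "document" (a plain dict here),
# and documents are not required to share a structure.
store = {}

def put(key, value):
    # No schema enforcement: any shape of value is accepted.
    store[key] = value

def get(key):
    return store.get(key)

# Two records with entirely different "columns" live side by side.
put("order:1001", {"customer": "foo", "items": ["vindaloo"], "total": 12.50})
put("user:42", {"name": "bar", "signup_year": 2013})

print(get("order:1001")["total"])   # 12.5
print("items" in get("user:42"))    # False: no shared schema to speak of
```

In an RDBMS, those two records would demand two tables with defined columns before the first row could land; here, the store happily accepts both.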

“But wait!” you cry in your strongest E.F. Codd voice, “I need joins to enforce data integrity!”  After all, this flexibility is great, but it can also cause our data to sprawl rampantly, generating a management nightmare.  Bad values, improperly stored information: where does it all end?  This is before we even talk about the kind of performance you would get trying to query anything useful from this data.  And these are all valid concerns…

…if we really were concerned about them.

This is the point where I had to take a step back from my relational view and really think about the data.  We live in a data-driven world, where everything (HELLO NSA) is tracked and recorded.  But how much of this is useful?  Do people really care if I made a carry-out order for Chinese food last Tuesday at 7:47 PM MDT, or that I made (on average) 10 orders a month from the same Indian restaurant (mmmm….lamb vindaloo)?  These days, we don’t care so much about the detail, but instead want to analyze trends and patterns in the data as a whole.  We still need the detail because it’s what forms the trends, but because of the sheer volume and variety of this detail, there comes a point where handling it in a traditional relational manner is inefficient.  And we don’t need to query the detail fast; we just need to be able to query it periodically to build aggregates for our trend analysis and reporting.
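As a sketch of that detail-to-trend rollup (the order timestamps below are made up, not real data), a periodic aggregation into per-month counts is often all the querying the detail ever has to support:

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw detail: one timestamp per carry-out order.
orders = [
    datetime(2014, 6, 3, 19, 47),
    datetime(2014, 6, 17, 18, 5),
    datetime(2014, 7, 2, 12, 30),
]

# Periodic rollup: nobody asks about the 7:47 PM order itself, but
# orders-per-month is the trend we actually report on.
per_month = Counter((o.year, o.month) for o in orders)
print(per_month[(2014, 6)])  # 2 orders in June
```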

This is why I boil things down into two generic categories, which help me understand usefulness and suitability.  What we really have are Bags and Shelves.  Duh, right?  It’s not rocket science, but I think using these analogies helps us understand data management a little better.  If we think about non-relational data stores as a bag, it becomes simple.  A bag is flexible, can expand (to a certain point), and has very few restrictions on what you put in it.  However, sometimes getting things out or managing the contents is cumbersome because it’s not very well organized.  Compare this to our nicely organized set of RDBMS shelves.  Everything is neatly classified, well organized, and easy to find.  However, we’re limited on what we can put on our shelves (did you space everything far enough apart or build it big enough?) and, before we can put anything on those shelves, we have to take time to sort things out.

Thinking about data in this manner really helped me understand that these are two different tools for two different purposes.  Do you have an application that is focused on displaying data, running reports, and doing analysis?  You’re probably best off with an RDBMS, where all the data is sorted and organized, with effective indexing and constraints.  If your application is focused on taking in a lot of information, a non-relational platform might be more suited to your needs, since you don’t have to expend the organization effort that would stand in the way of processing your data.

This sort of thinking is why I’m becoming a fan of Martin Fowler’s (@martinfowler) idea of Polyglot Persistence.  Again, nothing earth shattering, but the idea is that we have different tools to solve different problems, so we should use those tools appropriately.  This, unfortunately, is where I think the CTO-idea-of-the-month club gets it wrong, because they want to sell you NoSQL as a replacement, not as a complement.  For whatever reason, people cling to the “here’s my hammer and every problem is a nail” mentality, even when it comes to data management.

And this is where we come in.  As data professionals, we need to embrace non-relational data stores as a new tool in our toolbox, not pooh-pooh them as a fad.  There’s some real value there, value I will talk more about in upcoming blog posts.  As I often say about pop artists, they must be doing something right considering how successful they’ve been.  We need to key in on what that right-ness is, understand it, and embrace it so we can guide organizations in effectively managing their data.

Data.  It’s all about the data, and as data professionals our most important job is managing the data.

As you can probably tell, I’ve only scratched the surface here.  There’s so much to the world of non-relational data stores, I can’t get all my thoughts out in one post.  In the next post, I’ll share with you what I’ve learned about the much-touted horizontal scalability and some of the concepts wrapped up in that.

Lines In The Sand

When working with databases, the question of performance always comes up.  Whether it’s a new build or an existing setup, we need some sort of radar gun to measure our instances to make sure they’re up to snuff.  The idea is simple:  We need some sort of benchmark that tells us, at least in a relative sense, how well our systems are performing.

Of course, the tools and approaches available to us seem as numerous as stars in the sky.  Want to measure storage performance?  Take a look at SQLIO or IOmeter.  Need a full end-to-end testing suite?  Paid tools like LoadRunner and Benchmark Factory can help you.  Anything else?  A quick Google search will provide plenty of options.  So where do you start?

For my part, I have become a big fan of HammerDB.  Discovered via Kendra Little (@Kendra_Little), HammerDB is a handy tool that allows you to run a TPC-C benchmark against your instances (and not just SQL Server!).  You can specify several different factors to customize your testing, yet the tool has an easy-to-use interface.  It’s also very well documented, with some good instruction on how to set up your own load tests.

While the HammerDB documentation and Kendra’s post are fairly comprehensive, I did want to share a little on how I set up and run my own HammerDB tests.  This is mostly because there are a couple of gotchas that aren’t covered (after all, the guy who wrote it is an Oracle dude, so I’ll cut him some slack).  First off, let’s talk about the initial schema build.  We can let HammerDB create everything from scratch, but because it uses all of SQL Server’s defaults for that, we get a poorly configured database.  When I am setting up a test (approximately 100 warehouses), I will specifically create an empty database with my files right-sized and in the SIMPLE recovery model, and then use HammerDB to create the schema within that database:

create database tpcc
on primary (name=tpcc_data01, filename='c:\dbdata\tpcc_data01.mdf',size=10gb,maxsize=unlimited,filegrowth=1gb)
log on (name=tpcc_log, filename='c:\dbdata\tpcc_log.ldf',size=4gb,maxsize=unlimited,filegrowth=1gb);

alter database tpcc set recovery simple;

I use SIMPLE recovery primarily so I don’t have to bother with log backups on the benchmark database.  Keep in mind this doesn’t actually affect how transactions are processed, only how they’re removed from the log file.  By properly sizing the log file, we avoid creating too many VLFs, which could adversely affect our numbers.

At this point, I’ll kick off the schema build, usually with 10 concurrent users.  It still could take a while to build, so patience is key.  I have had issues where the tool sometimes flubs on finalizing the schema (when creating the indexes and stored procedures), so if you run into that you can find them in a SQL script here.

Once the schema is in place, it’s a simple matter to run the test.  We can run an ad hoc test easily enough, but for a solid benchmark we need to use HammerDB’s autopilot functionality.  The autopilot will run sequential timed tests using different concurrent user threads to iteratively increase the load.

(Screenshot: HammerDB Driver Script options)

The first thing to do is make sure, under the Driver Script options, we’ve selected “Timed Test Driver Script”.  Then I tweak a couple things here, mostly around the length of each test.  HammerDB defaults to a ramp-up time of 2 minutes, followed by a 5 minute test.  Keep in mind that having a ramp-up time is very important, because it allows the system to “warm the cache” and gives you more consistent numbers.  That being said, I adjust these values, going with a ramp-up of 1 minute and a 5 minute test.  Our schema is not tremendously big (10 GB), so this is usually enough.  It doesn’t hurt to have larger values if you want, but 1 and 5 suit my purposes.  Once this is all set, I load the script and I’m ready to move on to the autopilot settings.

(Screenshot: HammerDB Autopilot options)

Next, I need to enable the autopilot.  I’ll leave the concurrent user batches at their defaults, as they give you a nice battery of threads for usage.  However, this is where the first gotcha is.  When setting the time for each test, this has to exceed the total of the ramp-up time and the test time, plus some time for setting up and tearing down each test.  Since I go with settings that give me a total of 6 minutes for a single test run, I’ll set the test time in the autopilot settings to 8 minutes to give the application time to do all the other setup work it needs.  Having this buffer time doesn’t affect anything with the test itself, but if you cut it too short, the application will essentially discard your results and move on to the next test.  As a nicety, I like to check “Show Virtual User Output” so I can sanity check that everything is running OK during the test.  The info isn’t very informative (like watching the Matrix code), but you’ll at least see errors in the run.

For the tests themselves, I’ll actually run the full autopilot test batch three different times.  I should get results that are relatively the same, but I do want to measure for consistency.  Then I chart out the results, looking at three numbers:

  • Concurrent users (note this is the number in HammerDB minus 1, because HammerDB has one user acting as a controller)
  • Transactions Per Minute (TPM)
  • Transactions Per Minute Per User (TPM/Concurrent Users)

Naturally, we should see the TPM increase as the number of connections rise.  What I’m particularly looking for is to see the behavior curve for TPM/user, which should remain steady.
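Charting those three numbers can be sketched as follows (the result values below are invented for illustration; HammerDB reports the TPM figure per timed test):

```python
# Each tuple is (HammerDB virtual users, reported TPM) for one autopilot test.
# These values are made up purely to show the calculation.
raw_results = [(3, 24000), (5, 44000), (9, 72000)]

for threads, tpm in raw_results:
    users = threads - 1          # one HammerDB user acts as the controller
    tpm_per_user = tpm / users   # a healthy system holds this roughly steady
    print(f"{users} users: {tpm} TPM, {tpm_per_user:.0f} TPM/user")
```

A declining TPM/user curve, as in my laptop run below, tells you the system has started to saturate.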

(Chart: TPM and TPM/user results from a sample run)

If we take a look at this sample from a run on my laptop, you can see what I mean.  It is, of course, not a “good” thing that the average TPM/user drops as connections increase, but this is expected, as my poor little laptop isn’t made for enterprise OLTP.  We can at least see that my laptop could handle two concurrent users before falling over on its face.

Some final caveats to keep in mind.  The first is to remember that these numbers are relative.  There’s no “good” value for this; it’s all related to how your systems are performing.  The second caveat is that this is an unofficial TPC-C result.  Vendors and the TPC will only certify something as official under very controlled circumstances, mostly because it’s their reputation on the line.  Finally, these values shouldn’t stand in a vacuum; measure them alongside other metrics (e.g., CPU, Page Life Expectancy).
That being said, I find HammerDB an effective tool for giving me a general idea of system performance, using a standard that is application agnostic, thorough, and has a proven track record.

The Moving Van

A problem DBAs commonly face is moving database files around, especially when we’re trying to manage space on a drive.  I know that I’ve had to do it and, when you have a database instance with hundreds of databases, you need an automated way to manage this.  Recently, someone on DBA Stackexchange had this specific problem and I was able to provide a Powershell script that I’ve used in the past to manage this problem.  And when I haven’t blogged for a while, a good way back into it is to share a Powershell script.  Mostly because it’s easy.  :)

First, let me talk about a pattern I commonly use.  A lot of times, I could write a single script to take an action, but instead I build a script that writes a script.  The main reason I do this is that it’s very hard, especially with many of our administrative tasks, to write a one-size-fits-all process.  So instead, I have a tool that gets me 90% of the way, and I can then tweak or alter the generated script to address the final 10% to fit a specific need.

That’s what this script is doing.  It doesn’t perform the actual file moves itself, but writes out a script that will execute those file moves.  It’s fairly generic, creating a script that moves all the database files from their current location (wherever that may be) to two common directories: data and log.  It’s rare that you would move all your database files in this fashion, but you can edit the generated script to suit your specific situation.

This script makes use of the techniques detailed here by Ben Miller (@dbaduck), with one significant exception.  I’ve found that the .Offline() method for databases is very unreliable because you’re basically executing your normal “ALTER DATABASE [foo] SET OFFLINE;”.  The main problem with this command is that it has to wait for all connections to close gracefully before it can proceed, something that doesn’t happen too frequently.  I prefer to use “ALTER DATABASE [foo] SET OFFLINE WITH ROLLBACK IMMEDIATE;” because it forces all connections to close and open transactions to roll back, no waiting involved.  Since the SMO methods don’t have an option for that, I call the SQL using the Invoke-Sqlcmd cmdlet.  I find this is a much cleaner option for my script.

Specifically, the script accepts 4 parameters:

  • $newdata – New data file path

  • $newlog – New log file path

  • $instance – SQL instance you want to move files for (defaults to local)

  • $outputfile – Output file name (defaults to Documents, FileMover.ps1)

When you run it, it will spit out a FileMover.ps1 script to your Documents folder (though, using that last parameter, you can change the output location and name).  This FileMover.ps1 will be the script you can then edit to suit your specific needs.  The script is below, but you can also download it here.

<#
.SYNOPSIS
   Builds a Powershell script for moving many database files from one
	 set of directories to another.
.DESCRIPTION
   Used to generate a Powershell script for moving many database files
	 from one location to another.  This will typically be used with a
	 large number of databases that need to be relocated.  Caveats for
	 this script include:
	 		-Destination for all files will be the same.
			-User that runs the script must have access to source and destination locations
			-This uses the 2008 R2 version of the SMO.

	The script generates a FileMover.ps1 script (by default in My
	Documents).  The reason for generating a separate script is so
	specific configurations can be edited/updated before actually
	executing the move.

.PARAMETER <paramName>
   instance - Instance owning the databases to be moved
	 newdata - New data file location, no trailing '\'.  example: "C:\DBFiles\Data"
	 newlog - New log file location, no trailing '\'.  example: "C:\DBFiles\Log"
	 outputfile - Full path and name of output file.  By default, FileMover.ps1 in My Documents.

.EXAMPLE
   .\Build-FileMover.ps1 -newdata "C:\DBFiles\Data" -newlog "C:\DBFiles\Log"
#>

param([parameter(Mandatory=$true)][string] $newdata,
			[parameter(Mandatory=$true)][string] $newlog,
			[string] $instance="localhost",
      [string] $outputfile=([Environment]::GetFolderPath("MyDocuments"))+"`\FileMover.ps1")

#load SMO
Add-PSSnapin SqlServerCmdletSnapin100
Add-PSSnapin SqlServerProviderSnapin100
[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.SMO') | out-null

#Create server object
$server = New-Object -TypeName Microsoft.SqlServer.Management.Smo.Server $instance

#get your databases
$db_list=$server.Databases

#build initial script components
"Add-PSSnapin SqlServerCmdletSnapin100" > $outputfile
"Add-PSSnapin SqlServerProviderSnapin100" >> $outputfile
"[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.SMO') | out-null" >> $outputfile
"`$server = New-Object -TypeName Microsoft.SqlServer.Management.Smo.Server `"$instance`"" >> $outputfile

foreach($db_build in $db_list)
{
	#only process user databases
	if(!($db_build.IsSystemObject))
	{
		#script out all the file moves
		"#----------------------------------------------------------------------" >> $outputfile
		"`$db=`$server.Databases[`""+$db_build.Name+"`"]" >> $outputfile

		$dbchange = @()
		$robocpy =@()
		foreach ($fg in $db_build.Filegroups)
		{
			foreach($file in $fg.Files)
			{
				$shortfile=$file.Filename.Substring($file.Filename.LastIndexOf('\')+1)
				$oldloc=$file.Filename.Substring(0,$file.Filename.LastIndexOf('\'))
				$dbchange+="`$db.FileGroups[`""+$fg.Name+"`"].Files[`""+$file.Name+"`"].Filename=`"$newdata`\"+$shortfile+"`""
				$robocpy+="ROBOCOPY `"$oldloc`" `"$newdata`" $shortfile /copyall /mov"

			}
		}

		foreach($logfile in $db_build.LogFiles)
		{
			$shortfile=$logfile.Filename.Substring($logfile.Filename.LastIndexOf('\')+1)
			$oldloc=$logfile.Filename.Substring(0,$logfile.Filename.LastIndexOf('\'))
			$dbchange+="`$db.LogFiles[`""+$logfile.Name+"`"].Filename=`"$newlog`\"+$shortfile+"`""
			$robocpy+="ROBOCOPY `"$oldloc`" `"$newlog`" $shortfile /copyall /mov"
		}

		$dbchange+="`$db.Alter()"
		$dbchange+="Invoke-Sqlcmd -Query `"ALTER DATABASE ["+$db_build.Name+"] SET OFFLINE WITH ROLLBACK IMMEDIATE;`" -ServerInstance `"$instance`" -Database `"master`""

		$dbchange >> $outputfile
		$robocpy >> $outputfile

		"Invoke-Sqlcmd -Query `"ALTER DATABASE ["+$db_build.Name+"] SET ONLINE;`" -ServerInstance `"$instance`" -Database `"master`""  >> $outputfile
	}
}