

Divide and Conquer

So far, I’ve talked about NoSQL’s use of key-value pairs for storage along with its partitioned nature.  These two concepts are useful, but you’ve probably noticed that they can make querying a lot more difficult, particularly for aggregating data.  That’s where the third and final concept I want to discuss comes into play: MapReduce, the process that non-relational data stores use to analyze all the data they have stored across their nodes.

At its core, MapReduce isn’t some wild concept.  The idea is to marshal all of our compute resources to analyze and aggregate data.  Simply put, there are two phases that a non-relational system will use to aggregate data:

  • Map – On each individual node, collect and aggregate the data that matches our predicate.  Pass this to a higher node.

  • Reduce – On a “master” node, aggregate the values passed up from the individual nodes.  Once the work from all the individual nodes has been aggregated, we have our answer.

As with most of this NoSQL stuff, it is not rocket science.  It does, however, make more sense if we visualize it.  Let’s say, for example, we wanted a count of all our sales orders within a sales territory and we had a 3-node NoSQL cluster (platform is irrelevant).  A basic MapReduce tree would look something like this:


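To see the flow in familiar terms, here’s a rough sketch of the two phases written as T-SQL.  The Node1/Node2/Node3 names are hypothetical stand-ins for the slice of data each cluster node holds; a real NoSQL platform would express these steps through its own API rather than SQL:

```sql
-- Hypothetical sketch of MapReduce for the sales order count.
-- Map phase: each node computes a partial count over its own data.
-- Reduce phase: the master node sums the partial counts.
-- Node1/Node2/Node3 are made-up names, not real objects.
SELECT SUM(PartialCount) AS TotalOrders          -- reduce phase
FROM (
    SELECT COUNT(1) AS PartialCount
    FROM Node1.Sales.SalesOrderHeader
    WHERE TerritoryID = 3                        -- map on node 1
    UNION ALL
    SELECT COUNT(1)
    FROM Node2.Sales.SalesOrderHeader
    WHERE TerritoryID = 3                        -- map on node 2
    UNION ALL
    SELECT COUNT(1)
    FROM Node3.Sales.SalesOrderHeader
    WHERE TerritoryID = 3                        -- map on node 3
) AS MappedResults;
```

The inner queries are the map phase running in parallel on each node; the outer SUM is the reduce phase collecting their answers.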
This isn’t all that much different from how SQL Server (or another RDBMS, for that matter) would process it.  If we take the same concept and apply it to a basic query from AdventureWorks, we can see the SQL approach to it:

SELECT COUNT(1) FROM Sales.SalesOrderHeader
WHERE TerritoryID = 3


You can see from this basic query plan that the only real difference here is that the mapping piece happens on a single node (your instance) rather than multiple nodes.  But otherwise, the process of gathering all the data we need to calculate and then calculating our total aggregate is the same.  It’s the idea of scaling this out over multiple compute nodes that is the major difference.

And scaling is huge.  Three nodes are a piece of cake, but what if we had 10?  Or 30?  Or 100?  Again, the idea is that non-relational data stores can handle terabytes and petabytes of data and that you can scale out horizontally to meet those needs.  When we do this, our MapReduce tree can scale with us, so we might get something like this:


Notice the multiple reduce steps.  Just as we can divide and conquer horizontally and collect data across that scale, we can also scale vertically and have multiple reduce phases to further spread the load.  It’s all part of the massively parallel processing model that NoSQL promotes.

There are disadvantages to this, but most of them boil down to the implementation.  Primarily, they stem from the need in many NoSQL platforms to write your own MapReduce functions using APIs or embedded languages (like JavaScript or platform-specific functions).  Many platforms are working on SQL-like alternatives to ease this learning curve, but there’s still a lot of growth and maturity needed here, and I’m not sure how standardized any of these options really are.

Of course, these MapReduce queries aren’t necessarily going to smoke an RDBMS in terms of performance.  It’s all a matter of design and how you apply the tool.  MapReduce processes are designed, primarily, for batch aggregation of massive amounts of data, quantities that may not necessarily make sense to store in an RDBMS structure.  As we always say in the database admin world, it depends.  The goal here is to understand how this query process works so we can properly apply it to our own implementations.

Wrapping Up

Hopefully this three-parter has been helpful and educational for you.  I’ve certainly learned a lot about non-relational databases over the past 6 months and my eyes have been opened to new ways to manage data.  That’s really what we’re all here for, anyway.  As the world has more and more data to manage, we as data professionals need to find ways to store it, manage it, and make it useful to those who need it.  NoSQL is just another tool in our toolbox, a hammer to go with our RDBMS screwdriver.  Both are valuable, if we use them appropriately.

Before I move on, I did want to share some resources I’ve found helpful in getting my head around non-relational databases:

  • Seven Databases in Seven Weeks – I’m actually not done with this, but it’s a great book for getting your feet wet with several different types of databases.  Be warned, this is not for the faint of heart and there’s a LOT you’re going to have to figure out on your own, but it will guide you through some basic concepts and use cases.

  • Martin Fowler’s Introduction to NoSQL – This is a great hour-long presentation that talks about a lot of what I’ve covered here from a different angle.  It’s got great information and explains things at a good level.

  • Ebay’s Tech Blog on Cassandra Data Modelling – Getting my head around how you model data in a Big Table structure was really, really hard.  This two-parter will help you figure it out.

I also want to thank Buck Woody (@BuckWoody) for opening my eyes to the world of the cloud and big(ger) data.  It was partly from a conversation with him that I went down this path, and the learning that has come out of it for me has been tremendous.

Parallelism, SQL Server, and you: Round 2

Previously on Art of the DBA: The basics of parallelism.

Control! You must learn control!

So now that we’ve gotten the basics of parallelism under our belt, it’s time to see how SQL Server uses it. The bad news is that most of the decision making about using parallelism is made by the black box known as the query optimizer. However, there are two parameters in the SQL engine that we can manipulate to control whether or not a query will be parallelized, both managed in the server configuration.

The first option we can adjust is the cost threshold for parallelism.  Generally speaking, this is the value that SQL Server looks at when evaluating the cost of a query. If the query’s estimated cost exceeds that threshold, it is eligible to be parallelized. Otherwise, it will be run as a single-threaded operation. Query cost is expressed in units originally calibrated to the estimated number of seconds a query would run single-threaded on a reference machine; to get an idea of what your query cost is, check your query execution plan. The default value for the cost threshold parameter is 5, so any query with an estimated cost above 5 will be considered for parallelism.

The second option is max degree of parallelism, commonly referred to as MAXDOP. This defines the maximum number of threads a parallel query will be given.  The default value for MAXDOP is 0, meaning SQL Server will not limit the number of parallel threads it will create for a query, but that’s usually not the best option. You can find a lot of documentation out on the ‘tubes concerning MAXDOP, but the general rule of thumb is to set it to half the number of processor cores you have, up to 8.
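Both settings live in the server configuration and can be changed through sp_configure. As a sketch, the values below (a cost threshold of 25 and a MAXDOP of 8) are illustrative only, not recommendations for your server:

```sql
-- Illustrative values only: raise the cost threshold so only more
-- expensive queries are parallelized, and cap parallel queries at
-- 8 threads. Both are advanced options, hence the first call.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;
```

Both changes take effect immediately after RECONFIGURE, with no instance restart required.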

The nice thing about MAXDOP is you can also use it as a query hint.  Using it, you can alter the degree of parallelism for your specific query and not have to worry about mucking with the server setting. Usually you’ll want to make use of this hint to set your max parallelism to 1 if you’re running a long running background task, like maintenance, and you don’t want it to interfere with your normal workload.
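For example, applying the hint to the AdventureWorks query from earlier in this post forces it to run single-threaded regardless of the server-wide setting:

```sql
-- OPTION (MAXDOP 1) overrides the server-level MAXDOP for this
-- query only, keeping a background task from going parallel.
SELECT COUNT(1)
FROM Sales.SalesOrderHeader
WHERE TerritoryID = 3
OPTION (MAXDOP 1);
```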

Decision Time

Knowing all this, how do we adjust our SQL Server to get the best use out of parallelism? The key is understanding which workloads take advantage of parallelism and which don’t. The first step is figuring out whether your server is an Online Transactional Processing (OLTP) server or a Decision Support System (DSS).

A DSS server, also known as a data warehouse, will benefit most from parallelism. Since your server will usually have a limited number of reporting queries with large datasets, you’ll want that workload spread out to get it processed faster. Also, because you’re dealing with data that is relatively static (it really only changes when you’re processing your warehouse loads), you’re not concerned with holding any locks on your databases, nor do you need to worry about preventing or delaying any data changes from happening while your parallel queries are running.  So parallelism is ideal for devoting as much horsepower to each query as possible.

This is also why parallelism is not a good choice for OLTP systems. Remember, when a CPU execution unit is working on a thread for your query, it won’t be working on anything else. This means that you could be queuing up and delaying important transactions waiting on parallel queries to complete. Since your OLTP system is usually expected to process a lot of small transactions very quickly, a couple of parallel queries could really bottleneck your system.  Instead, you want to make as many threads available to your processes as possible and not limit the number of processes you can handle at one time.  This is why you will usually see recommendations for low MAXDOP settings for OLTP, so you can make more CPU execution units available to your total transaction load.

When all is said and done, managing your MAXDOP and cost threshold for parallelism settings is very much part of the DBA art form, where you will need to test and evaluate your workload to find the settings that are appropriate for your server. Here are a couple of steps to get started with on your server:

  1. What kind of queries and how many transactions does my server see? Talk to your developers about the processes and take a look at your Batch Requests/second perfmon counter to help you evaluate this.
  2. Check your current MAXDOP and cost threshold for parallelism settings on your server.  Are they set to the defaults?
  3. What do your CXPACKET waits look like relative to other waits on your server? High CXPACKET waits might be an indication that the MAXDOP on your server is set too high.  To find out your waits, Glenn Berry has an excellent DMV query within his SQL Server 2008 Diagnostic Information Queries that will provide this information for you.

Once you have this information, you can make a plan to adjust your parallelism settings, but you’ll need to test any changes to your settings before implementing them.

Parallelism, while it may not have a tremendous impact on your system, is an important factor nonetheless. Understanding and making proper use of it will help you manage and improve performance on your server.  As an editorial note, I would like to see SQL Server give you some more fine grained control over parallelism, like Oracle does in their query hints.  Hopefully we’ll get to see that in a future release.

Thanks for reading and feel free to leave comments or questions below!