Art of the DBA Rotating Header Image

NoSQL

Divide and Conquer

So far, I’ve talked about NoSQL’s use of key-value pairs for storage along with its partitioned nature.  These two concepts are useful, but you’ve probably noticed that it can make querying a lot more difficult, particularly for aggregating data.  That’s where the third and final concept that I want to discuss comes in to play, the process that non-relational data stores use to analyze all the data it has stored across its nodes: MapReduce.

At the core, MapReduce isn’t some wild concept.  The idea is to marshall all of our compute resources to analyze and aggregate data.  Simply put, there are two phases that a non-relational system will use to aggregate data:

  • Map – On each individual node, collect and aggregate the data that matches our predicate.  Pass this to a higher node.

  • Reduce – Aggregate on a “master” node the values passed from the individual nodes.  Once the work from all the individual nodes has been aggregated, we have our answer.

As with most of this NoSQL stuff, it is not rocket science.  It does, however, make more sense if we visualize it.  Let’s say, for example, we wanted a count of all our sales orders within a sales territory and we had a 3-node NoSQL cluster (platform is irrelevant).  A basic MapReduce tree would look something like this:

map-reducesimple

This isn’t all that much different than how SQL Server (or another RDBMS, for that matter) would process it.  If we take the same concept and apply to a basic query from AdventureWorks, we can see the SQL approach to it:

select count(1) from sales.SalesOrderHeader
where TerritoryID = 3

map-reduceqp

You can see from this basic query plan that the only real difference here is that the mapping piece happens on a single node (your instance) rather than multiple nodes.  But otherwise, the process of gathering all the data we need to calculate and then calculating our total aggregate is the same.  It’s the idea of scaling this out over multiple compute nodes that is the major difference.

And scaling is huge.  3-nodes is a piece of cake, but what if we had 10?  Or 30?  Or 100?  Again, the idea is that non-relational data stores can handle terabytes and petabytes of data and you can scale out horizontally to meet those needs.  When we do this, our MapReduce tree can scale to our needs, so we might get something like this:

map-reducefull

Notice the multiple reduce steps.  Just as we can divide and conquer horizontally and collect data across that scale, we can also also do vertical scaling and have multiple reduce phases to further scale the load.  It’s all part of the massively parallel processing model that NoSQL promotes.

There are disadvantages to this, but most of them boil down to the implementation.  Primarily it’s driven around the need in many NoSQL platforms to write your own MapReduce functions using APIs or embedded languages (like JavaScript or platform specific functions).  May platforms are working on SQL-like alternatives to ease this learning curve, but there’s still a lot of growth and maturity needed here and I’m not sure how standardized any of these options really are.

Of course, these MapReduce queries aren’t necessarily going to smoke an RDBMS in terms of performance.  It’s all a matter of design and how you apply the tool.  MapReduce processes are designed, primarily, for batch aggregation of massive amounts of data.  Quantities of data that may not necessarily make sense to store in an RDBMS structure.  As we always say in the database admin world, it depends.  The goal here is to understand how this query process works so we can properly apply it to our own implementations.

Wrapping Up

Hopefully this three-parter has been helpful and educational for you.  I’ve certainly learned a lot about non-relational databases over the past 6 months and my eyes have been opened to new ways to manage data.  That’s really what we’re all here for, anyway.  As the world has more and more data to manage, we as data professionals need to find ways to store it, manage it, and make it useful to those who need it.  NoSQL is just another tool in our toolbox, a hammer to go with our RDBMS screwdriver.  Both valuable, if we use them appropriately.

Before I move on, I did want to share some resources I’ve found helpful in getting my head around non-relational databases:

  • Seven Databases in Seven Weeks – I’m actually not done with this, but it’s a great book for getting your feet wet with several different types of databases.  Be warned, this is not for the faint of heart and there’s a LOT you’re going to have to figure out on your own, but it will guide you through some basic concepts and use cases.

  • Martin Fowler’s Introduction to NoSQL – This is a great hour long presentation that talks about a lot of what I’ve covered here from a different angle.  It’s got great information and explains things at a good level.

  • Ebay’s Tech Blog on Cassandra Data Modelling – Getting my head around how you model data in a Big Table structure was really, really hard.  This two parter will help you figure this out.

I also want to thank Buck Woody(@BuckWoody) for opening my eyes to the world of the cloud and big(ger) data.  It was partly from a conversation with him that I went down this path and the learning that has come out of it for me has been tremendous.

Sharing the Load

Businesses have always struggled with the high cost of relational database servers, both from a hardware and licensing perspective, because current RDBMS platforms take the approach that you will have one or two powerhouse servers handling your data workload.  While this has been the commonly accepted practice, the evolution of the cloud and better distributed computing architectures have given rise to new approaches for data storage and processing.  Now businesses are rethinking their approach to infrastructure with these new approaches, finding ways to provide better performance and higher availability all at a lower price point.

Non-relational data stores embrace this new approach with how they cluster.  Designed in response to the single, super-pricey database server, non-relational systems take the approach of clustering many inexpensive servers into one database.  This is sold as having many advantages to businesses, such as:

  • Horizontal scalability, allowing business to better adapt to growing workloads.  Need more power? Just add another commodity hardware server, you’re not trapped by the box you bought last year.

  • Distributed data across many machines, making the database more available and increasing the durability against failure.  Think of it as RAID for your data, so that if one box goes down, the system is still up and you just need to replace the one failed box.

  • Costs are lower and more manageable.  Well, at least more manageable, where you can buy only the computing power you need and grow incrementally over time.  However, there’s a lot of factors (from virtualization to overall hardware costs) that make the total price point versus a beefy relational server fairly comparable.

It’s certainly a departure from how we’ve dealt with relational platforms to date (though I’m super intrigued by Google’s F1 database).  There are some definite advantages, but most tech people look at this and say “Well, that sounds fancy, but does it blend?”  And by blend, we mean work.  The answer has actually been around for some time now and many large businesses already use it in their relational databases:  partitioning.

Partitioning is a concept we’ve had in the relational world for some time now (and feel free to look at my partitioning posts for more info).  Even before the major RDBMS platforms introduced their own partitioning functionality, DBAs would partition data out into different tables and bring them together as a view (commonly known as partitioned views).  What the non-relational folks do is take it a step further, saying that instead of just partitioning your data across different disks, why not distribute it across different computers?  By doing that, businesses not only get to spread the workload across all these computers, but their data availability is no longer dependent on a single OS/application instance.

Of course, what this means is that non-relational data stores still need some way to determine how data is partitioned.  Remember last week when I talked about all non-relational stores being key/value stores?  Well, those keys become even more important because they become the partitioning key (also called a shard key, by some), the value that non-relational platforms use to divvy up the data between its partitions.  Now, most of these datastores will analyze the key values and attempt to balance that data as evenly as possible, but it’s something you need to be aware of when modelling out your data.  A poorly chosen partition key could mean you’re negating your cluster advantage because your data will end up in one or two nodes.  We’ll come back to this in a bit.

At this point we should talk a bit about the CAP Theorem.  This bit of computer science was introduced back in 2000 and conceptually defines the limitations of a distributed computing system.  The gist of it is that, while we’re trying to accomplish all these wonderful things, any clustered platform cannot guarantee:

  • Consistency (That all the data on all the nodes is the same)

  • Available (That all the data is the same on all the nodes)

  • Partition Tolerant (That the application will remain up despite the loss of one of its partitions)

Commonly, you’ll hear people say that you can only have two out of these three things at any time.  What this means to non-relational platforms is that, since they’re all designed to be partition tolerant with their clustered nodes, we have to chose between being consistent or available.

This is where eventual consistency comes in to play.  See, the problem with our clusters is that we still have the laws of physics involved.  When data is written and that data is distributed across multiple nodes, then there is a matter of time where data has to be communicated across a network and written to disks in different computers.  Most non-relational systems take the availability path here, meaning that data will be offered up from the various nodes, even if it’s not all up to date.  The idea is that it will be eventually consistent, but you might have to wait a bit.

Now our relational brains might kick back on us.  “Shouldn’t our data be 100% accurate all the time!?!?”  ACID compliance, right?  It depends!  Once again, we have to ask ourselves what data is being stored in our database.  Much of this detail only needs to be accurate at a point in time within an application, once it gets into the database we have a little more time before we have to deal with it.  If not, you might consider using a relational database for your data.  It’s all about using the right tool for the job.

One big concern coming out of this distributed model is the question of backups.  Non-relational folks will tell you that backups for disaster recovery aren’t really necessary because you have it built in.  However, most DBAs I know (and I’m one of them) will tell you that disaster recovery and high availability are not the same thing.  And because of the distributed nature, backups become really complex.  So you have to ask yourself when designing or implementing one of these solutions, how will you handle this?  In some cases, it might be a simple matter of backing up the files, but my research so far has shown that this requires an operating system file lock in order to keep things consistent.  Which means you stop your application (i.e., no hot database backups).  There might be other alternatives, either available or in development (the vendors of these platforms recognize the gap), but be aware that it will be a question you have to address.

The key takeaway for this post is that the distributed nature of non-relational systems is a huge advantage for them, but it has some hurdles to overcome.  It’s simply not a magic bullet to all your data problems.  Again, the theme here is using the appropriate tools for the appropriate purposes.  Recognizing the strengths and weakness is key to knowing when you should use a tool and, because NoSQL is a hot trend, it’s vital to understand it in order to properly guide your company’s conversations around the technology.

Next up, I want to talk about how non-relational systems leverage this dispersed architecture for querying and analyzing data.  It’s a strong offering, but not always fully understood.  Hopefully these last two posts are getting you interested in this realm of data management, but remember I’m only touching the surface here.  NoSQL is as deep (though not nearly as mature) as the RDBMS world and there’s a lot more learning than I can provide here.

Bags and Shelves

NoSQL.

Yeah, a lot of people think it’s a dirty word.  Certainly, many of the DBAs I’ve talked to look down their noses at NoSQL and scoff at the idea of a non-relational database.  Full disclosure, I’ve counted myself amongst that number for a long while, at least until recently.

I’ve had heavy involvement at my current gig with non-relational data stores recently.  This has forced me to adapt and “break” the mold on how I think about data and data management.  It has been an interesting couple of months with lots of great learning (and kicking rust off of some of my lesser used skill sets) and, while I’m no where close to being an expert in non-relational data systems, I’ve definitely come around to their benefits and wanted to share my thoughts so as to hopefully help others bridge the gap

The first thing to keep in mind is that were still managing data.  We’re storing information, it’s just an approach that is foreign to many of us trained with years of relational theory.  The bonus, though, is that this is a far simpler method than we’re used to.  Non-relational data stores center on storing key/value pairs, simple structured arrays that have some ID and some value attached to it.  Ask any developer about arrays, and the concept is crystal clear, but for those of us used to talking about keys and constraints, it sounds messy and disorganized.  It’s a common disconnect between the worlds of code and database development, known as the infamous Object-Relational Impedance Mismatch problem.

What’s cool about a key/value pair approach that we get a lot of flexibility around what we can store, depending on the system.  It could be columns with different data, but it could also be a more loose and flexible structure stored (typically) like a JSON document.  Other stores will allow your values to be PDFs, images, or other BLOB type items.  For developers, these two characteristics make non-relational platforms very attractive, because they are intuitive to work with and don’t tie them down with a lot of rules about how they manage data.

“But wait!” you cry in your strongest E.F. Codd voice, “ I need joins to enforce data integrity!”  After all, this flexibility is great but it can also cause our data to rampantly sprawl, generating a management nightmare.  Bad values, improperly stored information, where does it all end?  This is before we even talk about the kind of performance you would get trying to query anything useful from this data.  And these are all valid concerns…

…if we really were concerned about them.

This is the point where I had to take a step back from my relational view and really think about the data.  We live in a data driven world, where everything (HELLO NSA) is tracked and recorded.  But how much of this is useful?  Do people really care if I made a carry out order for Chinese food last Tuesday at 7:47 PM MDT, or that I made (on average) 10 orders a month from the same Indian restaurant(mmmm….lamb vindaloo)?  Any more, we don’t care so much about the detail, but instead want to analyze trends and patterns in the data as a whole.  We still need the detail because it’s what forms the trends, but because of the simple volume and variety of this detail, there comes a point where handling it in a traditional relational manner is inefficient.  And we don’t need to query the detail fast, we just need to be able to query it periodically to build aggregates for our trend analysis and reporting.

bags_shelvesThis is why I boil down things into two generic categories, which help me understand usefulness and suitability.  What we really have are Bags and Shelves.  Duh, right?  It’s not rocket science, but I think using these analogies help us understand data management a little better.  If we think about non-relational data stores as a bag, it becomes simple.  A bag is flexible, can expand (to a certain point) and has very few restrictions on what you put in it.  However, sometimes getting things out or managing the contents is cumbersome because it’s not very well organized.  Compare this to our nicely organized set of RDBMS shelves.  Everything is neatly classified, well organized, and easy to find.  However, we’re limited on what we can put on our shelves (did you space everything far enough apart or build it big enough?) and, before we can put anything on those shelves, we have to take time to sort things out.

Thinking about data in this manner really helped me understand that these are two different tools for two different purposes.  Do you have an application that is focused on displaying data, running reports, and doing analysis?  You’re probably best off with an RDBMS where all the data is sorted and organized, with effective indexing and constraints.  If you’re application is focused on taking in a lot of information, a non-relational platform might be more suited to your needs, where you don’t have to expend the organization efforts that will stand in the way of you processing your data.

This sort of thinking is why I’m becoming a fan of Martin Fowler’s (@martinfowler) idea of Polyglot Persistance.  Again, nothing earth shattering, but the idea is that we have different tools to solve different problems, so we should use those tools appropriately.  This, unfortunately, is where I think the CTO-idea-of-the-month club gets it wrong, because they want to sell you NoSQL as a replacement, not as a complement.  For whatever reason, people cling to “here’s my hammer and every problem is a nail” mentality, even when it comes to data management.

And this is where we come in.  As data professionals, we need to embrace non-relational data stores as a new tool in our toolbox, not pooh-pooh it as a fad.  There’s some real value there, value I will talk more on in upcoming blog posts.  As I often say about pop artists, they must be doing something right considering how successful they’ve been.  We need to key in on what that right-ness is, understand it, and embrace it so we can guide organizations in effectively managing their data.

Data.  It’s all about the data, and as data professionals our most important job is managing the data.

As you can probably tell, I’ve only scratched the surface here.  There’s so much to the world of non-relational datastores, I can’t get all my thoughts out in one post.  In the next post, I’ll share with you what I’ve learned about the much touted horizontal scalability and some of the concepts wrapped up in that.