Yeah, a lot of people think it’s a dirty word. Certainly, many of the DBAs I’ve talked to look down their noses at NoSQL and scoff at the idea of a non-relational database. Full disclosure, I’ve counted myself amongst that number for a long while, at least until recently.
I’ve had heavy involvement at my current gig with non-relational data stores recently. This has forced me to adapt and “break” the mold on how I think about data and data management. It has been an interesting couple of months with lots of great learning (and kicking rust off of some of my lesser-used skill sets) and, while I’m nowhere close to being an expert in non-relational data systems, I’ve definitely come around to their benefits and wanted to share my thoughts so as to hopefully help others bridge the gap.
The first thing to keep in mind is that we’re still managing data. We’re storing information; it’s just an approach that is foreign to many of us trained with years of relational theory. The bonus, though, is that this is a far simpler method than we’re used to. Non-relational data stores center on storing key/value pairs, simple structured arrays where each entry has some ID and some value attached to it. Ask any developer about arrays, and the concept is crystal clear, but for those of us used to talking about keys and constraints, it sounds messy and disorganized. It’s a common disconnect between the worlds of code and database development, known as the infamous Object-Relational Impedance Mismatch problem.
What’s cool about a key/value pair approach is that we get a lot of flexibility around what we can store, depending on the system. It could be columns with different data, but it could also be a looser, more flexible structure stored (typically) as a JSON document. Other stores will allow your values to be PDFs, images, or other BLOB-type items. For developers, these two characteristics make non-relational platforms very attractive, because they are intuitive to work with and don’t tie them down with a lot of rules about how they manage data.
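To make the key/value idea concrete, here is a minimal sketch in Python. It is not any particular NoSQL product: a plain dict stands in for the store, and the key names and document fields are made up for illustration.

```python
import json

# A plain dict stands in for a key/value store: some ID mapped to some value.
# (Assumption: a real store would persist and distribute this; a dict does not.)
store = {}

# Values need not share a shape; each one is its own JSON document.
# The keys and fields below are hypothetical.
store["order:1001"] = json.dumps(
    {"restaurant": "Indian", "items": ["lamb vindaloo"], "total": 14.50}
)
store["customer:42"] = json.dumps({"name": "Pat", "phone": "555-0100"})


def get(key):
    """Look up a key and decode its JSON document, or return None if absent."""
    raw = store.get(key)
    return json.loads(raw) if raw is not None else None


order = get("order:1001")
customer = get("customer:42")
```

Notice that nothing forced the two documents to agree on their fields. That is the flexibility developers like, and it is also exactly what makes the relational side of the house nervous.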
“But wait!” you cry in your strongest E.F. Codd voice, “I need joins to enforce data integrity!” After all, this flexibility is great but it can also cause our data to rampantly sprawl, generating a management nightmare. Bad values, improperly stored information, where does it all end? This is before we even talk about the kind of performance you would get trying to query anything useful from this data. And these are all valid concerns…
…if we really were concerned about them.
This is the point where I had to take a step back from my relational view and really think about the data. We live in a data-driven world, where everything (HELLO NSA) is tracked and recorded. But how much of this is useful? Do people really care if I made a carryout order for Chinese food last Tuesday at 7:47 PM MDT, or that I made (on average) 10 orders a month from the same Indian restaurant (mmmm….lamb vindaloo)? These days, we don’t care so much about the detail, but instead want to analyze trends and patterns in the data as a whole. We still need the detail because it’s what forms the trends, but because of the sheer volume and variety of this detail, there comes a point where handling it in a traditional relational manner is inefficient. And we don’t need to query the detail fast; we just need to be able to query it periodically to build aggregates for our trend analysis and reporting.
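As a rough illustration of that last point, the sketch below rolls individual order records up into per-restaurant, per-month counts. The records, field names, and dates are invented for the example; the only claim is the shape of the work, which is a periodic pass over the detail to produce a small aggregate.

```python
from collections import Counter

# Hypothetical order details; in practice there would be far more of them,
# and each record could carry many more (and differing) fields.
orders = [
    {"restaurant": "Indian", "month": "2013-07"},
    {"restaurant": "Indian", "month": "2013-07"},
    {"restaurant": "Chinese", "month": "2013-07"},
    {"restaurant": "Indian", "month": "2013-08"},
]


def orders_per_month(records):
    """Count orders per (restaurant, month), ignoring the rest of the detail."""
    return Counter((r["restaurant"], r["month"]) for r in records)


trend = orders_per_month(orders)
```

The aggregate in `trend` is tiny compared to the detail behind it, and it is the part we actually report on. The detail only has to be readable in bulk every so often, not queried quickly row by row.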
This is why I boil things down into two generic categories, which help me understand usefulness and suitability. What we really have are Bags and Shelves. Duh, right? It’s not rocket science, but I think using these analogies helps us understand data management a little better. If we think about non-relational data stores as a bag, it becomes simple. A bag is flexible, can expand (to a certain point) and has very few restrictions on what you put in it. However, sometimes getting things out or managing the contents is cumbersome because it’s not very well organized. Compare this to our nicely organized set of RDBMS shelves. Everything is neatly classified, well organized, and easy to find. However, we’re limited on what we can put on our shelves (did you space everything far enough apart or build it big enough?) and, before we can put anything on those shelves, we have to take time to sort things out.
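Here is one way to picture the bag and the shelf side by side, as a small Python sketch. The bag is just a list, and the shelf is an in-memory SQLite table from Python’s standard `sqlite3` module. The table name, columns, and sample values are assumptions made up for the illustration.

```python
import sqlite3

# The "bag": a list accepts anything, organized or not.
bag = []
bag.append({"item": "book", "pages": 300})
bag.append("a loose receipt")

# The "shelf": a table whose shape is decided before anything is stored.
# (Hypothetical table and columns, with constraints standing in for the rules.)
shelf = sqlite3.connect(":memory:")
shelf.execute(
    "CREATE TABLE books ("
    "title TEXT NOT NULL, "
    "pages INTEGER NOT NULL CHECK (pages > 0))"
)


def try_shelve(title, pages):
    """Return True if the row fits the shelf, False if a constraint rejects it."""
    try:
        shelf.execute("INSERT INTO books VALUES (?, ?)", (title, pages))
        return True
    except sqlite3.IntegrityError:
        return False


fits = try_shelve("SQL Basics", 300)   # a well-formed row goes on the shelf
rejected = not try_shelve(None, 10)    # a missing title violates NOT NULL
```

The bag took both items without complaint, including one that isn’t even the same kind of thing. The shelf made us define its shape up front and then refused the row that didn’t fit, which is the trade-off in a nutshell.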
Thinking about data in this manner really helped me understand that these are two different tools for two different purposes. Do you have an application that is focused on displaying data, running reports, and doing analysis? You’re probably best off with an RDBMS where all the data is sorted and organized, with effective indexing and constraints. If your application is focused on taking in a lot of information, a non-relational platform might be more suited to your needs, because you don’t have to spend the up-front organization effort that would otherwise stand in the way of processing your data.
This sort of thinking is why I’m becoming a fan of Martin Fowler’s (@martinfowler) idea of Polyglot Persistence. Again, nothing earth shattering, but the idea is that we have different tools to solve different problems, so we should use those tools appropriately. This, unfortunately, is where I think the CTO-idea-of-the-month club gets it wrong, because they want to sell you NoSQL as a replacement, not as a complement. For whatever reason, people cling to a “here’s my hammer and every problem is a nail” mentality, even when it comes to data management.
And this is where we come in. As data professionals, we need to embrace non-relational data stores as a new tool in our toolbox, not pooh-pooh it as a fad. There’s some real value there, value I will talk more on in upcoming blog posts. As I often say about pop artists, they must be doing something right considering how successful they’ve been. We need to key in on what that right-ness is, understand it, and embrace it so we can guide organizations in effectively managing their data.
Data. It’s all about the data, and as data professionals our most important job is managing the data.
As you can probably tell, I’ve only scratched the surface here. There’s so much to the world of non-relational data stores that I can’t get all my thoughts out in one post. In the next post, I’ll share with you what I’ve learned about the much-touted horizontal scalability and some of the concepts wrapped up in that.