
Fixing the #passvotes problem

Following up on my last post, it looks like PASS has addressed the election issues.  The gist of the fix is:

  • The voting deadline has been extended to October 14.
  • Members have until October 5 to update and validate their accounts.  Go do this.

My thanks to the PASS Board of Directors for doing the right thing on this.  As an organization, you should always be concerned with the care and service of your current membership.  I still have one question, though:  Are we addressing the symptom or are we fixing the problem?

The last week and the issues that came out of it were a mess, and the Board made a great effort to fix it.  But this isn't a long-term fix; it was simply a one-time repair of the problems caused by the change in process.  What I want to know is: what is PASS doing to prevent a similar problem next year?  My suggestions to PASS are:

  • You cannot rely on email alone to communicate changes like this.  Make sure you are leveraging other methods (Twitter, blog posts, LinkedIn).
  • If accounts need to be validated, build this validation into the website.  Force members to update/review their profile when they log in.

Now that I’ve been burned by this, I will certainly pay more attention to how PASS is managing my account.  However, PASS needs to think about folks who aren’t as vigilant and be proactive about communicating changes.  Connect, learn, and share isn’t a one way street.

The Great #PASSVotes Cock-Up (of 2014)

I really hate having to return to my blog to write this.  I should be putting up another post about Powershell, SQL Server, and happy life in general.  Alas, this is not the case, and I need to express some thoughts publicly regarding the recent PASS Board of Directors vote.  I've made quite a bit of noise around this on the Twitterz, so it's only fair I put out my own big post on this to suffer the slings and arrows of outrageous fortune.  Or some other such nonsense.

For those unaware, this week is the vote for the PASS Board of Directors.  Four candidates, all noble in purpose, want to contribute to the SQL community in a leadership role.  It's a song and dance most of us have been through before and I applaud these folks for stepping up.  There's usually a hiccup in the process, but this time around it's one that is pretty unpleasant in my eyes.

First, let’s talk about the problem.  The SQLPass.org registry is a hot mess.  Folks have multiple accounts, old email addresses, and internet profiles that have been collecting dust for a long while.  The organization has been trying to clean these up (which I’m all in favor of) and this time they said “Everyone should update their profile and we’ll send LOTS of emails about it so they know!”  If you didn’t, well you didn’t get to vote.  You’ll get cut off and have to file again the next time around.

I’m one of those people that missed the email and didn’t update my profile.  The reasons are simple:

  1. I, like many others, get a ton of email.  Lots of it goes to spam.  For the things that don't, I'll glance at the subject line and only open them if they interest me.  PASS sends me a LOT of stuff, most of which I don't open because I don't have time.  So I can't tell you if I skipped it or it got spammed, but I didn't get the message.
  2. Very little has changed for my profile.  I usually keep my online profiles minimal anyway because, well, spam.  Call it wearing my internet condom, but I only put stuff online that I absolutely must.  The only thing I’d like to update is my account login because the domain is no longer around.  Alas, I can only do that if I send an email to PASS HQ and have the staff intervene behind the scenes.

So there ya’ go, I didn’t update, I don’t get to vote.  Sounds pretty cut and dried, except that I don’t see it that way.  Mostly because of the way PASS decided to handle this, then how they decided to communicate this, and finally how they’ve handled things after the whole thing went up in flames.

Updating your Profile

Let’s talk about the process.  Update your profile?  I mean, I guess, but how many of us do that on a regular basis?  I’ve got so many online profiles I could probably turn them all into sheep and start a mildly profitable wool business in New Zealand. I’m not going around to update these unless I’ve got a reason to.

However, PASS has other ways (or should) to identify my active profile.  I’m the Vice President of Events for the Denver SQL Server Users Group.  I log into the site 3-4 times a month to manage chapter events, send out emails, and generally keep things running.  Are there no logs of my use of the site?  I use the same account (linked with sqlpass.org if you didn’t know) and have used it to register with 6 SQL Saturdays this year.  To insist that a profile update is the only way to know my account was active is silly.

Let's go with having to update the profile, though, since that's what they want.  Since you want folks logging in to the site, can't you just force them to update their profile on log in?  I have to do it for other websites; it's not a new concept.  I'd have been annoyed, but by forcing me to complete that, PASS would have accomplished their objective.

Communication

Email.  Always with the email.  With a tech organization, you would expect them to understand the kind of impact email has on our daily lives.  We get tons of it, so we build filters and rules to manage it.  Email is a mess and it’s not getting any better.  (Believe me, I worked for an e-mail marketing firm for a while.)

So why does PASS expect people to always read every single email they send?  Sorry, I've got too much going on.  "But Mike, that's no excuse!"  Fine, but I expect someone running a tech-based community to have enough sense to realize that email shouldn't be the only means of communication.  Sure, Bill Graziano made a single blog post in January (6 months before the deadline), but no others?  No website notices?  Banners?  As Chris Carter would say, "C'mon man!"

The Aftermath

So here we are, about 48 hours after this all went nuclear.  There's been quite the Twitter storm and lots of chatter.  I've received 4 pieces of communication from PASS personnel:

  • 2 form emails from PASS HQ essentially saying “Yeah, you can’t vote”.
  • A direct message from a board member saying “I am fighting to make this right, have patience.”
  • A direct message from a different board member saying "I want to talk about how we should have done this better."

First, the DMs.  To the first, thank you.  This is what we need on the board.  Someone fighting for the community, because without us there is no PASS.  The second…well, I know we need to make things better in the future.  But we also need to make things right.  Now.

The form emails are form emails.  Enough said, we've gotten enough crap like that over the years.  They might as well have just gone to my spam box.

This is what really angers me.  I’ve been an active member of PASS for 4 years, presenting at SQL Saturdays, spending money out of my own pocket to attend Summit, helping run the Denver users group, and ultimately trying to show people why they need to be in on this.  I’ve been busting tail on behalf of this organization for a while now and it sure as hell feels like I’m being told to “shut up and sit at the back of the bus.”

What also rubs me raw is that we've had a dust-up over the re-branding of PASS to be a more inclusive data organization.  Let's set aside that debate for another time, but right before the vote opened, we had this post on the SQLPass.org blog on how we need to have a bigger umbrella.  Great!  Sure!  On the heels of this we then have an event where many folks (not just me), who have been serving PASS for years, get essentially shoved out from under the umbrella due to a technicality.  I'm pretty sure this wasn't the intended message, but timing is everything and this one was pretty poor.

No, I don’t think any of this was intentional.  No, I’m not looking for a grand conspiracy to disenfranchise folks.  This was an accident, one that I believe came from poor planning and execution.  Like any mistake, however, the way you recover from it defines you and I’ve got a dim view of how this is being handled post-mistake.  The party line is “Can’t do anything, our hands are tied.”  That’s a line I just can’t swallow, considering the number of long shifts and incident meetings I’ve had to participate in to fix business mistakes.  The right thing can always be done, it takes the will and desire to see that it happens.

TL;DR

I'm calling on PASS to do the right thing.  I'm asking the Board to represent the community that they stand for.  For the third time this year, we the community are calling for transparency and accountability from our organization (the other two?  Rebranding and speaker/precon selection for Summit 2014).  If we can't fix this, we need to know some details around why: why this approach was chosen given the disparity of systems, why we are unable to pull additional active voters into the mix, and so on.

To the Board, this is not a personal attack on you.  I know most of you, you are good folks and I like you.  But mistakes are being made and your community needs you.

To everyone else, thanks for reading.  It’s a big wall of text, but one I felt I had to write.    Am I going to get my vote?  Probably not.  Am I giving up on PASS?  Not on your life.  I will continue to tell everyone I can that PASS is the most supportive and open tech community I’ve ever been a part of and will strive to make that so.  Don’t give up faith in the community because, no matter how it gets branded or who’s at the top, we’re the ones that make PASS special.  Connect, learn, and share with those in your office, city, state/province, and country to make PASS what you want it to be.

 

RestoreAutomation #Powershell Module

When you become involved with the SQL community, you see a lot of great tools contributed to make life easier for the rest of us.  The functions of these tools range greatly, from helping to analyze your system state and collect data about your enterprise to performing regular maintenance.  I've used many of these tools over the years and can't imagine being a DBA without them.  Well, now it's time to give back.

A little over two years ago, I posted a Powershell script for building database restores.  It has been very handy and I’ve used it a lot over the years.  However, it always felt like it could use some polish.  I began work a few months ago to add that polish and was able to debut this output at SQL Saturday Sacramento a few weeks ago.  Now I want to share that same output with you here on my blog.

RestoreAutomation.psm1

I’ve compiled the previous Powershell script into a module, making it easily transportable and allowing DBAs to use it more effectively as a tool.  No messing with script files, all you need to do to use it is download the module file and then import it into your Powershell session.  The file is now available on my new GitHub repository and can be accessed via the following links:

https://github.com/MikeFal/PowerShell/blob/master/RestoreAutomation.psm1
https://raw.githubusercontent.com/MikeFal/PowerShell/master/RestoreAutomation.psm1 (raw version)

To install the module, follow these steps:

  1. Open one of your default Powershell module locations (listed in $env:PSModulePath).  I recommend your user-specific one.
  2. Create a folder named RestoreAutomation in that location.
  3. Place the RestoreAutomation.psm1 file in the new folder.

Full Microsoft install instructions can be found here.
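
If you'd rather script those steps, here's a minimal Powershell sketch.  It assumes the user-scoped module path under Documents\WindowsPowerShell\Modules and that the .psm1 file was downloaded to your Downloads folder, so adjust the paths to fit your environment:

#Create a RestoreAutomation folder under the user-scoped module path (one of the paths in $env:PSModulePath)
$moduleDir = Join-Path ([Environment]::GetFolderPath('MyDocuments')) "WindowsPowerShell\Modules\RestoreAutomation"
New-Item -Path $moduleDir -ItemType Directory -Force | Out-Null

#Copy the downloaded module file into the new folder (the source path here is hypothetical)
Copy-Item "$env:USERPROFILE\Downloads\RestoreAutomation.psm1" -Destination $moduleDir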

To load the module once you’ve installed it, just open a Powershell session and enter Import-Module RestoreAutomation.  After that, you can see the available functions by using Get-Command -Module RestoreAutomation, which will provide the following list:

CommandType     Name                ModuleName
-----------     ----                ----------
Function        Get-DBCCCheckDB     RestoreAutomation
Function        Get-Header          RestoreAutomation
Function        Get-RestoreObject   RestoreAutomation
Function        New-Restore         RestoreAutomation
Function        Sync-DBUsers        RestoreAutomation

Using the module

While five functions are listed, two of them are internal and used by the core restore process to generate the necessary code.  I’ll cover the primary three functions here, but I have made full use of Powershell’s comment based help, so you can use Get-Help on these functions at any time to get information on how to use them.

New-Restore
This is the core functionality that was created from my initial script.  When you run this function, it will create a .sql script in your Documents folder to restore the associated files.  The function takes a directory path and then searches that directory path for all backup files within it, ordering them appropriately (latest full, then latest differential, then the right sequence of transaction logs).

You’ll see that the function has a lot of parameters.  The only two required are:

  • dir – The directory where the necessary backup files live
  • server – A SQL Server instance where the script generation commands will be run (but not necessarily restored to)

I will not cover the optional parameters in detail here, but review Get-Help New-Restore -full for a full listing of the parameters and what they do.

The function makes the following assumptions:

  • All the backup files in the directory belong to one database only.
  • Backup file types are defined by extension (.bak for full, .dff for differential, and .trn for transaction log).
  • No striped backup files, meaning a database backup will be contained on only one file.
  • No backup sets, meaning a backup file only contains a single database backup.

To use the function, simply get a directory of backup files and run a command similar to:

New-Restore -dir "X:\backups\db" -server "localhost"

Then, look in your Documents folder for a restore_db.sql file.  This will be your database restore script.

If you want to test drive it yourself using a dummy database, I’ve provided a full demo script and files for download.

Sync-DBUsers
I added this function to help me with database migrations.  Often, I'll migrate a database via backup and restore.  Afterwards, I always need to clean up the orphaned users.  This function is a wrapper for the SQL queries I use to do this, providing me an easy-to-use interface for managing the process.

The function accepts only two parameters:

  • server – The SQL instance that the database resides on
  • database – Database that contains users you wish to correct

To run the function, just call your server and database name:

Sync-DBUsers -server "localhost" -database "restoredemo"

The function will then find all orphaned users and correct them.  Any orphans it cannot find corresponding logins for, it will return as an output list for review and correction.

Get-DBCCCheckDB
When building out my automated restore process, I always wanted to run DBCC CHECKDB on the restored database to validate everything was ok.  So I wrote this function to give me a manageable interface for the process.  By default, the function runs DBCC CHECKDB with PHYSICAL_ONLY against the target database and returns the results.

The function has three parameters:

  • server – The SQL instance that the database resides on
  • database – Database that you want to check
  • Full – Switch parameter.  If used, the function will execute a full DBCC CHECKDB.  If omitted, it will only perform a check with PHYSICAL_ONLY

To use the function, call the server and database name:

Get-DBCCCheckDB -server "localhost" -database "restoredemo"

The output will be a data table of the check results.
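
If you want the full consistency check instead of just the physical one, include the switch described above:

Get-DBCCCheckDB -server "localhost" -database "restoredemo" -Full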

Only the beginning

Hopefully this module can help you as much as it has helped me.  The point is to use this as a tool to facilitate other tasks.  Many of these tasks I will provide examples of over the next month or so, to demonstrate how you can leverage this in your own environment.

I already have a list of enhancements that I plan to add.  I expect that as folks out there review it and play with it, I’ll get more.  Please do not hesitate to post suggestions and comments to the GitHub repository.  The idea, as with many of these other tools, is to build a reliable offering that other DBAs can use in their day to day jobs to make things easier and allow them to work on more important tasks and projects.

TIL: Broken Log Chains

Something that bites me in the butt from time to time is restoring a database in full recovery model and forgetting to take a full backup right away.  Then I find out shortly after, when my log backups start failing and spamming me with error messages.  I'm fairly certain this has happened to you as well.  It's an easy thing to forget, especially when you're dealing with non-critical databases.

Humans are typically bad at remembering things.  That’s why we make task lists, have calendar reminders, document processes, and script for automation.  We either have to remind ourselves to do something or (my preference) build something that’s smart enough to correct our oversight.  So when I found a way to do this with broken log chains, I was pretty happy.

The key was a new dynamic management view (DMV) I discovered.  Most people who know me know I love the DMVs and all the information they can provide.  My new discovery is the sys.database_recovery_status view, which provides recovery info for all the databases.  The key field is the last_log_backup_lsn field, which will be NULL if the log chain has somehow been broken.

What's cool about this is that we now have a programmatic way to validate whether a full backup needs to be performed.  By adding a quick check to a log backup script, we can make sure we never have a broken log chain.  A quick and dirty way would be:

if (select last_log_backup_lsn from sys.database_recovery_status where database_id = db_id('foo')) is NULL
begin
    backup database [foo] to disk=<<path to full backup>>
end

backup log [foo] to disk=<<path to log backup>>

This is obviously the simplest approach you can use, but the pattern is the important piece.  By combining this with your standard backup scripts (please tell me you have standard backup scripts), you can ensure that you never get an error because you forgot a full backup.
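
If your backup jobs live in Powershell instead of T-SQL, the same pattern might look something like this rough sketch.  It assumes the SQL Server Powershell cmdlets (Invoke-Sqlcmd and Backup-SqlDatabase) are available, and the database name and backup paths are just placeholders:

#Check the DMV for a broken log chain on the hypothetical database "foo"
$lsn = Invoke-Sqlcmd -ServerInstance "localhost" -Query "select last_log_backup_lsn from sys.database_recovery_status where database_id = db_id('foo')"

if($lsn.last_log_backup_lsn -is [System.DBNull]){
    #No unbroken log chain, so take a full backup first
    Backup-SqlDatabase -ServerInstance "localhost" -Database "foo" -BackupFile "X:\backups\foo_full.bak"
}

#Now the log backup is safe to run
Backup-SqlDatabase -ServerInstance "localhost" -Database "foo" -BackupAction Log -BackupFile "X:\backups\foo_log.trn"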

Because your script will remember for you.

 

#tsql2sday 52: Stop depending on “it depends”

It seems to be a standard practice for data professionals to say "It depends" when asked for suggestions, recommendations, or outright solutions.  This isn't surprising, because in the "magic" world of technology the actual solution for a problem usually requires some very specific information, and any number of solutions could work.  Add to this that the nature of our work usually puts us at the whip end when things break, so we are naturally hesitant to draw a line in the sand.

Stop it.

Let's face it, much of what we do is fairy dust and moonbeams to those outside our industry.  They don't understand what we do, just that we make it happen.  And that's what managers, directors, veeps, and c-fillintheblank-os want when they ask us questions: how can we make it happen and/or how do we make the broken thing go away?  The problem with saying "It depends", especially in situations like that, is that it sounds more like "I don't know", and that's not what you're getting paid for.  You're the technical expert.  Whether you're a consultant or a full-time employee, a seasoned veteran or a fresh-faced n00b, you're being paid to understand the technology and be able to implement solutions.

Many data professionals, however, are afraid to take this sort of stand.  A lot of this stems from the tremendous amount of risk involved, particularly when tied in to the heavy responsibility we tend to bear.  Our job is to protect the company’s data assets and if we screw up it’s not just our head, but it could be seriously damaging to our company.  So we like to hedge our bets.  Unfortunately, a lot of people in our profession will use the phrase “It depends” as a dodge because they’re afraid of completely taking on that responsibility.

Ultimately, they’re afraid of failure.

It’s a common mantra in life coaching that we can’t be afraid of failure.  Failure is what we learn from and how we grow.  Make a mistake, analyze it, and grow from it.  We’re only human, however, and screwing up is scary.  We don’t want to look bad and we don’t want to get in trouble.  That doesn’t help the people who are looking to us for help.  This is when saying “It depends” as a shield turns into a roadblock, hindering both you and your organization from getting anywhere.

So what do we do about it?  Next time someone asks you for a technical opinion and you don't have enough detail, ask for that detail.  What's the RTO/RPO?  How quickly does this need to perform?  What's our budget?  Ask questions like that to refine the answer.  Maybe outline a solution, but caveat it with qualifiers, such as "I'd probably put tempdb on SSDs, but that assumes we can afford that sort of hardware."  Maybe there's a better way to do it, maybe you're just wrong.  But it's ok to make a mistake, as long as you're not making the same mistake over and over.

Most of all, I'd suggest simply removing "It depends" from your vocabulary.  There are a dozen ways to get across the point that a solution will require some thought and planning, but I found that when I was forced to use something other than this two-word quip, I had to work harder on my response to really explain the dependencies for a solution.  And the non-technical folks around you are ok with that.  Sure, they don't want you talking above their head, but they also need to know why things need to be done a certain way.

Some folks might call this leadership.  Others, sticking your neck out.  Still others might call this being unnecessarily risky.  I call it doing your job.  Like I said, companies pay us to know our field; it's time we act like it.  Data is our world, very few people live in it and understand it the way we do, so own that knowledge and champion it.  And stop hiding behind that phrase, "It depends".

(This month’s T-SQL Tuesday, a grand tradition started by Adam Machanic, is being hosted by Michael J. Swart(@MJSwart).  Great topic choice, sir!)

A PoSH way to skin the cat

Today my friend John Morehouse(@SqlrUs) posted a handy T-SQL script to configure your SQL Server memory based on some rules provided by Jonathan Kehayias(@SQLPoolBoy).  This is a great script and helps automate a lot of the work a DBA needs to do when setting up a new instance.  Hopefully, you know by now about my passion for automation and standardization.  However, when reading John's post, I had the following internal conversation:

“Wow, self, John’s got a neat little script here.”

“Yeah, but T-SQL is so passé.  Everyone does it.  Plus, not sure how easily we could add that into a server deployment script if we wanted total hands-off-ness.”

“Self, I know what you mean.  Nothing wrong with it, it just doesn’t fit my own particular….oh…particular….”

Idiom?

“Yes!  Idiom!  Hrmmm…how about a Powershell script?”

“A Powershell script?  Brilliant!”

And so I sat down and cranked out the following script, which essentially does what John’s script does, just using Powershell.

<#
.SYNOPSIS
 Configures a SQL Server instance per Jonathan Kehayias' guidelines.
.DESCRIPTION
 This script will configure your SQL Server instance per the guidelines
 found in Jonathan Kehayias' blog post: http://www.sqlskills.com/blogs/jonathan/how-much-memory-does-my-sql-server-actually-need/
 The rules are:
 - 1GB for initial OS reserve
 - +1GB per 4GB server RAM up to 16GB
 - +1GB per 8GB server RAM above 16
.PARAMETER instance
 SQL instance name, i.e. localhost\SQL2012, DBASERVER01
.PARAMETER apply
 Switch parameter, call if you want to actually apply the changes. Otherwise, a report will be produced.
.EXAMPLE
 Configure-SQLMemory -instance DBASERVER01 -apply
#>

param([parameter(Mandatory=$true)][string] $instance
 , [Switch] $apply
 )

#load SMO
[System.Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.SMO') | out-null

if($instance.Contains("\")){
    $sqlhost = $instance.Split("\") | Select -First 1
}
else{
    $sqlhost = $instance
}

#set memory variables
$totalmem = (gwmi Win32_ComputerSystem -computername $sqlhost).TotalPhysicalMemory/1GB
$sqlmem = [math]::floor($totalmem)

#calculate memory
while($totalmem -gt 0){
    if($totalmem -gt 16){
        $sqlmem -= [math]::floor(($totalmem-16)/8)
        $totalmem = 16
    }
    elseif($totalmem -gt 4){
        $sqlmem -= [math]::floor(($totalmem)/4)
        $totalmem = 4
    }
    else{
        $sqlmem -= 1
        $totalmem = 0
    }
}

#Report the current and proposed values; if -apply was passed, alter the configuration
$srv = New-Object -TypeName Microsoft.SqlServer.Management.Smo.Server $instance
"Instance:" + $instance
"Max Memory:" + $srv.Configuration.MaxServerMemory.ConfigValue/1024 + " -> " + $sqlmem
"Min Memory:" + $srv.Configuration.MinServerMemory.ConfigValue/1024 + " -> " + $sqlmem/2
if($apply){
    $srv.Configuration.MaxServerMemory.ConfigValue = $sqlmem * 1024
    $srv.Configuration.MinServerMemory.ConfigValue = $sqlmem/2 * 1024
    $srv.Configuration.Alter()
    "Configuration Complete!"
}
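To sanity check the math, take a hypothetical 32GB server: the rules reserve 1GB for the OS, 4GB for the 4-16GB range, and 2GB for the 16GB above that, so the script would report a max server memory of 25GB and a min of roughly half that.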

Now it should be noted that I'm not going to tell you which of these scripts is "better". Neither is, really; they're just different ways of approaching the problem. And that's the fun of it. When working in technology, cats can be skinned in multiple ways; we just have to pick one. Whether it's T-SQL, Powershell, C#, or something else, the point is that you want scripts to automate your work.

What was also cool about this was it gave me another exercise to practice Powershell.  After all, practice makes perfect and when learning something, any excuse to make me use it is good.  John and I actually had a conversation shortly after I wrote the script about these sort of opportunities, and we’ll continue to look for ways to challenge each other for learning.

Edit:  Well, that was embarrassing.  Amazing how sometimes you can get the math a little screwy.  I've updated this script as of 9:44 PM MST to correct the math I had backwards.  If you grabbed/reviewed this script beforehand, please make sure to grab the updated version.

Why I Work With SQL Server

Hot on the heels of my NoSQL posts, I wanted to add a counterpoint to the discussion.  After all, even though I see the value of non-relational technologies, I think it’s important not to lose sight of the value relational databases offer.  In the tech world, it’s too easy to chase those squirrels of new tech (though it’s also easy to get stuck in our old patterns as well).  It always helps to take a step back and see the forest for the trees so we can choose the right path for our enterprise.

It is an understood fact that the tech world gets pretty dogmatic:  Oracle vs. SQL Server, Windows vs. Linux, Java vs. C#, etc.  People will dig their heels in about their choices and why those choices are considered “right” when, at the end of the day, each platform is simply a different approach to various higher concepts.  I tend to view most of these debates as Ford vs. Chevrolet and the only real question to answer is what tool is best for the job.

And believe me when I say that I know that this isn’t a groundbreaking opinion, but it is mine.  :)

That being said, we all have good reasons for selecting the platforms we work with.  For relational databases, it's fairly evident that my choice is SQL Server.  Before I get into that, let's first talk about why I lean towards relational over non-relational.  Don't get me wrong, non-relational is an effective tool, but it's still a very young technology.  The platforms for it are still growing and maturing, and they're still missing a lot of the reliability we've come to expect from our relational platforms.

Couple that with the nature of relational databases:  Joins, keys, and constraints do more for us than simply organize data; they provide functionality to implement and control business logic.  Data integrity is extremely important for many applications and a proper database design will provide you with all the rules to keep your data clean and ordered.  Just as with choosing non-relational stores, it's a matter of choosing the appropriate tool for the job.  Sometimes that job requires tight control over your data, something that you just can't get in a NoSQL database.

As for SQL Server as my relational platform of choice, there are a lot of reasons why I favor it over other platforms.  It isn't just because it's what I've worked with (for the record, I've put some serious time into Oracle as well).  There are really three main reasons why I will promote SQL Server as the database I think people should work with.

Maturity

Let's face it, SQL Server has been around for a while and Microsoft has had a lot of time to refine it.  Over the past 15 years I've worked with it, I've seen the addition of lots of neat features that enhance the core RDBMS offering.  At the same time, SQL Server is still a solid relational database and gives users a stable, reliable platform for storing their data.  It's not perfect and I'll be the last person to tell you it is, but it certainly is on par with Oracle and PostgreSQL.

Adaptability

Microsoft has seen the writing on the wall.  Some of it is from their own hand, some of it is how the data world is evolving.  However, "the cloud", in-memory structures, and big data are ubiquitous in today's tech landscape.  Looking at recent versions of SQL Server, it's apparent that Microsoft is trying to mold the product to live in this new realm.  Consider Hekaton, the ability to span databases between Azure and on-premise, and improvements to columnstore (along with other updates).  Microsoft is making investments to keep pace with the changes we're seeing in the larger technology world and I appreciate the vision they have for the product.

Accessibility

This is the big one for me.  The other two basically tell me that, in going with SQL Server, I’m going to have an established RDBMS platform I can rely on along with Microsoft continuing to improve things to keep pace with other products.  What sets SQL Server apart is the fact that it’s so much easier to work with, for both new folks and seasoned professionals.

First, let's look at the fact that it's Windows.  Now while we all lament SQL Server's default settings, the fact is that almost anyone with minimal experience can get their own SQL Server instance up and running in short order.  This means the door is open a little wider for people wanting to get into the database world, even if they don't have supporting skill sets for Linux or hacking the registry.  SQL Server ships with wizards and graphical tools to get folks going.  Just make sure you talk to a professional before getting too far.  :)

And that’s the second thing to talk about.  Maybe I’m biased because I’ve been involved in the SQL Server community for so long, but I’m continually amazed by the amount of free material for training and best practices provided by this community, from blogs to Twitter to a user group likely nearby where you can ask questions of people using SQL Server in your area.  It’s so easy to get started with SQL Server.

Yeah, I know I sound like a fanboy at this point (squee!).  Just so we’re on the level, I am well aware of SQL Server’s flaws.  There’s a lot of things that aren’t perfect or were added a couple versions ago but were never finished up (*cough* Management Data Warehouse).  And let’s not get into what’s in Standard Edition versus Enterprise.  Trust me, I get it.  Even with that, though, I feel that SQL Server is the preferred offering at this point for companies that are looking for a solid relational platform. 

2013 in review, 2014 in anticipation

As is my custom, I want to blog about what I did last year and my goals for the upcoming year.  Since I have more “regular” stuff to blog about, I figured I’d sneak this in on a Friday just to get it up here.  :)  The primary reason I do this is that which gets measured, gets done (hat tip Buck Woody).  2013 was a pretty eventful year, but I didn’t quite get everything I had wanted to.  So let’s look back at…

2013

Speaking – Did pretty well here.  I met all my goals as well as actually speaking at the 2013 Summit.  I love presenting and now it’s a matter of building new presentations.

Blogging  – Not quite an epic fail, but this one was not a passing grade.  I basically blogged regularly through to about March, then fell back to my sporadic pace.  I didn’t even blog for about 3 months over the summer.  I’m very disappointed with myself and can do better here.

Certification – This was a success and was a lot easier than I had anticipated.  I sat down to take four of the five exams cold to evaluate where I stood and ended up passing them.  The one I had to study for (and I knew this going in) was the Data Warehousing exam (70-463).  Pretty happy here, but now it’s a matter of finding other options for certification.

I know a lot of folks have a dim view of certs.  I agree with the premise, that the exams and certifications (mostly) are no substitute for real world experience.  The fact that they can be crammed for removes some of the validity.  But at the same time, I think that they can be a valuable barometer for someone’s knowledge.  I fall in Glenn Berry’s camp here, that certifications are like extra credit.  If I see a cert on a resume, I won’t assume knowledge of the subject matter, but it at least shows me the candidate was willing to make an extra effort to improve themselves.

Secret Project – Ok, this more or less flopped.  Basically, I wanted to either write a book or build a pre-con for database monitoring because I feel there’s a real gap out there.  These are still on my slate of things I want to do, but I did not get that done in 2013.  Boo me.

And that was 2013.  I’d give myself a B for the year, mostly because while I whiffed on some of my goals, I also managed to go beyond on a few things.  In addition, I’ve made some career choices and moved things in that direction.  Which now brings me to….

2014

Several of the goals will be maintenance or the same.  I want to maintain on the presenting, so we’re looking at at least 4 SQL Saturdays and a Summit submission.  Blogging needs to be once a week.  This is a real stretch goal, because it’s harder than it looks, but I need to strive for it.  These two items are the base foundation my career grows from, so I need to keep that foundation healthy.

New stuff for 2014 will be:

  • I’m trying to move my career more towards being a data architect.  So for that goal, I will write more posts on general data concepts and higher level planning.  There will still be plenty of administrative stuff, but I need to broaden my blogging scope.

  • I also need to start reading more on cloud computing, modelling, and larger design concepts.  Right now I’m reading Domain Driven Design and my goal is to read 1 book a month on a related topic.

  • I will identify and start working towards a new certification path for design and architecture.  Not sure what this is, but I’ll have something identified by the end of March and start working on it the rest of the year.

2013 was a year of change in my life.  Not earth shattering change, not seismic shifts, but a definite redirection in my aims.  2014 will be solidifying my plan and starting down that path.  My biggest challenge will be sorting out the things that are new and uncomfortable from those that are the wrong direction.  The question I will continue to ask is “Does this move me in the direction I want to go?”.

 

Divide and Conquer

So far, I've talked about NoSQL's use of key-value pairs for storage along with its partitioned nature.  These two concepts are useful, but you've probably noticed that they can make querying a lot more difficult, particularly for aggregating data.  That's where the third and final concept I want to discuss comes into play: MapReduce, the process that non-relational data stores use to analyze all the data they have stored across their nodes.

At the core, MapReduce isn't some wild concept.  The idea is to marshal all of our compute resources to analyze and aggregate data.  Simply put, there are two phases that a non-relational system will use to aggregate data:

  • Map – On each individual node, collect and aggregate the data that matches our predicate.  Pass this to a higher node.

  • Reduce – Aggregate on a “master” node the values passed from the individual nodes.  Once the work from all the individual nodes has been aggregated, we have our answer.

As with most of this NoSQL stuff, it is not rocket science.  It does, however, make more sense if we visualize it.  Let’s say, for example, we wanted a count of all our sales orders within a sales territory and we had a 3-node NoSQL cluster (platform is irrelevant).  A basic MapReduce tree would look something like this:

(Figure: a basic MapReduce tree for a 3-node cluster)
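
To make the two phases a bit more concrete, here's a tiny Powershell sketch of that same territory count.  Everything in it (the node names, the sample orders) is made up purely to show the shape of map and reduce:

#Hypothetical per-node data: each node holds its own slice of the sales orders
$nodes = @(
    @{ Name = "node1"; Orders = @(@{TerritoryID=3}, @{TerritoryID=1}, @{TerritoryID=3}) },
    @{ Name = "node2"; Orders = @(@{TerritoryID=3}, @{TerritoryID=2}) },
    @{ Name = "node3"; Orders = @(@{TerritoryID=1}, @{TerritoryID=3}) }
)

#Map: each node counts only the orders it holds that match our predicate
$mapped = foreach($node in $nodes){
    @($node.Orders | Where-Object { $_.TerritoryID -eq 3 }).Count
}

#Reduce: the "master" sums the per-node counts into the final answer
$total = ($mapped | Measure-Object -Sum).Sum
"Total orders for territory 3: $total"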

This isn't all that much different than how SQL Server (or another RDBMS, for that matter) would process it.  If we take the same concept and apply it to a basic query from AdventureWorks, we can see the SQL approach to it:

select count(1) from sales.SalesOrderHeader
where TerritoryID = 3

(Figure: the SQL Server query plan for the aggregate query)

You can see from this basic query plan that the only real difference here is that the mapping piece happens on a single node (your instance) rather than multiple nodes.  But otherwise, the process of gathering all the data we need to calculate and then calculating our total aggregate is the same.  It’s the idea of scaling this out over multiple compute nodes that is the major difference.

And scaling is huge.  Three nodes is a piece of cake, but what if we had 10?  Or 30?  Or 100?  Again, the idea is that non-relational data stores can handle terabytes and petabytes of data and you can scale out horizontally to meet those needs.  When we do this, our MapReduce tree can scale to our needs, so we might get something like this:

(Figure: a MapReduce tree with multiple reduce phases)

Notice the multiple reduce steps.  Just as we can divide and conquer horizontally and collect data across that scale, we can also do vertical scaling and have multiple reduce phases to further spread the load.  It's all part of the massively parallel processing model that NoSQL promotes.

There are disadvantages to this, but most of them boil down to the implementation.  Primarily, it's the need in many NoSQL platforms to write your own MapReduce functions using APIs or embedded languages (like JavaScript or platform-specific functions).  Many platforms are working on SQL-like alternatives to ease this learning curve, but there's still a lot of growth and maturity needed here, and I'm not sure how standardized any of these options really are.

Of course, these MapReduce queries aren’t necessarily going to smoke an RDBMS in terms of performance.  It’s all a matter of design and how you apply the tool.  MapReduce processes are designed, primarily, for batch aggregation of massive amounts of data.  Quantities of data that may not necessarily make sense to store in an RDBMS structure.  As we always say in the database admin world, it depends.  The goal here is to understand how this query process works so we can properly apply it to our own implementations.

Wrapping Up

Hopefully this three-parter has been helpful and educational for you.  I’ve certainly learned a lot about non-relational databases over the past 6 months and my eyes have been opened to new ways to manage data.  That’s really what we’re all here for, anyway.  As the world has more and more data to manage, we as data professionals need to find ways to store it, manage it, and make it useful to those who need it.  NoSQL is just another tool in our toolbox, a hammer to go with our RDBMS screwdriver.  Both valuable, if we use them appropriately.

Before I move on, I did want to share some resources I’ve found helpful in getting my head around non-relational databases:

  • Seven Databases in Seven Weeks – I’m actually not done with this, but it’s a great book for getting your feet wet with several different types of databases.  Be warned, this is not for the faint of heart and there’s a LOT you’re going to have to figure out on your own, but it will guide you through some basic concepts and use cases.

  • Martin Fowler’s Introduction to NoSQL – This is a great hour long presentation that talks about a lot of what I’ve covered here from a different angle.  It’s got great information and explains things at a good level.

  • Ebay’s Tech Blog on Cassandra Data Modelling – Getting my head around how you model data in a Big Table structure was really, really hard.  This two parter will help you figure this out.

I also want to thank Buck Woody(@BuckWoody) for opening my eyes to the world of the cloud and big(ger) data.  It was partly from a conversation with him that I went down this path and the learning that has come out of it for me has been tremendous.

Sharing the Load

Businesses have always struggled with the high cost of relational database servers, both from a hardware and licensing perspective, because current RDBMS platforms take the approach that you will have one or two powerhouse servers handling your data workload.  While this has been the commonly accepted practice, the evolution of the cloud and better distributed computing architectures have given rise to new approaches for data storage and processing.  Now businesses are rethinking their approach to infrastructure with these new approaches, finding ways to provide better performance and higher availability all at a lower price point.

Non-relational data stores embrace this new approach with how they cluster.  Designed in response to the single, super-pricey database server, non-relational systems take the approach of clustering many inexpensive servers into one database.  This is sold as having many advantages to businesses, such as:

  • Horizontal scalability, allowing business to better adapt to growing workloads.  Need more power? Just add another commodity hardware server, you’re not trapped by the box you bought last year.

  • Distributed data across many machines, making the database more available and increasing the durability against failure.  Think of it as RAID for your data, so that if one box goes down, the system is still up and you just need to replace the one failed box.

  • Costs are lower and more manageable.  Well, at least more manageable, where you can buy only the computing power you need and grow incrementally over time.  However, there’s a lot of factors (from virtualization to overall hardware costs) that make the total price point versus a beefy relational server fairly comparable.

It’s certainly a departure from how we’ve dealt with relational platforms to date (though I’m super intrigued by Google’s F1 database).  There are some definite advantages, but most tech people look at this and say “Well, that sounds fancy, but does it blend?”  And by blend, we mean work.  The answer has actually been around for some time now and many large businesses already use it in their relational databases:  partitioning.

Partitioning is a concept we’ve had in the relational world for some time now (and feel free to look at my partitioning posts for more info).  Even before the major RDBMS platforms introduced their own partitioning functionality, DBAs would partition data out into different tables and bring them together as a view (commonly known as partitioned views).  What the non-relational folks do is take it a step further, saying that instead of just partitioning your data across different disks, why not distribute it across different computers?  By doing that, businesses not only get to spread the workload across all these computers, but their data availability is no longer dependent on a single OS/application instance.

Of course, what this means is that non-relational data stores still need some way to determine how data is partitioned.  Remember last week when I talked about all non-relational stores being key/value stores?  Well, those keys become even more important because they become the partitioning key (also called a shard key, by some), the value that non-relational platforms use to divvy up the data between its partitions.  Now, most of these datastores will analyze the key values and attempt to balance that data as evenly as possible, but it’s something you need to be aware of when modelling out your data.  A poorly chosen partition key could mean you’re negating your cluster advantage because your data will end up in one or two nodes.  We’ll come back to this in a bit.
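
As a toy illustration of that placement decision, here's a Powershell sketch that hashes a key and takes it modulo the node count.  Real platforms use more sophisticated schemes (consistent hashing, range partitioning, and so on), and every name and value below is made up:

#Route each key to one of three hypothetical nodes by hashing it
$nodeCount = 3
$keys = "customer:1001", "customer:1002", "order:42"

foreach($key in $keys){
    #Hash the key's bytes, then reduce the hash to a node number
    $hashBytes = [System.Security.Cryptography.MD5]::Create().ComputeHash([Text.Encoding]::UTF8.GetBytes($key))
    $node = [BitConverter]::ToUInt32($hashBytes, 0) % $nodeCount
    "{0} -> node {1}" -f $key, $node
}

Pick a bad key (say, a value shared by most of your rows) and the math still works, but all the data lands on the same node, which is exactly the hot-spot problem described above.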

At this point we should talk a bit about the CAP Theorem.  This bit of computer science was introduced back in 2000 and conceptually defines the limitations of a distributed computing system.  The gist of it is that, while we’re trying to accomplish all these wonderful things, any clustered platform cannot guarantee:

  • Consistency (That all the data on all the nodes is the same)

  • Availability (That every request to the system gets a response, even if some nodes are down)

  • Partition Tolerance (That the system will remain up despite the loss of, or loss of communication with, one of its partitions)

Commonly, you'll hear people say that you can only have two out of these three things at any time.  What this means for non-relational platforms is that, since they're all designed to be partition tolerant with their clustered nodes, we have to choose between being consistent and being available.

This is where eventual consistency comes in to play.  See, the problem with our clusters is that we still have the laws of physics involved.  When data is written and that data is distributed across multiple nodes, then there is a matter of time where data has to be communicated across a network and written to disks in different computers.  Most non-relational systems take the availability path here, meaning that data will be offered up from the various nodes, even if it’s not all up to date.  The idea is that it will be eventually consistent, but you might have to wait a bit.

Now our relational brains might kick back on us.  "Shouldn't our data be 100% accurate all the time!?!?"  ACID compliance, right?  It depends!  Once again, we have to ask ourselves what data is being stored in our database.  Much of this detail only needs to be accurate at a point in time within an application; once it gets into the database, we have a little more time before we have to deal with it.  If not, you might consider using a relational database for your data.  It's all about using the right tool for the job.

One big concern coming out of this distributed model is the question of backups.  Non-relational folks will tell you that backups for disaster recovery aren’t really necessary because you have it built in.  However, most DBAs I know (and I’m one of them) will tell you that disaster recovery and high availability are not the same thing.  And because of the distributed nature, backups become really complex.  So you have to ask yourself when designing or implementing one of these solutions, how will you handle this?  In some cases, it might be a simple matter of backing up the files, but my research so far has shown that this requires an operating system file lock in order to keep things consistent.  Which means you stop your application (i.e., no hot database backups).  There might be other alternatives, either available or in development (the vendors of these platforms recognize the gap), but be aware that it will be a question you have to address.

The key takeaway for this post is that the distributed nature of non-relational systems is a huge advantage for them, but it has some hurdles to overcome.  It's simply not a magic bullet for all your data problems.  Again, the theme here is using the appropriate tools for the appropriate purposes.  Recognizing the strengths and weaknesses is key to knowing when you should use a tool and, because NoSQL is a hot trend, it's vital to understand it in order to properly guide your company's conversations around the technology.

Next up, I want to talk about how non-relational systems leverage this dispersed architecture for querying and analyzing data.  It’s a strong offering, but not always fully understood.  Hopefully these last two posts are getting you interested in this realm of data management, but remember I’m only touching the surface here.  NoSQL is as deep (though not nearly as mature) as the RDBMS world and there’s a lot more learning than I can provide here.