Tuesday, July 9, 2013

T-SQL Tuesday #44 How to Take Down Prod in 30 Seconds

Hello Dear Reader!  Welcome to my blog on T-SQL Tuesday #44 Second Chances.  I’m hosting this month, and we are writing all about second chances.  My second chance comes from the not to distant past.

One day the DBA team was given a toy frog as some part of swag from a vendor.  We did what any group of grown men would do.  We put a dunce hat on it.  We decided that whoever screwed up next would have it sitting on their cubical wall, and we would pass it around as the next offender appeared.  A fun little way to pass the time and rib one another.  

No sooner had I participated in developing this badge of shame, than I earned it.  The title says it all.  How to take down prod in 30 seconds, but I should clarify.  Not some, not half, but allllllll you’re clustered servers in just 30 seconds.


I have to give a special Thank you to my buddy Dan Taylor (@DBABulldog | Blog), you see I remembered I had the frog.  I had forgotten what I had done to earn it.  It was sitting on the edge of my mind, but no matter how hard I tried I could not remember it.  It was sitting in a fog just out of reach.   An itch that I couldn’t scratch.  A few words out of his mouth and it all came flooding back.  As a good friend we've swapped many stories over the years, without his memory (which is better than mine) I would have had to go with a less interesting tale of woe.

“So Balls”, you say, “How did you screw up?”

Well Dear Reader I had an unfortunate convergence of unexpected anomalies that peaked in a spectacular crescendo of a mistyped password.   Yes a mistyped password.  My second chance would be typing in correctly.  The next best thing is explaining it so you hopefully never have to feel the same pain.

I SOLEMNLY SWEAR I AM UP TO NO GOOD


I had a new production SQL 2008 R2 Instance to install.  Things were going pretty smooth.  I got up to the screen where you punch in the password for the service account, and that’s when it all went wrong.  

I mistyped the password.  GASP, SHOCK, AWE, OTHER SUCH EXPRESSIONS!!!!!

Normally I would agree no big deal, but the next time I punched in the password I didn't get a password error, I got an error informing me that the account was locked.  Enter the series of unfortunate events.

Imagine you live in a world where all of the Prod servers are using the same service account.  Imagine that you've suggested this be changed but it ended up on the “That’s a good idea we’ll tackle that another day” pile.  Imagine that you are not using Microsoft Clustering for your Clustered servers, and that the inventive Server Engineers rolled their own “health check”.  Imagine that your current password policy locks out when you mistype the password somewhere between 3-8 times.

“But Balls”, you say, “You only typed your password once?  Not 3-8 times!”

Exactly.  There’s a bug in the installer for SQL Server 2008 R2.  When you click the next button after filling out the service account information, you authenticate at least twice for every account you type in.  Not so in SQL 2005 or SQL 2008 (not R2).  But in SQL 2008 R2 one mistyped password counts a whole lot more.  Depending on the services being installed, enough to lock out an account.

Then you are left to watch the manual health checks fail because the account is locked out, attempt a cluster failover, only to be locked out on the other side because the SQL Service account was locked out.

You catch your error quickly.  Run to the Team Lead, report what has happened, hoping this can get fixed before the inevitable outages begin.  Then you race back to your desk.  You have an uncomfortable phone call to place to the help desk.

Imagine that while this unfolds you are waiting on hold for the help desk to open a ticket (you have to follow protocol), that will get assigned to an engineer, who will pass it on to AD Services.  Queue the uncomfortable elevator music.

Co-workers scrambling in the back ground, like the bull pin of a busy newspaper.  Someone is keeping an active wipe board of what servers are now down, every minute someone in your cube starting to say “Have you….” Only to be cut off by your response “Still on Hold”.  Queue the music.

Other co-workers are fielding calls from App Teams reporting that their applications are offline.  Other co-workers trying to reach managers that can bypass a well-orchestrated bureaucratic separation of duties that results in elevator music while you are still on hold. Did I mention being on hold?  While on hold forty-five minutes can feel like weeks.

The saving grace (for my job), the bug I found was easily to duplicate.  It was easy to see that this behavior was not in previous versions.  As an added bonus those service accounts started becoming unique real quick.

DEMO: THE BUG I LEARNED ALL ABOUT

We’ll skip ahead a bit.  Say you are installing SQL Server 2008 R2.  We’ve gotten up to the Server Configuration where we are punching in our passwords.  First let’s open up our Event Viewer, click on our Security Tab and clear it out. 


*If this were anything other than my personal VM I would backup the log so we could restore it, do not clear out a security log on a prod server without proper guidance.


Now the only event in our log is the event denoting that our log has been cleared.  Back to SQL Server. 
 

We will click on the Use the same account for all SQL Server services button and type in our .\s-sqlsrv service account.  Definitely not following best practices here.  SQL Engine, SQL Agent, and SSIS all getting the same service account. 



Let’s Type the password in wrong and see what happens?  Click OK.  Click Next.



SQL reacted just like we thought.  Theoretically we should have 1 bad login check right?  The same user name was in use, we don’t need to validate it 3 more times.  One should do.  Perhaps at most we’ve got three validation checks right?


Let’s head over to our trusty error log and see. 

We’ve gone from 1 to 13 errors in the click of a button.  How many failed logins do we have?  Not 1, 2, 3, 4, 5, 6, 7, but 8 failed logins from one attempt.  You’ll get this if you use the button or if you do not use the button.

You may be asking did this get fixed in SQL 2012?



One look at the installer and you can see the button is gone.  Let’s punch in the same service account name and an incorrect password.



And now on to our error log.



Wow!  Six entries, now we are looking at 3 entries per account.  Nope didn’t get any better.

WRAP IT UP

Long story short, make sure those passwords are correct.  Personally I like to use a utility like KeePass to generate, store, and copy my passwords from.  Anything that keeps me from typing.  Or as the case may be mistyping J.

As always Dear Reader, Thanks for stopping by!

Thanks,

Brad






No comments:

Post a Comment