Friday, October 26, 2012

DBA Question of the Day: Statistics




http://www.flickr.com/photos/_barney/5177975707/
Hello Dear Reader, yesterday I posed the question: what are Statistics?  We could get down and dirty with how the internals of SQL Server use Statistics (and we will), but first let’s talk about the concept, because it is at heart a very simple one.

In America we are in my favorite season, Fall.  The weather is cool but not too cold.  The leaves turn beautiful colors, and the smell of wood burning in a fireplace or fire pit, or the general smoky scent that goes with the great outdoors this time of year, always springs to mind.



Unfortunately we are also in my least favorite time of every four years, Election Season.  But Election Season does tie nicely into our subject of the day.  Statistics are nothing but POLLS!  In election season they say things like, “Do you like this Candidate?  Do you like this particular issue? Do you believe they kill puppies when people aren’t looking?”, and we get a breakdown of Yes, No, and Undecided.

In SQL Server Statistics are SQL’s way of Polling our Data. 

GATHERING STATISTICS:
SQL Server: “I See you are a column Called First Name”

Registered Column: “Yes I am”

SQL Server: “Mind if I ask you a few questions (QUERIES) and create a Poll based off of how you answer?”

Registered Column: “Go Right Ahead”


Breaking News, Folks: Registered Column has 18% of his data between Alex and David, only 5% between David and Nick, and a WHOPPING 77% between Opie and Zachary!

USING STATISTICS:
Now the next time we ask a question (QUERY) we (THE QUERY OPTIMIZER) have an expected number of people in a particular demographic (VALUE RANGE).  We know that if our candidate wants to know how all the Opie’s through Zachary’s will answer a question (QUERY), we can plan how best to collect that information (AKA HOW THE QUERY OPTIMIZER CREATES A QUERY PLAN).  We can then figure out how many people we need to send out (WORK THAT NEEDS TO BE DONE: SORTS, SPOOLS, HASHES) in order to collect that data.  For example, we need fewer people to collect data from David to Nick (NESTED LOOP JOIN) than we do to collect data from Opie to Zachary (HASH JOIN).

OUT OF DATE STATISTICS:
Now that we have our Poll, the next time we have a question (QUERY), the folks on the news screen will say: our expected result was 77% when we selected the range of Opie’s to Zachary’s, however we found that only 17% actually resided there.

Our population was moving (DATA WAS BEING UPDATED/INSERTED/DELETED) and our Statistics were not up to date.  If we had a plan to collect our data (QUERY PLAN) using a lot of people to go out in the community and collect polls, we may have sent out too many and over-allocated our resources (PICKED A BAD PLAN, I.E. HASH JOIN INSTEAD OF NESTED LOOP).  If we still had 77% of Opie’s to Zachary’s our polling plan (ESTIMATED ROWS RETURNED) would be good, but it wasn’t (ACTUAL ROWS RETURNED).

UPDATING STATISTICS:
So our Statistics were out of whack on our poll.  Something was off.  If we had a big plant closure in our town or a big company laid off a lot of people (PURGE PROCESS ON A TABLE), then we would expect some population shift.  If we knew 20% of people (20% OF ROWS IN A TABLE) were going to be laid off, we could expect that some would move in with other family members or move to find new work.  We would probably send people out in the community to get new polls (AUTO UPDATE STATISTICS) and find out what the new data was.  We found 58% between Alex and David, 25% between David and Nick, and 17% between Opie and Zachary.

Regeneration of Statistics causes us to re-think our plan to Poll Opie’s to Zachary’s (QUERY RECOMPILATION TO GET A NEW QUERY PLAN) in order to send the right number of people out to ask questions (QUERY) and get our candidate some information (GET OUR DATA).  Now we see that we need far fewer people (NESTED LOOP JOIN) to poll Opie thru Zachary than we previously did (HASH JOIN), and our polling plan (QUERY PLAN) reflects that.
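To put a number on how far the population has moved, here is a minimal sketch that compares the rows modified since the last statistics update against the classic auto-update rule of thumb (500 rows plus 20% of the table).  It assumes the dbo.students demo table that shows up later in this post, and it assumes your build has the sys.dm_db_stats_properties function (it arrived in service packs for SQL Server 2008 R2 and 2012).  The ClassicAutoUpdateThreshold column name is made up for this illustration.

```sql
-- Rows changed since each statistic on dbo.students was last updated,
-- next to the classic "500 + 20% of rows" auto-update rule of thumb.
SELECT
     s.name AS StatisticName
     ,sp.last_updated
     ,sp.rows
     ,sp.modification_counter
     ,500 + (0.20 * sp.rows) AS ClassicAutoUpdateThreshold  -- illustrative column
FROM
     sys.stats s
     CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE
     s.object_id = OBJECT_ID('dbo.students');
```

When modification_counter climbs past the threshold, the statistic is a candidate for the next automatic update, provided auto update statistics is turned on for the database.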

GETTING INTERNAL

http://www.flickr.com/photos/photo645a/3995665841/
Now that we have a general idea of how things work, let’s spell it out purely in SQL Server language.   Clustered Indexes and Non-Clustered Indexes automatically have statistics generated for their key columns.  However, there are more columns in a table than just indexed columns, and SQL Server can create statistics on those too.

SQL places those Statistics in an object named a Histogram.  A Histogram contains entries, called steps (there will only ever be a max of 200), that show how data values are spread over a range.  This allows the Query Optimizer, when constructing a plan, to say, “Statistics, I’m going to run this query on this table; how many rows can I expect to get back?” and then plan accordingly.

We have the following table named Students with columns StudentID, SSN, FirstName, LastName, MiddileInitial, BirthDate, and Gender.  It has a Clustered Index on StudentID (no debate on indexes right now, this is just a demo :) ).

*All of the code to create the Students table, along with other tables, and to generate random data was uploaded to my resources page yesterday as a part of my Trimming Indexes, Getting Your Database In Shape presentation.  Download that code and play around with it however you like!

                create table students(
                   studentID int identity(100000,1)
                   ,ssn char(9)
                   ,FirstName varchar(50)
                   ,MiddileInitial char(1)
                   ,LastName varchar(100)
                   ,BirthDate datetime
                   ,Gender char(1)
                   ,constraint pk_students_studentID primary key clustered (studentID)
                   )

If we insert a couple of rows into this table (*go get the code!) and then go look in SSMS, here is what we find.


We see that we have statistics created for my primary key.  If you right click on the Statistics and open them up and then click on Details you will see a whole host of information.  You can see when the statistics were generated, when they were last updated and what the range is.
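If you would rather ask with a query than click through the SSMS tree, a sketch like the one below lists every statistic on the table and when it was last updated.  It assumes the dbo.students table created above; the column aliases are just for readability.

```sql
-- One row per statistic on dbo.students, with its last update time.
SELECT
     s.name AS StatisticName
     ,s.auto_created      -- 1 when SQL Server built it on its own
     ,s.user_created      -- 1 when someone ran CREATE STATISTICS
     ,STATS_DATE(s.object_id, s.stats_id) AS LastUpdated
FROM
     sys.stats s
WHERE
     s.object_id = OBJECT_ID('dbo.students');
```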



You can see that my Average Length is 4.  That is because my Primary Key on column StudentID is an INT, a 4 byte fixed length value.  In the histogram you can see each RANGE_HI_KEY (the upper bound of a step) next to its RANGE_ROWS (the number of rows that fall between the previous step and this one).

For each of my 200 steps you can also see how many distinct values fall in that range, DISTINCT_RANGE_ROWS.
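The same details can be pulled up without the dialog.  A minimal sketch, assuming the dbo.students table and the pk_students_studentID statistic from the CREATE TABLE above:

```sql
-- Header: last updated date, row count, rows sampled, number of steps.
DBCC SHOW_STATISTICS ('dbo.students', pk_students_studentID) WITH STAT_HEADER;

-- Histogram: one row per step, showing RANGE_HI_KEY, RANGE_ROWS,
-- EQ_ROWS, DISTINCT_RANGE_ROWS, and AVG_RANGE_ROWS.
DBCC SHOW_STATISTICS ('dbo.students', pk_students_studentID) WITH HISTOGRAM;
```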

If I said to the Query Optimizer

SELECT
     studentid
FROM
     dbo.students
WHERE
     studentid between 104030 and 108969

I would expect to get back 4940 rows, BUT my statistics are OUT OF DATE and do not reflect that.  So when I execute my query, and include the actual execution plan, this is what I get back.



My option at this point is to update my statistics.

     UPDATE STATISTICS dbo.students pk_students_studentID WITH FULLSCAN

And now my query plan looks like this.



As you can see, the Optimizer expected the number of results it got back.  In my Query Plan (a simple, trivial one), the statistics did not shift my outcome.  But had I joined on the Courses table or the Grades table it could have completely changed my plan.
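If you want to compare estimated and actual row counts without the graphical plan, one option is SET STATISTICS PROFILE.  This is only a sketch against the demo table above; the extra result set it returns carries a Rows column (actual) and an EstimateRows column (what the optimizer expected) for each operator.

```sql
-- Return the plan as a result set, with actual and estimated rows per operator.
SET STATISTICS PROFILE ON;

SELECT
     studentid
FROM
     dbo.students
WHERE
     studentid BETWEEN 104030 AND 108969;

SET STATISTICS PROFILE OFF;
```

Run it before and after the UPDATE STATISTICS statement and compare Rows to EstimateRows.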


TWITTER IT UP

So the question on Twitter yesterday that spawned all of this was: should I delete old statistics?  My answer to that is no.  You should update them.  The Histogram is not normally a big space-consuming object.  Statistics are not like unused Indexes.  Unused indexes incur I/O; they must be maintained as the base structure is updated.  This costs your system.  Statistics just offer the query optimizer a path; if the statistics are old and the range is still valid, leave them be.

Whenever a query comes along you will save the optimizer the trouble of regenerating them.  If they are not there the optimizer has to create them, but that is an example for another day.
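In practice, "update, don't delete" can be as small as the sketch below.  It assumes the dbo.students demo table; which option fits depends on how big your tables are and how much maintenance window you have.

```sql
-- Refresh every statistic on one table using the default sampling.
UPDATE STATISTICS dbo.students;

-- Or sweep the current database; sp_updatestats skips statistics
-- whose underlying rows have not been modified.
EXEC sp_updatestats;
```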

As always Thanks for stopping by!

Thanks,

Brad

Thursday, October 25, 2012

DBA Question of the Day


When I used to work in an office I had a stack of flash cards, and occasionally I'd grab a few, see if I still knew them and the answers, and walk around and discuss them with some of the other DBA's.


At Pragmatic Works I do this on our DBA DL list and I've been debating doing an occasional Question of the day series.  This Blog is inspired by a #SQLHELP conversation that I just saw my friend Mike Fal (@Mike_Fal| Blog) have regarding this very subject.  So here we go, first topic we will tackle in the old bag of flash cards:


What are Statistics and how are they used?  


Answer tomorrow.


Thanks,

Brad

Wednesday, October 24, 2012

PASS DBA Virtual Chapter Deck and Demo's Live

Hello Dear Reader, I wanted to say a quick Thank You to the PASS DBA Virtual Chapter for having me  today.  My Deck and Demo's are up on the Resource Page.  Thank You to the 150 people who took the time to attend as well!  We couldn't do this without you.

Thanks Again,

Brad

Tuesday, October 23, 2012

PASS DBA Virtual Chapter Trimming Indexes, Getting Your Database in Shape

Hello Dear Reader!  Tomorrow at 12 noon eastern I’ll be presenting for the PASS DBA Virtual chapter, click here to sign up for the meeting.

If you aren’t a member of, or familiar with, PASS, it is the Professional Association for SQL Server.  PASS puts together great things for us throughout the year, like SQL Saturdays, which are put together by local PASS User Groups (click here to find the one in your area), groups that meet FREE monthly and have presentations on different SQL topics; the PASS Summit (Largest SQL Server conference in the world!); 24 Hours of PASS (free 12 hour spans of great training); and the PASS Virtual Chapters.  Virtual Chapters span 16 different subjects and 3 different languages, soon to be four different languages!

Joining PASS doesn’t cost you a dime, and I don’t get a penny for it, but it opens the door to a large amount of free technical content and training.  If you are not familiar I’d encourage you to click on the above links and become familiar with PASS today!

“So Balls”, you say, “What is this presentation you’re doing?”

Glad you asked Dear Reader!  I’m presenting for the DBA Virtual Chapter, 1 of the 16, and my subject is Trimming Indexes, Getting Your Database in Shape.

I’LL TAKE THE #2 SUPER SIZED

Here’s the abstract and then we’ll talk a little more:

Indexes are a wonderful thing. We should be using them, and we should be maintaining them. But over time our production databases start to look a little pudgy around the mid-section. Maybe they are a little bloated with Unused Indexes, maybe they have Duplicate Indexes, and possibly even Reverse Indexes. The first step to fixing these problems is to see if you have them, and if you do, the second is to set about fixing them. You could be costing yourself CPU cycles, IOPS, and space and never even know it.

If you’ve been a DBA for a while you will inevitably inherit a system where you find indexes being used in less than optimal ways.  A lot of this is created by turnover in a company, going with all of the suggestions from DTA (Database Tuning Advisor), or having too many cooks in the kitchen.

It is possible to get things like Reverse Indexes, Duplicate Indexes, and unused Indexes.  You may be asking,  “What do those terms mean?  What secret ninja SQL Language are you speaking?  I know Clustered and Non-Clustered, but what-in-the-sam-hell is a REVERSE index!?”

It’s alright Dear Reader, no new secret terms.  A Duplicate index is just an Index where the physical structure exists more than once on a table.  Take the following Table:

CREATE TABLE Students(
          studentID int identity(1,1) primary key clustered
          ,ssn char(9)
          ,firstName varchar(50)
          ,middleInitial char(1)
          ,lastName varchar(100)
          ,gender char(1)
          )

If we created a Non-Clustered Index on the SSN column and we called it nclx_Students_SSN, and then someone else made a Non-Clustered Index on the SSN column and called it nclx_Students_SSN2 we would have a duplicated index.
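Spelled out against the Students table above, that scenario looks like the sketch below.  The index names come straight from the paragraph; both statements succeed because only the names have to be unique.

```sql
-- The original index on SSN.
CREATE NONCLUSTERED INDEX nclx_Students_SSN ON Students (ssn);

-- Someone else adds the same key under a different name.
-- SQL Server allows it, and now both copies must be maintained.
CREATE NONCLUSTERED INDEX nclx_Students_SSN2 ON Students (ssn);
```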

“But Balls”, you say, “I would never do that!”

Of course not, you wouldn’t ever do that on purpose.   As code gets migrated from Dev to Production perhaps the Developer or Jr DBA adds an index that they didn’t realize you already have in place.  Or maybe you get a query plan with a “Missing Index hint” in Dev, only that index had been created as an urgent Production change, and never got implemented in Dev.  Migration comes around and as long as the names are different, WHAMO, you have two Non-Clustered Indexes on your SSN Column.

This example might not seem that bad, but imagine a 50 column table with a duplicate Non-Clustered Index on 5 columns, 10 columns, or 15 columns.  That’s a lot of extra data having to be persisted to disk and maintained.

"I wish my abs..I mean... databases were in shape"
Using the previous table let’s now make a Reverse Index.  We’ll create a Covering Non-Clustered index for a stored procedure that requires the SSN, FirstName, and LastName fields.  Some time farther down the road you’ve left that company, and a new developer writing a different stored procedure creates their own Non-Clustered Index on LastName, FirstName, and SSN: the same columns in the reverse order.
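As a sketch against the Students table above, the pair would look something like this.  Both index names are made up for the illustration.

```sql
-- Built for the first stored procedure: SSN, then first name, then last name.
CREATE NONCLUSTERED INDEX nclx_Students_SSN_First_Last
     ON Students (ssn, firstName, lastName);

-- Built later by someone else: the same three columns, key order reversed.
CREATE NONCLUSTERED INDEX nclx_Students_Last_First_SSN
     ON Students (lastName, firstName, ssn);
```

Each index stores the same three columns, so every insert, update, and delete on those columns pays the maintenance cost twice.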

Then you have Unused Indexes.  These are the indexes that seemed like a good idea to build, but nobody is using them.  In some places code gets retired but we still need the database structures; in the Data Modeling phase Indexes were designed that were never used; or Database Tuning Advisor recommended an index and it just wasn’t used.

Finding these is important because we are maintaining them, but the slackers do not contribute to our query performance.
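One way to spot the slackers is the index usage DMV.  This is a minimal sketch, not the script from the presentation; the Reads and Writes aliases are mine, and keep in mind the counters reset when the instance restarts, so judge them against how long the server has been up.

```sql
-- Non-clustered indexes in the current database, least-read first.
-- An index with no row in the DMV has not been touched since the last restart.
SELECT
     OBJECT_NAME(i.object_id) AS TableName
     ,i.name AS IndexName
     ,ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) AS Reads
     ,ISNULL(us.user_updates, 0) AS Writes
FROM
     sys.indexes i
     LEFT JOIN sys.dm_db_index_usage_stats us
          ON us.object_id = i.object_id
          AND us.index_id = i.index_id
          AND us.database_id = DB_ID()
WHERE
     OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
     AND i.type_desc = 'NONCLUSTERED'
ORDER BY
     Reads, Writes DESC;
```

Indexes with zero Reads and plenty of Writes are the ones costing you maintenance without helping any query.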

WRAP IT UP

So our goal for the hour will be to discuss Indexes, make sure that we have a good foundation in them and what they store so we can understand why these 3 types of indexes are bad, and then use some scripts and DMV’s to identify them.
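As a taste of that kind of script, here is a rough sketch for finding duplicates.  It is my simplified version, not the one from the deck: it compares key columns and their sort direction only, ignoring included columns and filters, and it will also pair a non-clustered index with a clustered index that has the same key.  The IndexKeys and KeyList names are made up for the illustration.

```sql
-- Build a comma-separated key signature per index, then self-join on it.
WITH IndexKeys AS (
     SELECT
          i.object_id
          ,i.index_id
          ,i.name
          ,(SELECT CAST(ic.column_id AS varchar(10))
                   + CASE WHEN ic.is_descending_key = 1 THEN 'D' ELSE 'A' END
                   + ','
            FROM sys.index_columns ic
            WHERE ic.object_id = i.object_id
              AND ic.index_id = i.index_id
              AND ic.is_included_column = 0
            ORDER BY ic.key_ordinal
            FOR XML PATH('')) AS KeyList
     FROM sys.indexes i
     WHERE i.index_id > 0   -- skip heaps
       AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
)
SELECT
     OBJECT_NAME(a.object_id) AS TableName
     ,a.name AS IndexName
     ,b.name AS DuplicateOf
FROM
     IndexKeys a
     JOIN IndexKeys b
          ON a.object_id = b.object_id
          AND a.KeyList = b.KeyList
          AND a.index_id > b.index_id;   -- report each pair once
```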

I'm also doing this presentation in a much longer format for SQL Live 360 in December of this year, as well as a couple more presentations.  Click on this link to check out Live 360!

I hope you’ll get a chance to stop by and join us!

Thanks Again,

Brad

Thursday, October 18, 2012

DBA Study Guide


http://www.flickr.com/photos/caledonia09/4999119065/

 Hello Dear Reader, over here at Pragmatic Works we’ve been growing like weeds.  For the most part we are looking for Sr level people for Sr. Level positions.  Part of that process is interviewing.  Going for a job as a Sr. Consultant is a bit different than going for a job as a DBA. 

Today’s market for DBA’s is quite good.  If you are looking there are jobs out there.  A lot of the time after weeks or months of interviews when the “ideal” candidate has not been found you tend to lower the requirements.   It’s the Animal House “We need the dues” moment. 

Only for a business it is “we need a butt in the seat.”  You start asking the questions: can we find someone with the right attitude, someone who can learn, someone who might not be at the level we want, but whom we can work with?  Often you can find a diamond in the rough and grow that person into the experience level you wanted.

In the Consultant biz it’s a bit different.  You can do that for Jr or Mid level jobs, but Sr level positions require you to really know your stuff.  You cannot expect a person to know everything, and one person’s Sr is another person’s Jr.  Not to mention there is a wide area of DBA expertise to be considered.  But we have to draw a line in the sand, and Knowledge is very important.


Can you answer some of the following questions:

  1. What is a heap?
  2. What is a Clustered Index, a Non-Clustered Index, and what are the differences between the two?
  3. What is a Page Split?  
  4. What is a Forwarding Pointer?
  5. Why do they matter?


If you cannot, then I wanted to toss out some learning resources that cover a wide breadth of areas.  This is similar to the Microsoft Certification exams, where they say know how to Baseline a server: there are a couple of different ways to skin that cat, so I know there are a LOT of different things in each very general area.
(*Note no actual cats were skinned in the process of writing this blog).  

This is just a collection of books that I’ve read over the years.  Some go in depth in particular areas, some are general and cover many.   My buddy Mike Davis (@MikeDavisSQL | Blog) wrote a similar list for BI folks if you are interested in that click here to read more.

But I wanted to toss them out so if you are looking for a good book you can find one.  Just looking to grow in a particular area?  Then these will help you as well.


Internals:  If you are looking for a book on Internals you cannot go wrong with Kalen Delaney (@SQLQueen | Blog).  The 2012 Internals book is due out in November, and I can’t wait to read it.  The current book has many wonderful contributors and is well worth the money even though a new one is on the way.  I cannot recommend this book enough.

Internals/Extended Events/Troubleshooting:  Christian Bolton (@ChristianBolton | Blog) put together an All-Star team for this book (a 2012 edition is due out soon as well).  It not only covers internals but tools to diagnose them from some of the Premier experts in the field.  I put this neck and neck with any book.  If you work with SQL Server 2008/R2 you should own a copy.

Query Tuning:  Grant Fritchey (@GFritchey | Blog) is a damn nice guy.  I don’t understand why people think he’s a Scary DBA, (Grant thanks for the advice on the Katana collection and sharpening swords in front of the daughter’s boyfriend before dates, priceless).   I just don’t understand the scary thing at all.  Regardless of his disposition Grant is the guy that wrote the book on Query Tuning and Execution Plans.  He is a master in this field and the only people I would regard higher are the people Grant would recommend.

Clustering:  Allan Hirt (@SQLHA | Blog) is to clustering what Grant Fritchey is to Query Tuning.  I’ve attended Allan’s pre-cons, read his books, and watched him give generous and free advice via #SQLHelp.  If you are working in clustering you should have Allan’s book; it will point out best practices and save you headaches (I’m looking at you, government SOC’s Image, when setting up a 2008 Cluster).

Replication: I wanted to recommend a replication book; however, I haven’t purchased this one.  My friend and co-worker Chad Churchwell (@ChadChurchwell | Blog) is one of the smartest replication guys I've ever met and he recommends it.  I’m making the recommendation because of Chad, and because I have done more replication as a Consultant than I did as a DBA.  I’ve set it up, I’ve fixed it, I’ve learned how to find out when it’s broken, what broke it, and why.  I’d also bet I’m not alone.  I’ve only read the free preview of the book, and chapter-wise it summarizes everything I’m looking for in an Expert in Replication (other than experience).

Mirroring:  I would say Robert Davis (@SQLSoldier | Blog) is to Mirroring what Grant and Allan are to their respective subjects.  Robert has blogged incredibly useful, real world information about mirroring.  AND YES, I understand that Always On Availability Groups are the way to go.  However, not everybody is on SQL 2012, and a solid understanding of Mirroring allows you to better understand all the goodness that is Always On Availability Groups.

Hardware and Virtualization:  When it comes to hardware you don’t get much better than Glenn Berry (@GlennAlanBerry | Blog).  His free Assessment Scripts on SQL Server Performance (Glenn's are here) are essential when you go onto a new server for the first time and try to holistically figure out what is going right and wrong.  The first chapter of his book alone taught me more about CPU’s and which to choose than years of experience had.  I was able to use this knowledge immediately.

Performance Indexing: Jason Strate (@StrateSQL | Blog) and Ted Krueger (@Onpnt | Blog) are incredibly smart guys.  SQL MVP’s, years of experience, and deep knowledge all combine to give you an answer to the age old question ‘What should I index and Why?’.  Indexing is a core thing that DBA’s should know about.  Adding, removing, finding good ones, and identifying bad ones are all important.  Not to mention the answers to all of my previous questions are in this book.

SQL Server 2012/SQL Azure/Powershell:  I work with some pretty smart guys: SQL MVP’s and Consultants, and their friends are just as smart.  These two books are a collaborative effort between brilliant people: Adam Jorgensen (@AJBigData | Blog), Brian Knight (@BrianKnight | Blog), Jorge Segarra (@SQLChicken | Blog), Patrick Leblanc (@PatrickDBA | Blog), Aaron Nelson (@SQLVariant | Blog), Julie Smith (@JulieChix | Blog)…And MORE (sorry for the people I left out)!  If you are looking for information on SQL 2012 and how to use it, go to the Bible and their other book on Professional Administration.

 WRAP IT UP!

A lot of books, I know, and no, I don’t expect you to read all of them before an interview, but there are a lot of common themes in the world of SQL Server.  A good expert should be EXCITED about what they learn.  They should be able to pick something and tell me what they know, and I’d like them to do it in a way that has me excited about it by the time they finish.

I love going to SQL Saturdays, PASS Events, and Conferences because they make me excited to learn.  And I really love to learn.  Find something that you are passionate about, and learn it really well.  That kind of learning and passion is infectious and is exactly what makes all of the authors I’ve mentioned such great SQL Server professionals.

Hopefully, whether you’re looking for a job or not, it will help you find something that you love to learn about.

Thanks,

Brad