Friday, November 9, 2012

PASS Summit Page & Row Compression Deep Dive

Hello Dear Reader!  Today is the Big day for me here at the PASS Summit 2012.  I'll be presenting the Page & Row Compression Deep Dive today from 9:45 am until 11:00 am in room 305-TCC.

The last day of the Summit is upon us, and with it almost over I get more and more anxious to get back home and see my family.  Soon the weekend will be here and, for most of us, we will no longer be in Seattle.  I've got some good stuff cooking just to make sure that you have plenty of compression goodness to play around with.

"So Balls", you say, "You did this presentation last year, why should I attend?"  

Great Question Dear Reader!  Away we go!




PAGE & ROW COMPRESSION DEEP DIVE

First let's start with the abstract:

Page and row compression are powerful new tools. Introduced in SQL Server 2008 RTM and enhanced with Unicode compression in SQL Server 2008 R2, the compression tools now work with SQL Server 2012's spatial types. We can turn compression on, and we can turn it off – but we want more! 

What are access methods and how in the SQL Engine do they affect compression? What are the “gotchas” of page compression? How does compression differ in the way it treats leaf and non-leaf level pages? What additional functionality did we get in DBCC Page, DMVs, Perfmon Counters, and Extended Events to support compression? And how do complex query plans affect compression? Come to this session to find out!
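Just to level-set before the session: turning compression on and off really is only a couple of statements.  Here's a minimal sketch against a hypothetical dbo.SalesOrders table (the name is just a stand-in for one of your own):

-- Hypothetical table name; swap in one of your own.
-- Estimate the savings before you commit to a rebuild.
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo'
     ,@object_name = 'SalesOrders'
     ,@index_id = NULL
     ,@partition_number = NULL
     ,@data_compression = 'PAGE';

-- Turn page compression on...
ALTER TABLE dbo.SalesOrders REBUILD WITH (DATA_COMPRESSION = PAGE);

-- ...and turn it back off.
ALTER TABLE dbo.SalesOrders REBUILD WITH (DATA_COMPRESSION = NONE);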


We did a lot of very cool and deep things last year, but as the year has gone along I've kept digging down to find some really cool new stuff for us to look at.  The same material is still great, but this year I want to push it even deeper.

I've got 3 things to tempt you Dear Reader:

1. We will go deeper into the structure of a record this year
2. We will discuss Mid & End Page Splits and what they mean for Compression
3. We will tackle Statistics in Compressed Indexes and how Page Compression attempts affect them.

I've got new demos that I'll be rolling out.


My slide deck and demos are already live on the PASS website; click here to get them.  They are also on my Resources page: click here and go down to the PAGE & ROW COMPRESSION DEEP DIVE header.  Under that I've got a section for PASS Summit 2012 with the latest version of the slide deck as well as all of the demo scripts!

Have a great last day and safe travels home Dear Reader, I hope to see you this morning!

Thanks,

Brad

Wednesday, November 7, 2012

SQL PASS Summit 2012 Keynote Day 1

Hello Dear Reader!  I'm sitting in the Keynote now writing up a summary of the event.  More to come!

Friday, October 26, 2012

DBA Question of the Day: Statistics




http://www.flickr.com/photos/_barney/5177975707/
Hello Dear Reader, yesterday I posed the question to you: what are Statistics?  We could get down and dirty with how the internals of SQL Server use Statistics (and we will), but first let's talk about the concept, because it is at heart a very simple one.

In America we are in my favorite season, Fall.  The weather is cool but not too cold.  The leaves turn beautiful colors, and the smell of wood burning in a fireplace or fire pit, or the general smoky scent that goes with the great outdoors this time of year, always springs to mind.



Unfortunately we are also in my least favorite time that comes every 4 years, Election Season.  But Election Season does tie nicely into our subject of the day.  Statistics are nothing but POLLS!  In election season they ask things like, “Do you like this Candidate?  Do you like this particular issue?  Do you believe they kill puppies when people aren’t looking?”, and we get a breakdown of Yes, No, and Undecided.

In SQL Server Statistics are SQL’s way of Polling our Data. 

GATHERING STATISTICS:
SQL Server: “I See you are a column Called First Name”

Registered Column: “Yes I am”

SQL Server: “Mind if I ask you a few questions (QUERIES) and create a Poll based off of how you answer?”

Registered Column: “Go Right Ahead”


Breaking news, folks: Registered Column has 18% of his data between Alex and David, only 5% between David and Nick, and a WHOPPING 77% between Opie and Zachary!

USING STATISTICS:
Now the next time we ask a question (QUERY) we (THE QUERY OPTIMIZER) have an expected number of people in a particular demographic (VALUE RANGE).  We know that if our candidate wants to know how all the Opie’s through Zachary’s will answer a question (QUERY), we can plan how to best collect that information (AKA HOW THE QUERY OPTIMIZER CREATES A QUERY PLAN).  We can then figure out how many people we need to send out (WORK THAT NEEDS TO BE DONE: SORTS, SPOOLS, HASHES) in order to collect that data.  For example, we need fewer people to collect data from David to Nick (NESTED LOOP JOIN) than we do to collect data from Opie to Zachary (HASH JOIN).

OUT OF DATE STATISTICS:
Now that we have our Poll, the next time we have a question (QUERY), the folks on the news will say: our expected result was 77% when we selected the range of Opie’s through Zachary’s, however we found that only 58% actually resided there.

Our population was moving (DATA WAS BEING UPDATED/INSERTED/DELETED) and our Statistics were not up to date.  If we had a plan to collect our data (QUERY PLAN) using a lot of people to go out in the community and collect polls, we may have sent out too many and over-allocated our resources (PICKED A BAD PLAN, I.E. HASH JOIN INSTEAD OF NESTED LOOP).  If we still had 77% of Opie’s to Zachary’s, our polling plan (ESTIMATED ROWS RETURNED) would be good, but it wasn’t (ACTUAL ROWS RETURNED).

UPDATING STATISTICS:
So our Statistics were out of whack on our poll.  Something was off.  If we had a big plant closure in our town or a big company laid off a lot of people (PURGE PROCESS ON A TABLE), then we would expect some population shift.  If we knew 20% of people (20% OF ROWS IN A TABLE) were going to be laid off, we could expect some would move in with other family members or move to find new work.  We would probably send people out in the community to get new polls (AUTO UPDATE STATISTICS) and find out what the new data looked like.  We found 58% between Alex and David, 25% between David and Nick, and 17% between Opie and Zachary.

Regeneration of Statistics causes us to re-think our plan to poll Opie’s to Zachary’s (QUERY RECOMPILATION TO GET A NEW QUERY PLAN) in order to send the right number of people out to ask questions (QUERY) to get our candidate some information (GET OUR DATA).  Now we see that we need far fewer people (NESTED LOOP JOIN) to poll Opie through Zachary than we previously did (HASH JOIN), and our polling plan (QUERY PLAN) reflects that.
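Stepping out of the analogy for a second: whether SQL Server is allowed to send those pollsters out on its own is a database-level setting.  A quick sketch to check it (nothing here is specific to my demo database):

-- Is SQL Server allowed to update (and create) statistics on its own?
SELECT
     name
     ,is_auto_update_stats_on
     ,is_auto_create_stats_on
FROM sys.databases
WHERE name = DB_NAME();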

GETTING INTERNAL

http://www.flickr.com/photos/photo645a/3995665841/
Now that we have a general idea of how things work, let’s spell it out purely in SQL Server language.  Clustered Indexes and Non-Clustered Indexes automatically have statistics generated for their key columns.  However, there are more columns in a table than just indexed columns.

SQL places those Statistics in an object called a Histogram.  A Histogram contains entries, called steps (it will only ever have a max of 200 of them), that show how data values are spread over a range.  This allows the Query Optimizer, when constructing a plan, to say, “Statistics, I’m going to run this query on this table; how many rows can I expect to get back?” and then plan accordingly.

We have the following table named Students with columns StudentID, SSN, FirstName, MiddleInitial, LastName, BirthDate, and Gender, and a Clustered Index on StudentID (no debating indexes right now, this is just a demo :) ).

*All of the code to create a Students table, along with other tables, and generate random data was uploaded to my resources page yesterday as a part of my Trimming Indexes: Getting Your Database In Shape presentation.  Download that code and play around with it however you like!

                create table students(
                   studentID int identity(100000,1)
                   ,ssn char(9)
                   ,FirstName varchar(50)
                   ,MiddleInitial char(1)
                   ,LastName varchar(100)
                   ,BirthDate datetime
                   ,Gender char(1)
                   ,constraint pk_students_studentID primary key clustered (studentID)
                   )

Now we insert a couple of rows into this table (*go get the code!) and then go look in SSMS.
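The real random-data generator is in the download above; if you just want a handful of rows to poke at, a minimal stand-in insert (with made-up values, obviously) would look something like this:

-- Hypothetical sample rows; the resources-page script generates real volume
INSERT INTO dbo.students (ssn, FirstName, MiddleInitial, LastName, BirthDate, Gender)
VALUES
     ('111223333', 'Alex', 'B', 'Anderson', '1990-03-14', 'M')
     ,('222334444', 'David', 'C', 'Davis', '1991-07-02', 'M')
     ,('333445555', 'Nick', 'D', 'Nichols', '1989-11-21', 'M')
     ,('444556666', 'Opie', 'E', 'Olson', '1992-01-30', 'M')
     ,('555667777', 'Zoe', 'F', 'Zachary', '1990-09-09', 'F');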


We see that we have statistics created for my primary key.  If you right-click on the Statistics, open them up, and then click on Details, you will see a whole host of information.  You can see when the statistics were generated, when they were last updated, and what the ranges are.
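If you would rather not click through SSMS, the same information is exposed in the catalog views; here is a quick sketch listing every statistics object on our demo table and the columns it covers:

-- List the statistics on dbo.students and the columns they cover
SELECT
     s.name AS statistics_name
     ,s.auto_created
     ,s.user_created
     ,c.name AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
     ON s.object_id = sc.object_id
     AND s.stats_id = sc.stats_id
JOIN sys.columns AS c
     ON sc.object_id = c.object_id
     AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('dbo.students');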



You can see that my Average Length is 4.  That is because my Primary Key column StudentID is an INT, a 4 byte fixed-length value.  You can see in my ranges what each RANGE_HI_KEY is and how many RANGE_ROWS fall in each step.

For each of my (up to 200) histogram steps you can also see how many distinct values fall in that range in DISTINCT_RANGE_ROWS.
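All of the numbers in that Details tab (the header, the density vector, and the histogram steps themselves) can also be pulled back in T-SQL with DBCC SHOW_STATISTICS:

-- Header, density vector, and histogram for the clustered index statistics
DBCC SHOW_STATISTICS ('dbo.students', pk_students_studentID);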

If I said to the Query Optimizer

SELECT
     studentid
FROM
     dbo.students
WHERE
     studentid between 104030 and 108969

I would expect to get back 4940 rows, BUT my statistics are OUT OF DATE and do not reflect that.  So when I execute my query, and include the actual execution plan, this is what I get back.



My options at this point are to update my statistics.

     UPDATE STATISTICS dbo.students pk_students_studentID WITH FULLSCAN
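
If you want to confirm the update took, or check how stale a statistic is before you bother rebuilding it, STATS_DATE will tell you when each one was last updated:

-- When was each statistic on dbo.students last updated?
SELECT
     name
     ,STATS_DATE(object_id, stats_id) AS last_updated
FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.students');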

And now my query plan looks like this.



As you can see, the number of results the Optimizer expected now matches what it got back.  In my Query Plan (a simple, trivial one), the statistics did not shift my outcome.  But had I joined to the Courses table or the Grades table, they could have completely changed my plan.


TWITTER IT UP

So the question on Twitter yesterday that spawned all of this was: should I delete old statistics?  My answer to that is no.  You should update them.  The Histogram is not normally a big space-consuming object.  Statistics are not like unused Indexes.  Unused indexes incur IO; they must be maintained as the base structure is updated, and this costs your system.  Statistics just offer the query optimizer a path; if the statistics are old and the range is still valid, leave them be.

Whenever a query comes along you will save the optimizer the trouble of regenerating them, because if they are not there it has to create them before it can build a plan, but that is an example for another day.
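If you can't wait for that day, here is a sneak peek at the auto-creation behavior (assuming AUTO_CREATE_STATISTICS is on, which it is by default); the 'Smith' value is just something made up to filter on:

-- Filter on a column with no index; the optimizer builds a column
-- statistic for LastName before it picks a plan
SELECT studentID FROM dbo.students WHERE LastName = 'Smith';

-- The new statistic shows up with a system-generated _WA_Sys_ name
SELECT name, auto_created
FROM sys.stats
WHERE object_id = OBJECT_ID('dbo.students')
     AND auto_created = 1;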

As always Thanks for stopping by!

Thanks,

Brad