
Monday, December 5, 2016

Optimize For Unknown

Hello Dear Reader!  It's been a while.  I was working with a friend and we came across an interesting problem.  They had a large amount of skewness/data skew, and it was causing performance issues for them.  It showed up in a set of queries that normally ran quickly, within seconds, but occasionally ran much longer.  To be precise, about 800 times longer.


As you can imagine this is a less than ideal situation for a production environment.  Their solution was to add OPTION (RECOMPILE) to all of their stored procedures.  That solved the data skew issue, but it brought side effects of its own: increased CPU, because every stored procedure now had to compile a fresh plan before execution; no plan reuse; and the DMVs they used to track stored procedure utilization and statistics no longer had anything to show.
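
For a concrete picture of the kind of change they made, here is a rough sketch (the procedure name with the _recompile suffix is mine for illustration; the tables mirror the demo later in this post).  The important part is the statement-level OPTION (RECOMPILE) hint, which compiles a fresh plan on every execution.

create procedure p_sel_get_stu_name_recompile(@lname varchar(50))
as
begin
       select
              s.FirstName
              ,s.LastName
              ,c.courseName
       from
              dbo.students s
              left join enrollment e
              on s.studentID=e.studentid
              left join courses c
              on e.courseid = c.courseid
       where
              lastname=@lname
       option (recompile)  -- new plan every execution: handles the skew, costs CPU and plan reuse
end
go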


"So Balls", you say, "What is the alternative?  Is there an alternative?  And what in the name of King and Country is skewness/data skew!"

Ahh great question Dear Reader!  Dear Reader why are you English today?

"Because you watched the Imitation Game last night, and Benedict Cumberbatch's voice is still stuck in your head."

Right as always Dear Reader!  Good point let's explain it and then do a Demo!


SKEWNESS/DATA SKEW


Skewness is a term from statistics and probability theory that refers to the asymmetry of the probability distribution of a real-valued random variable about its mean.  This could get complicated quickly.  In simpler terms that I can understand, it means the data is not spread evenly around its average: some values occur far more often than others, and that lopsidedness is the skew.

How does this affect our query plans?  With data skew we have an overabundance of rows for some values and very few for others.  That means the SQL Server Cardinality Estimator, working from the statistics, can produce very different estimates for one value than for another.

Here's a quick example.  I have a school with 100,000 students, and every student has one of 10 different last names.  On average one could assume that each last name covers about 10,000 students.  If we randomly assign these values there will be a slight skewness, but most of the ranges will be similar.  For this example I'll use my students table from my college database.

       select
              lastname, count(*)
       from
              dbo.students
       group by lastname
       order by count(*) desc;



Now we move a new student to the area.  This one student will give us quite a bit of data skew, and will be extremely asymmetrical to the other results.


In order to show this in action we'll make a stored procedure that returns the First Name, Last Name, and Course Name of students by last name.  Remember some students will have multiple courses.  This means we will get back more rows than we have students with that last name.


if exists(select name from sys.procedures where name='p_sel_get_stu_name')
begin
       drop procedure p_sel_get_stu_name
end
go
create procedure p_sel_get_stu_name(@lname varchar(50))
as
begin
      
       select
              s.FirstName
              ,s.LastName
              ,c.courseName
       from
              dbo.students s
              left join enrollment e
              on s.studentID=e.studentid
              left join courses c
              on e.courseid = c.courseid
       where
              lastname=@lname

end

So now we will execute the procedure, compare the query plans, and benchmark the performance.

exec p_sel_get_stu_name 'Bradley' with recompile;



exec p_sel_get_stu_name 'Segarra' with recompile;



The first execution took a little over a second to return two rows.  The second was sub-second and returned 13,843 rows.  Each execution plan was different: one was parallel, the other serial.  That makes sense; the parallel plan returned over 13,000 rows, while the serial plan returned only two.  The statistical distribution of the two values is different, so the cardinality estimates, and therefore the plans, came out different.

Now let's make this bad.  I'm going to run the first query and second query again.  This time I'm removing the with recompile.

exec p_sel_get_stu_name 'Bradley';
exec p_sel_get_stu_name 'Segarra';

The first query did not change.   The second one did.


We used the cached plan.  Because of data skew we forced 13,843 rows through the serial execution plan.  The result was 8 minutes and 42 seconds instead of a sub-second query.
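
If you want to see the reused plan for yourself, here is a rough sketch of a plan-cache query (not from the original post; the DMVs are standard, only the LIKE filter on the procedure name is specific to this demo).  Run it after the two executions above and you should see one cached plan with an execution count of two; run the WITH RECOMPILE versions instead and there is little or nothing here, which is exactly the DMV history the recompile approach gives up.

       select
              st.text
              ,qs.execution_count
              ,qs.last_elapsed_time
              ,qp.query_plan          -- the cached plan, serial or parallel
       from
              sys.dm_exec_query_stats qs
              cross apply sys.dm_exec_sql_text(qs.sql_handle) st
              cross apply sys.dm_exec_query_plan(qs.plan_handle) qp
       where
              st.text like '%p_sel_get_stu_name%';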


This is data skew.  We've shown that recompiling gives each execution its own least-cost plan.  Is there another option?  In this case we could use the query hint OPTIMIZE FOR UNKNOWN.

The benefit of OPTIMIZE FOR UNKNOWN is that we can remove the recompile.  Instead of sniffing the parameter value, the optimizer builds the plan from the average distribution (the density) in the statistics, so we get one cached plan chosen with the overall skew in mind rather than a plan tailored to whichever value happened to run first.


if exists(select name from sys.procedures where name='p_sel_get_stu_name')
begin
       drop procedure p_sel_get_stu_name
end
go
create procedure p_sel_get_stu_name(@lname varchar(50))
as
begin
      
       select
              s.FirstName
              ,s.LastName
              ,c.courseName
       from
              dbo.students s
              left join enrollment e
              on s.studentID=e.studentid
              left join courses c
              on e.courseid = c.courseid
       where
              lastname=@lname
       option (optimize for unknown)
end

exec p_sel_get_stu_name 'Bradley';
exec p_sel_get_stu_name 'Segarra';

Now we execute our procedure again for both parameters.  Here are our new query plans.




You'll notice that the plan chosen from the overall statistical distribution was a parallel plan.  Both queries executed sub-second, even though this is not the least-cost plan for the first query.  In case you were curious, here is a look at the histogram.
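
If you want to pull the histogram up yourself, DBCC SHOW_STATISTICS will do it.  A minimal sketch, assuming the statistics on lastname live in an index named IX_students_lastname (your statistics name will differ):

       dbcc show_statistics ('dbo.students', IX_students_lastname) with histogram;
       -- RANGE_HI_KEY and EQ_ROWS show how many rows each last name is estimated
       -- to have, which is where the skew becomes visible.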


WRAP IT UP

So what does this mean?  For the business purpose of speeding up a query, OPTION (RECOMPILE) is completely valid.

It comes at a cost: constant recompilations, increased CPU utilization, and you lose the execution history of the stored procedure from the DMVs.

If those costs do not affect you, or affect the system less than the fluctuation in query performance does, then it is a valid choice.

OPTIMIZE FOR UNKNOWN is another alternative to keep in your tool belt, and it is what we used today.  Like all things in computers, use it judiciously.  Test, test, and retest before deploying into production.  As always Dear Reader, thanks for stopping by.


Thanks,
Brad 

Thursday, October 18, 2012

DBA Study Guide


http://www.flickr.com/photos/caledonia09/4999119065/

Hello Dear Reader, over here at Pragmatic Works we've been growing like weeds.  For the most part we are looking for Sr.-level people for Sr.-level positions.  Part of that process is interviewing.  Going for a job as a Sr. Consultant is a bit different from going for a job as a DBA.

Today's market for DBAs is quite good; if you are looking, there are jobs out there.  A lot of the time, after weeks or months of interviews when the "ideal" candidate has not been found, you tend to lower the requirements.  It's the Animal House "We need the dues" moment.

Only for a business it is "we need a butt in the seat."  You start asking the questions: can we find someone with the right attitude, someone who can learn, someone who might not be at the level we want but who we can work with?  Often you can find a diamond in the rough and grow that person into the experience level you wanted.

In the Consultant biz it's a bit different.  You can do that for Jr.- or Mid-level jobs, but Sr.-level positions require you to really know your stuff.  You cannot expect a person to know everything, and one person's Sr. is another person's Jr.  Not to mention there is a wide area of DBA expertise to be considered.  But we have to draw a line in the sand, and knowledge is very important.


Can you answer some of the following questions (a quick place to start digging into a few of them is sketched just after the list):

  1. What is a heap?
  2. What is a Clustered Index, a Non-Clustered Index, and what are the differences between the two?
  3. What is a Page Split?  
  4. What is a Forwarding Pointer?
  5. Why do they matter?
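
If you want that hands-on starting point, here is a rough sketch (the table name is a placeholder; run it in the database you care about): sys.dm_db_index_physical_stats will tell you whether you are looking at a heap or an index, how many forwarding pointers a heap has accumulated, and the fragmentation that page splits leave behind.

select
    object_name(ips.object_id) as table_name
    ,ips.index_type_desc                 -- HEAP, CLUSTERED INDEX, or NONCLUSTERED INDEX
    ,ips.forwarded_record_count          -- forwarding pointers (heaps only)
    ,ips.avg_fragmentation_in_percent    -- often the after-effect of page splits
from sys.dm_db_index_physical_stats(
    db_id(), object_id('dbo.YourTable'), null, null, 'DETAILED') as ips;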


If you cannot, then I wanted to toss out some learning resources that cover a wide breadth of areas.  This is similar to the Microsoft Certification exams where they say "know how to baseline a server"; there are a couple of different ways to skin that cat, so I know there are a LOT of different things within each very general area.
(*Note no actual cats were skinned in the process of writing this blog).  

This is just a collection of books that I've read over the years.  Some go in depth in particular areas, some are general and cover many.  My buddy Mike Davis (@MikeDavisSQL | Blog) wrote a similar list for BI folks; if you are interested in that, click here to read more.

But I wanted to toss them out so if you are looking for a good book you can find one.  Just looking to grow in a particular area?  Then these will help you as well.


Internals:  If you are looking for a book on Internals you cannot go wrong with Kalen Delaney (@SQLQueen | Blog).  The 2012 Internals book is due out in November, and I can't wait to read it.  The current book has many wonderful contributors and is well worth the money even though a new one is on the way.  I cannot recommend it enough.

Internals/Extended Events/Troubleshooting:  Christian Bolton (@ChristianBolton | Blog) put together an All-Star team for this book (a 2012 edition is due out soon as well).  It not only covers internals but tools to diagnose them from some of the Premier experts in the field.  I put this neck and neck with any book.  If you work with SQL Server 2008/R2 you should own a copy.

Query Tuning:  Grant Fritchey (@GFritchey | Blog) is a damn nice guy.  I don't understand why people think he's a Scary DBA (Grant, thanks for the advice on the Katana collection and sharpening swords in front of the daughter's boyfriend before dates, priceless).  I just don't understand the scary thing at all.  Regardless of his disposition, Grant is the guy that wrote the book on Query Tuning and Execution Plans.  He is a master in this field, and the only people I would regard higher are the people Grant would recommend.

Clustering:  Allan Hirt (@SQLHA | Blog) is to clustering what Grant Fritchey is to Query Tuning.  I've attended Allan's pre-cons, read his books, and watched his generous and free advice via #SQLHelp.  If you are working with clustering you should have Allan's book; it will point out best practices and save you headaches (I'm looking at you, government SOC image, when setting up a 2008 Cluster).

Replication: I wanted to recommend a replication book; however, I haven't purchased this one.  My friend and co-worker Chad Churchwell (@ChadChurchwell | Blog) is one of the smartest replication guys I've ever met, and he recommends it.  I'm making the recommendation because of Chad, and because I have done more replication as a Consultant than I did as a DBA.  I've set it up, I've fixed it, I've learned how to find out when it's broken, what broke it, and why.  I'd also bet I'm not alone.  I've only read the free preview of the book, but chapter-wise it covers everything I'm looking for in an Expert in Replication (other than experience).

Mirroring:  I would put Robert Davis (@SQLSoldier | Blog) in the realm of Mirroring where Grant and Allan sit for their respective topics.  Robert has blogged incredibly useful, real-world information about mirroring.  AND YES, I understand that Always On Availability Groups are the way to go.  However, not everybody is on SQL 2012, and a solid understanding of Mirroring allows you to better understand all the goodness that is Always On Availability Groups.

Hardware and Virtualization:  When it comes to hardware you don't get much better than Glenn Berry (@GlennAlanBerry | Blog).  His free assessment scripts on SQL Server Performance (Glenn's are here) are essential when you go onto a new server for the first time and try to holistically figure out what is going on, right and wrong.  The first chapter of the book alone taught me more about CPUs, and which to choose, than years of experience had.  I was able to use this knowledge immediately.

Performance Indexing: Jason Strate (@StrateSQL | Blog) and Ted Krueger (@Onpnt | Blog) are incredibly smart guys.  SQL MVPs, years of experience, and deep knowledge all combine to give you an answer to the age-old question 'What should I index and why?'.  Indexing is a core thing that DBAs should know about.  Adding, removing, finding good ones, and identifying bad ones are all important.  Not to mention the answers to all of my previous questions are in this book.

SQL Server 2012 / SQL Azure / PowerShell:  I work with some pretty smart guys, and the SQL MVPs, Consultants, and friends they write with are just as smart.  These two books are a collaborative effort between brilliant people: Adam Jorgensen (@AJBigData | Blog), Brian Knight (@BrianKnight | Blog), Jorge Segarra (@SQLChicken | Blog), Patrick LeBlanc (@PatrickDBA | Blog), Aaron Nelson (@SQLVariant | Blog), Julie Smith (@JulieChix | Blog)…and MORE (sorry to the people I left out)!  If you are looking for information on SQL 2012 and how to use it, go to the Bible and their other book on Professional Administration.

 WRAP IT UP!

That's a lot of books, I know, and no, I don't expect you to read all of them before an interview, but there are a lot of common themes in the world of SQL Server.  A good expert should be EXCITED about what they learn.  They should be able to pick something, tell me what they know, and do it in a way that has me excited about it by the time they finish.

I love going to SQL Saturdays, PASS events, and conferences because they make me excited to learn.  And I really love to learn.  Find something that you are passionate about, and learn it really well.  That kind of learning and passion is infectious and is exactly what makes all of the authors I've mentioned such great SQL Server professionals.

Hopefully, whether you're looking for a job or not, this list will help you find something that you love to learn about.

Thanks,

Brad


Thursday, April 28, 2011

Transparent Data Encryption & Backwards Compatibility

We were working with Chris Mitchell (Twitter) from the Microsoft Technology Center in Atlanta the other day, out of the Microsoft office in Tampa.  The office has a beautiful and very distracting view of the Bay area; see the picture below, which my friend Dan Taylor (blog|Twitter) took.



One of those views that makes you love living in Florida!  

But I digress.  We were having a really great conversation about which features are enabled or disabled when you put a database into compatibility level 80 while it is running on a 2008 R2 instance.

When having these discussions you typically go feature by feature (or at least that is what I've heard people do): Database Compression will not work, FILESTREAM will not work, Encryption will not work.  But when we brought up encryption, I asked: does that mean Transparent Data Encryption will not work?  Neither of us knew offhand.

I've presented on this topic, so I had some scripts on standby.  I opened one up, fired it off, changed the Compatibility Level, and it still worked.

“So Balls”, you say “Prove It.”

Dear Reader I wouldn’t have it any other way!

AND….. HERE….. WE…… GO!


First we’ll create our Demo Database

/*
Create TDE Database for demo
*/
USE master
GO

IF EXISTS(select * from sys.databases where name=N'TDE')
BEGIN
    DROP DATABASE TDE
END
CREATE DATABASE TDE
GO

When we use Transparent Data Encryption we first need to create a Master Key and a Certificate in the master database.

/*
Create the Master Key for the
SQL Server
*/
USE master
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD='brad1'
GO
/*
Create the Certificate that will be used
for database level encryption
*/
CREATE CERTIFICATE DatabaseCertificate WITH SUBJECT='NunyaBiznes'
GO

Now that we've got those, let's alter our database's compatibility level to 80, SQL 2000.

USE TDE
GO
/*
Let's Alter Our Database
And Place It in 80,SQL 2000,
Compatibility Mode
*/
ALTER DATABASE TDE
SET COMPATIBILITY_LEVEL = 80

Now let's create our table and insert some data.  You'll see that I have default values set up to simulate an SSN.

/*
Create Table for filler data
*/
IF EXISTS(SELECT * FROM SYS.tables WHERE name='tdeData')
BEGIN
    DROP TABLE dbo.tdeData
END
CREATE TABLE dbo.tdeData(
    ID int IDENTITY(1,1) NOT NULL
    ,nameText varchar(100) default 'fakeSSN'
    ,ssnText varchar(100) default '111-11-1111'
    ,fillerText char(5000) default 'a'
)
GO
/*
Create filler data for TDE demo
*/
DECLARE @i int
SET @i = 0
WHILE (@i < 15000)
    BEGIN
         INSERT INTO tdeData DEFAULT VALUES
         SET @i = @i +1
    END

I do this in demos because I want people to see that when you insert data into a data file, or back it up to a backup file, the data is in plain text.  That is part of the reason you use TDE: it adds an additional layer of protection for data at rest.  So let's back up our data pre-encryption and look at it in a hex editor.
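
The backup statement itself isn't in the original scripts, so here is a minimal sketch of what it could look like; the path is only an example, point it anywhere you can open with a hex editor.

/*
Backup the database BEFORE encryption is enabled
so we can inspect the file in a hex editor.
The path below is an example only.
*/
BACKUP DATABASE TDE
TO DISK = N'C:\SQLBackups\TDE_PreEncryption.bak'
WITH INIT
GO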


Look at that, there is our social security number, 111-11-1111!  Now let's double-check our Compatibility Level.
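
A quick way to double-check it in T-SQL, rather than a screenshot, is to query sys.databases:

/*
Confirm we are still in 80 (SQL 2000) Compatibility Mode
*/
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'TDE'
GO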


And now let's enable encryption.  We create a Database Encryption Key using the Certificate we made earlier, and specify our algorithm.  Then we set database encryption to on.

/*
Create Database Encryption Key
*/
USE TDE
GO
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE DatabaseCertificate
GO


/*
Set The Encryption On
*/
ALTER DATABASE TDE
SET ENCRYPTION ON

To validate the encryption process we'll use a query against sys.dm_database_encryption_keys, written by SQL MVP Jasper Smith (blog), that I've loved to use for years now.

SELECT
    CASE e.encryption_state
        WHEN 0 THEN 'No database encryption key present, no encryption'
        WHEN 1 THEN 'Unencrypted'
        WHEN 2 THEN 'Encryption in progress'
        WHEN 3 THEN 'Encrypted'
        WHEN 4 THEN 'Key change in progress'
        WHEN 5 THEN 'Decryption in progress'
    END AS encryption_state_desc,
    e.percent_complete,
    DB_NAME(e.database_id) AS DatabaseName,
    c.name AS CertificateName,
    e.encryption_state
FROM sys.dm_database_encryption_keys AS e
LEFT JOIN master.sys.certificates AS c
    ON e.encryptor_thumbprint = c.thumbprint
   
You just execute this after you set encryption on and watch the internal process work; the size of the database will determine how long it runs.  For a large database this is a great DMV to show you just how far along the process is.  And now we are encrypted!



So let’s do another backup and open that up in a Hex Editor to see if we can find our SSN.


And as you can see the contents of the backup file look quite a bit different, and we could not find the plain-text string for our SSN.

WHAT IN THE NAME OF PRO-WRESTLING DOES THIS MEAN!?


First off, it means that if you have to keep your database in compatibility level 80, you can still use Transparent Data Encryption to secure your backups and your data files at rest.  Your log file will be encrypted too, but this occurs at the VLF (virtual log file) level, so the full log will not be encrypted until all of the existing VLFs have been overwritten.
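
For a peek at the VLFs themselves, DBCC LOGINFO (undocumented, but widely used) returns one row per VLF in the current database's log, so you can at least see how many there are and watch them cycle as the log wraps.  A rough sketch:

/*
One row per VLF in the current database's log file
*/
USE TDE
GO
DBCC LOGINFO
GO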

This opens up some other pretty cool possibilities.  I asked Paul Randal (blog|twitter) what part of the Storage Engine handles the encryption and decryption of pages.

I know the Access Methods handle compression, because pages are kept in memory in a compressed state before being passed back to the Relational Engine.  But TDE-encrypted pages are decrypted when they are read into memory, so I didn't think it was the Access Methods.

Paul confirmed that it was indeed the Buffer Manager that handles TDE. 

So that leads me to think some ROCKING things may be possible, if you know what the Buffer Manager IS COOKIN!



Thanks,

Brad