Monday, October 10, 2011

SQL University Compression Week: Lesson 1 About Compression

Hello Dear Reader!  Welcome to Lesson 1 on Compression.  I want to thank the Chancellor of SQL University Jorge Segarra (@SQLChicken | Blog)  for the opportunity to be your  Professor of Compression.  Before we talk about how to turn it on, where we should apply it, or even the types of compression that there are available to us there are two burning questions that we need to answer.  What is Compression and Why should I use it?

Let’s get a 40,000 foot overview of Compression.  When the good folks at Microsoft were looking at adding Compression into SQL Server the first thing they had to answer was, What kind of Compression will we use?  If you are new to the Compression game then you may not know that there are a lot of different ways to Compress Data.  Sure they all do the same thing, reduce the overall size and foot print of the Data you are Compressing, but we use different types of Compression differently.

On one end of the Scale you have Compression that reduces the overall size of the file but it does so buy stripping out what it deems as unneeded or necessary data.  The best example I can think for this is photography.   A RAW photograph file is huge.   But it contains every last bit of detail that your camera can capture.   When you look at some of the breathtaking pictures that line the magazines of scenic landscapes, athletic accomplishment, or new events, those photographs were taken in a format to capture all of the data.  When you browse USA Today on your mobile phone, or on your computer you receive this photograph in JPEG, GIF, or PNG format not in RAW.  The reason it’s smaller.  

“But Balls,” you say, “How did it get smaller?” 

The answer Dear Reader is Compression.  JPEG, GIF, and PNG are Compressed file formats for pictures.  Why do we Compress pictures?  Because the human eye can only detect so much detail.  Some people can detect more than others, but eventually we hit a threshold were we cannot detect anything more.  So the extra Data becomes unnecessary overhead at that point.  So we strip it out.  

Jim Grey went through the trouble to come up with ACID, We Need to Talk Your Database is on ACID, and Microsoft had adopted it into the core programming of SQL we can assume that they would not want to strip out data.  As Database professionals we would hope that we are preserving our data, and we are.

So we know the kind of Compression that we will use must preserve our Data,  the next question is how hard do we want to work for it?   We could use something like Winzip, and Compress our files down to the lowest possible value, preserving our data and getting it really really small.  However that only works if you aren’t planning on doing that every time you want to read it.  Have you ever zipped a 2 GB file?  It takes a little while.  Have you ever unzipped a 2 GB file, takes a little while too.  Imagine if every time you wanted to read your multi GB Database you had to sit through the same scenario.

We are still operating a Database, while reducing our Data footprint is laudable we wouldn’t want to do it at the expense of performance.  As a matter of fact if at all possible we would want performance to increase not decrease.

I know we are picky and asking a lot, Shrink the data, but don’t loose any, make it smaller but easy to access, and if possible make it work faster (in some not in all cases).   Microsoft listened and the particular flavor of Compression we got was Page & Row Compression. 

The first stab at Compression actually came up in SQL 2005 SP2 and it was called Vardecimal Compression.  It only worked on the Numeric and Decimal Data Types, but it used concepts that we will be reviewing when we discuss Row Compression.   In SQL 2008 we got Page and Row Compression and Vardecimal Compression was immediately deprecated.  You can still use Vardecimal Compression in 2008, but if you really want to use Compression you will get a lot more bang for your buck with Row Compression, more on that tomorrow.


Now that we understand the basic idea of what Compression is, the most important question you will ask is Why should I use it!  Disks are cheap.  I hear it all the time.  You’ve probably seen the ad’s for 1 TB hard drives for sale on Black Friday for $30.  You are correct Disks are cheap and they are getting cheaper all the time, however, I bet you don’t go and buy the Hard drives for your server on Black Friday.

As a matter of fact if you have SAN, Shared Attached Network storage?, then you need to buy different disks altogether.  You probably need to have a port in the back of them that allows for a HBA, Host Bus Adaptor, that takes a fiber cable.  The reason?  If you are using a SAN device you want the quickest possible connection for your database server.   And those HBA ports raise the price of those hard drives.  And if you are using Solid State hard drives on a SAN that price goes up even more.  Now we are talking from 10’s of dollars to 100’s of dollars.  And if you have a SAN your not going to have just one hard drive in there.  You will have multiple drives striped in some sort of RAID Array.  How cheap does it sound now?

But wait there’s more.  If you have a lot of SAN’s that can get pretty complicated.  Your probably going to need a SAN administrator.  And you may need more than one if you are a big enough shop.  So now we’re not just talking hardware but people as well.

All this for a database?  One lowly single database?  I would say most of us don’t have one Database to worry about.   You probably have a Prod (Production) environment, maybe a UAT/Staging, a Test, a Dev (Development) or even a DR/COOP.   You may have a Data warehouse as well.   You could have multiple copies of your database for multiple projects.

And you probably don’t have just one Application you are supporting.  Most places have 10’s, 100’s, and in some cases even 1,000’s of Database across just as many instances that they maintain.  When you start thinking about the scope that your data touches, you start to realize just how much you have out there.

As Database professionals we try to normalize our data to prevent data duplication, but by nature of the change control process we will always have it.  On top of that there are 3 performance bottle necks we normally encounter; CPU, Memory, and IO (Input/Output). 
The option to re-write code is always there, but how often have you been in a situation where management decides to throw heavier hardware at a problem because it is cheaper and more cost effective than a re-write?  If you haven’t been there you will eventually.  You can toss more CPU on a server to get more processing power.  More RAM to get more memory and to speed up operations because we can hold more in memory.  But with Disk, if I throw more disk at a server I just get more storage room, it doesn’t help me with performance.

Page and Row Compression can do just that.  The cost to the server is CPU cycles which most database servers have to spare.  *Just a side note if your server is averaging around 60% CPU utilization or higher I would not recommend using Compression without thorough testing to ensure it would not reduce existing performance.  We reduce the size of the structure not only on disk but in memory.  Pages of a databases are compressed on DISK and in Memory.  This is such an important concept I’m going to say it one more time, on DISK and in MEMORY!  So not only do you save room, reduce the size of a table or index.  Make it quicker to read off of disk because of a reduced size, but you take up less Page Cache because your Data Pages are compressed in memory.

 So  you can reduce your data foot print across multiple environments, possibly speed up performance by reducing the number of reads to process a query, and possibly use your memory cache more efficiently.   I don’t think the question is why should I use Compression, but why am I not using compression.


I’m not going to sell you a bill of goods, that is just snake oil.  There are some serious things that you need to keep in mind when it comes to Compression.
  1. Know Thy DATA!
  2. Take Baselines
  3. Know Thy DATA!

I like to think of Data Compression like Indexes.  You wouldn’t just put an index on a table and never check to see if it is being used.  You would check to see that the query plans you have are still running efficiently, and you would monitor your performance.

You need to do the same thing with Compression.  Know how your queries perform before hand, know how they perform after Compression.  If your CPU shoot’s up you will need to trouble shoot it, we’ll cover this later in the week.  If your data types are not right, don’t compress them in the first place, later in the week as well.


Your Assignment should you choose to accept it is to go read some of the Blogs by the SQL Server Storage Engine Team on the Topic of Compression, Sunil Agwilari is the Project Manager for the Engine Team and he has written a lot on this topic.

Data Compression: Why Do we need it?

Data compression techniques and trade offs



No comments:

Post a Comment