Wednesday, August 8, 2012

How to Data Compress Varchar(MAX)



I talk a lot about compression.  I’ve blogged a pretty decent amount on it as well.  One of the things that often confuses people is what can and cannot be compressed.  There is a list of data types that can be Row Compressed, and that list differs between SQL Server versions.  Page Compression, on the other hand, works at the binary level; it is data type agnostic.

The big determining factor is what type of Allocation Unit your data is stored on.

“Balls,” you say, “what’s an Allocation Unit?”

An Allocation Unit is the structure behind the structure.  Think of real estate for a second.  Buildings and property are zoned in a city or a town: one section is for businesses, another is zoned for residential, and another may be zoned for the government.  In SQL Server we have three different zones: IN_ROW_DATA, ROW_OVERFLOW_DATA, and LOB_DATA.

Instead of being zoned just by type, your size matters just as much.  If you are a regular, everyday integer or character field, you live in IN_ROW_DATA.  You are LOB_DATA if you are a VARBINARY(MAX) that contains a 500 MB picture file.  ROW_OVERFLOW_DATA holds variable-length fields that start off on IN_ROW_DATA pages; if the data grows large enough that the row can no longer fit on an 8 KB IN_ROW_DATA page, it gets popped off the IN_ROW_DATA page and lands on a ROW_OVERFLOW_DATA page.
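
To make that concrete, here’s a minimal sketch of a row getting pushed to overflow.  The table and column names are hypothetical; the point is simply that two VARCHAR(5000) values can’t both fit on a single 8 KB page, so one of them has to move.

/*
Hypothetical example: push a
variable length column onto
ROW_OVERFLOW_DATA pages
*/
CREATE TABLE dbo.overflowTest(myid int identity(1,1)
     ,mydata varchar(5000)
     ,mydata2 varchar(5000)
     ,CONSTRAINT pk_overflowtest PRIMARY KEY CLUSTERED (myid))
GO
INSERT INTO dbo.overflowTest(mydata, mydata2)
VALUES(replicate('x',5000), replicate('y',5000))
GO
/*
The row is now roughly 10,000 bytes, past the
8,060 byte in-row limit, so one of the columns
is moved to a ROW_OVERFLOW_DATA page
*/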

The data types in SQL Server that have a (MAX) designation, XML, and certain CLR types start off on IN_ROW_DATA pages.  They get moved off to LOB_DATA pages if the size grows too large to stay in row.
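
If you want to see these zones by name, the sys.allocation_units catalog view exposes them directly.  A minimal sketch (the join condition accounts for the fact that IN_ROW_DATA and ROW_OVERFLOW_DATA units are tracked by the partition’s hobt_id, while LOB_DATA units are tracked by its partition_id):

SELECT
     OBJECT_NAME(sp.object_id) AS [ObjectName]
     ,sau.type_desc AS AllocationUnitType
     ,sau.total_pages AS TotalPages
FROM
     sys.allocation_units sau
     JOIN sys.partitions sp
           ON (sau.type IN (1,3) AND sau.container_id=sp.hobt_id)
           OR (sau.type=2 AND sau.container_id=sp.partition_id)
WHERE
     OBJECTPROPERTY(sp.object_id,'IsUserTable')=1
ORDER BY [ObjectName], AllocationUnitType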


HOW IN THE WIDE WIDE WORLD OF SPORTS


So how in the wide wide world of sports does this apply to Data Compression?  If your data is on an IN_ROW_DATA page, it can be compressed.  Row compression still only applies to the data types that are listed per version; see the Row Compression documentation over at MSDN.

Page Compression only requires matching binary patterns, so as long as we are dealing with IN_ROW_DATA pages we are good to go.  You can run this script against your database to get the Allocation Unit makeup of your tables and indexes.

SELECT
     OBJECT_NAME(sp.object_id) AS [ObjectName]
     ,si.name AS IndexName
     ,sps.in_row_data_page_count as In_Row
     ,sps.row_overflow_used_page_count AS Row_Over_Flow
     ,sps.lob_reserved_page_count AS LOB_Data
FROM
     sys.dm_db_partition_stats sps
     JOIN sys.partitions sp
           ON sps.partition_id=sp.partition_id
     JOIN sys.indexes si
           ON sp.index_id=si.index_id AND sp.object_id = si.object_id
WHERE
     OBJECTPROPERTY(sp.object_id,'IsUserTable') =1
ORDER BY sps.in_row_data_page_count DESC

The higher the IN_ROW_DATA page count, the more likely you have a candidate for compression.
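
Once you’ve spotted a candidate, you can get a rough before-and-after estimate with sp_estimate_data_compression_savings (available in the editions that support compression).  A quick sketch, using the dbo.vmaxTest table we’ll build in the demo below:

/*
Estimate the savings from
Page Compression
*/
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo'
     ,@object_name = 'vmaxTest'
     ,@index_id = NULL
     ,@partition_number = NULL
     ,@data_compression = 'PAGE'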

ON TO THE MAIN EVENT





We’ve laid the groundwork; now on to the main event.  First we’ll create our database and our table and insert some data.  I’ve got two Varchar(Max) fields, and we’ll put 2012 characters in each.
/*
Select our demo database
to use
*/
USE master
GO
IF EXISTS(SELECT name FROM sys.databases WHERE name='demoInternals')
BEGIN
     ALTER DATABASE demoInternals SET SINGLE_USER WITH ROLLBACK IMMEDIATE
     DROP DATABASE demoInternals
END
GO
CREATE DATABASE demoInternals
GO
USE demoInternals
GO
/*
Create our table
*/
IF EXISTS(SELECT name FROM sys.tables WHERE name='vmaxTest')
BEGIN
     DROP TABLE dbo.vmaxTest
END
GO
CREATE TABLE vmaxTest(myid int identity(1,1)
     , mydata varchar(max) default 'a'
     ,mydata2 varchar(max) default 'b'
     ,CONSTRAINT pk_vmaxtest1 PRIMARY KEY CLUSTERED (myid))
GO
/*
Insert 5000 rows
*/
DECLARE @i INT
SET @i=0
WHILE (@i<5000)
BEGIN
     INSERT INTO vmaxTest(mydata, mydata2)
     VALUES(replicate('a',2012)+cast(@i AS VARCHAR(5)), replicate('b', 2012)+cast(@i AS VARCHAR(5)))
     SET @i=@i+1
END
GO

If you run our script from earlier, you can see we have 4950 IN_ROW_DATA pages.

Now let’s update one of our Varchar(max) fields to 8000 characters so that we push it off of IN_ROW_DATA and over to LOB_DATA pages.  Then run our script again to get our counts.
/*
Now we'll update just the b values
to force them onto LOB_DATA
pages
*/
UPDATE dbo.vmaxTest
SET mydata2=REPLICATE('b',8000)

We certainly have some fragmentation, but we’ve added 5009 LOB_DATA pages to the mix.  Now let’s apply Page Compression and use our script again to see the results.
/*
Rebuild our table with
Page Compression
*/
ALTER TABLE dbo.vmaxTest
REBUILD WITH(DATA_COMPRESSION=PAGE);
GO


As you can see, the IN_ROW_DATA compressed, but the LOB_DATA didn’t.  This is another way that knowing thy data can help you understand what you can and should compress.
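
One caveat worth knowing: the catalog will still report the whole partition as PAGE compressed even though the LOB pages were left alone, so the page counts from our earlier script tell the real story.  A quick sketch to check the setting:

/*
Check the compression setting
on our table
*/
SELECT
     OBJECT_NAME(sp.object_id) AS [ObjectName]
     ,si.name AS IndexName
     ,sp.data_compression_desc
FROM
     sys.partitions sp
     JOIN sys.indexes si
           ON sp.index_id=si.index_id AND sp.object_id=si.object_id
WHERE
     sp.object_id=OBJECT_ID('dbo.vmaxTest')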

Thanks,

Brad


5 comments:

  1. I'm curious, have you tried moving your LOB to FILESTREAM and using operating system level compression on the FILESTREAM folder?

  2. I have not, but I'll try to take a look at that in a future blog. That should be interesting!

    If you truly need the 2 GB storage of Varchar(max), FILESTREAM is probably the way you want to go. If you have rogue developers that do not know any better and just want a varchar(max), then this is probably the type of compression that would be most helpful :)

  3. I have run the estimator tools on the Geography type and it crashes. It seems that CLR data types are not compressible; more specifically, types where the sys.types is_assembly_type column = 1 seem to not be compressible.
    Is this a correct assessment of the compressibility of these data types?

  4. Hi Bradley

    I came across this post when trying to answer some questions from students in class. You seem to imply that only data on the IN_ROW allocation unit can be compressed. Your script then says you will update values to move rows to ROW_OVERFLOW, but you are actually moving them to LOB! So what about ROW_OVERFLOW? I modified your script to use columns that actually would be moved to ROW_OVERFLOW and found that was compressed as well as the IN_ROW data. So I appreciate your script as a starting point. Thanks

    ~Kalen Delaney

  5. Wow, really handy and well explained
