Thursday, June 16, 2011

SQL Rally Q & A Row Compression with Variable Legnth Unicode


This is another follow up on the Q & A’s that came about from SQL Rally, this question was asked by Jim Murphy (@SQLMurph | Blog).   This wasn’t so much of a question as it was a discussion during the presentation. 






I have Slide, see below, that lists the data types that can use Row Compression.  



Row Compression specifically takes all the empty space out of Fixed Length Data Types and stores them as efficiently as possible.  So in essence we take a fixed length field and treat it as if it has a variable size.

Jim very astutely pointed out that I had nvarchar listed as a Data Type that compresses even though it was not a Fixed Length Data Type and Jim wanted to know how Compression worked on an nvarchar Data Type.  To be perfectly honest I botched the answer.  But it led to a wonderful learning opportunity for me.

WHAT IN THE UNIVERSE IS UNI-CODE





The short answer is it is a Universal Character Set that allows for many non-English Characters and is governed by the International Standards ISO/IEC committee.  The Data Types in SQL Server that use Uni-Code are NCHAR & NVARCHAR and they use the UCS-2 character set.  UCS-2 allows exactly two bytes to represent each character.


So what in the wide wide world of sports does this have to do with Compression.  This means that when I declare an NCHAR with a 250 length that it is actually a 500 byte value.  When I type ‘Bradley Ball’ into an NVARCHAR data type instead of taking up 11 bytes of space it takes up 22.  So this encoding can utilize a lot of space that it will need when moving the data to the user, but not necessarily in Storage.


SQL 2008 R2 introduced Unicode Compression to SQL Server.  So to get back to the original question:

“Row Compression works by taking fixed length strings and Compressing out the unused white space from the record, NVARCH is a variable length string how does Row Compression work with a Variable Length Data Type?”


The short answer it strips out the unused extra encoding if there is any to strip out.  The long answer, TO THE DEMO MOBILE!

DUHNA DUHNA DUHNA DUHNA DEMO

Coolness is not Compressed for Batman

So Let’s start off by making our database.

USE master;
GO
IF EXISTS(SELECT name FROM sys.databases WHERE Name=N'demoCOMPRESSIONInternals')
     BEGIN
          DROP Database demoCOMPRESSIONInternals
     END
    
CREATE DATABASE demoCOMPRESSIONInternals
GO

USE demoCOMPRESSIONInternals
GO

Then we’ll create our table and populate it with a couple rows.

IF EXISTS(SELECT NAME FROM sys.tables WHERE name=N'myTable1')
BEGIN
     DROP TABLE dbo.myTable1
END

CREATE TABLE myTable1(
     myID INT IDENTITY(1,1)
     ,productName NCHAR(500) DEFAULT 'some product'
     ,productDescription NVARCHAR(1000) DEFAULT 'Product Description'
     ,PRIMARY KEY CLUSTERED(myID)     
) ; 

DECLARE @i INT
SET @i=0

WHILE (@i<5)
     BEGIN
          INSERT INTO dbo.myTable1(productName, productDescription)
          VALUES(
                   ('some product' + CAST((@i +1) AS VARCHAR(5)))
                   ,('Here is a Generic Product Description' + CAST((@i+2) AS VARCHAR(5)))
                   )
             

          SET @i = @i +1

     END

We’ll set on Trace flag 3604, and we’ll do a DBCC IND to get our page number, *Page Numbers will vary when you do this on your own.


DBCC IND(demoCOMPRESSIONInternals, 'mytable1', 1)
go

DBCC TRACEON(3604)
Go

And here are our pages.


Now a DBCC Page on page 153.

DBCC PAGE('demoCOMPRESSIONInternals', 1, 153, 3)
GO

Now let’s grab an edited look at our records, we mainly want to see the Length to each record in its pre-compressed state.

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4

myID = 1                            

Slot 0 Column 2 Offset 0x8 Length 1000 Length (physical) 1000

productName = some product1                                                                                     
                                                                                                                 
                                                                                                                
                                                                                                                 
                                                                         

Slot 0 Column 3 Offset 0x3f7 Length 76 Length (physical) 76

productDescription = Here is a Generic Product Description2               

Slot 0 Offset 0x0 Length 0 Length (physical) 0

So we can see that even though Slot 0 Column 3 only contains 38 characters it’s length is 76.  You can see the doubling of the fixed length field as well looking at Slot 0 Column 2, even though it is an NCHAR(500) it has a length of 1000 bytes.  Now let’s apply Row Compression.


ALTER TABLE dbo.myTable1
REBUILD WITH (DATA_COMPRESSION=ROW)
go


We need to do another DBCC IND to get our newly rebuilt pages.

DBCC IND(demoCOMPRESSIONInternals, 'mytable1', 1)
go

Here are the new pages.



Now let’s do a DBCC Page on page 155 and let’s take a look at our data page and get the lengths.

Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 1

myID = 1                             

Slot 0 Column 2 Offset 0xc Length 1000 Length (physical) 13

productName = some product1                                                                                     
                                                                                                                
                                                                                                                
                                                                                                                 
                                                                         

Slot 0 Column 3 Offset 0x19 Length 76 Length (physical) 39

productDescription = Here is a Generic Product Description2              

Slot 0 Offset 0x0 Length 0 Length (physical) 0

We see that the length is reduced greatly.   Slot 0 column 1 going from 4 bytes down to 1.  We see that Slot 2 went from 1000 Characters down to 13.

But the answer to the question lies in column 3 which went from 76 bytes down to 39.  When your nvarchar strings get to be even longer that get’s to be more space savings.  We did a nvarchar 1000.  If we maxed out that value it would be 2000 characters.

So let’s do that, we’ll update our values to max out our strings.

DECLARE @i INT
SET @i=0

WHILE (@i<5)
BEGIN
     UPDATE dbo.myTable1
     SET productname = (REPLICATE('a', 499) + CAST(@i AS VARCHAR(1)))
     WHERE myID=@i

     UPDATE dbo.myTable1
     SET productDescription = (REPLICATE('b', 999) + CAST(@i AS VARCHAR(1)))
     WHERE myID=@i

     SET @i=@i+1
END

And now we’ll take a look at page 155 again.
DBCC PAGE('demoCOMPRESSIONInternals', 1, 155, 3)
go
And what are the lengths of our slots?
Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 1

myID = 1                            

Slot 0 Column 2 Offset 0xc Length 1000 Length (physical) 501

productName = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa1           

Slot 0 Column 3 Offset 0x201 Length 2000 Length (physical) 1001

productDescription = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbb1                                

Slot 0 Offset 0x0 Length 0 Length (physical) 0

WOW, so even when we max out our fields, Row Compression still provides us with savings.  Slot 0 Column 2 is 1000 bytes in length, but it’s physical storage is only 501 bytes, and Slot 0 Column 3 our variable length field is physically 2000 bytes but is stored in 1001 bytes.

I’m sure Dear Reader you have noticed that instead of an even 1000 it is 1001, and I would love to tell you that I know why, but I don’t.  I suspect that the 1 byte is some sort of internal byte that contains the information needed for a Unicode Character to be translated by the access methods after compression.  But your guess would be as good as mine.

Thanks,

Brad


1 comment:

  1. Thanks for the followup Brad. It is cool how SQL Server rips out the 2nd character in a Unicode string if it is not used. Very cool.

    ReplyDelete