Tuesday, April 19, 2011

Let's Talk About Indexes: An Introduction

A very good friend of mine, whose got this kick-ass job working for the President J, wrote me a question about indexes.  And I thought how better than a phone call and a follow up blog talking about indexes in general.  The question centered on the uses of Clustered Indexes vs. Non-Clustered Indexes and the use for each.

CLUSTERED vs NON-CLUSTERED


The good news is this really isn't a VS kind of thing.  Clustered Indexes and Non-Clustered Indexes are both indexes.  They are both made of of an internal B-Tree structure.  




When you think of a B-Tree, think of your standard Hierarchy.  There are 3 named levels to a B-Tree structure.  The Root Level which is our top level and contains values that direct a path towards the next level, the Intermediate level and just like the Root Page points to the next level of Data, and finally the Leaf Page.

One of difference between a Clustered Index and a Non-Clustered Index is that a Clustered Index physically sorts the Data, based off of the Clustered Index.  Another difference is you can only have one Clustered Index per Table, but you can have multiple Non-Clustered Indexes.  The Clustered Index is most commonly thought of as a Primary Key, (the Clustered Index doesn't have to be the Primary Key, but that could get a little confusing so we'll save that for another day).

So we'll say that our B-Tree up above is using the letters of the Alphabet as it's Clustered Index.  I'll issue a quick T-SQL Select to get our letter.

SELECT
    *
FROM dbo.alphabet
WHERE
    letter='G'

If I wanted to go get the letter G, at the Root Level I deduce that G comes after A and before J so I would take a pointer and go to the Intermediate level that contained A, C, F, and H, From there I would further deduce that G is after F and before H, and I would travel down to our Leaf Page where I would find the letter G.

*IF your query's first Search Argument is the Clustered Index, and your range of values are narrow, then you will use your Index structure in a manner called a Seek.  If you have to read the entire table to get the contents of your information back, it is called Scan.  Scan's have their place and are not necessarily a bad thing.  But it is entirely dependent on the amount of data being returned, and the route taken to search for it.



One of the best examples I can think of to describe what an Index does, and further illuminate our B-Tree Structure, and show how  Clustered Indexes and  Non-Clustered Indexes  work together is the good old Phone Book.  If I told you to find Bradley Ball's address in the phone book you've got 2 ways of finding my name.

1. Turn through the book 1 page at a time until you get to Bradley Ball, and then read his address.

2. Flip to the B's (our Clustered Index) and scan until you get to Ball.  Now Scan Ball for the first name of Bradley (our Non-Clustered Index), and find the address.  You just performed a Seek, using the search arguments of Last Name and First Name to find Address.  

So now let's toss out a little vocabulary, to help further this discussion along.

HEAP- A heap is a table that has no clustered index.  Data is stored on a head in the order that it is inserted.  This means it is easily fragmented, and could require scanning as selecting data would require a full scan of the heap in order to find it.

CLUSTERED INDEX- A Clustered Index is a Unique value, that determines the physical sort order of the data as it is stored on a page.  The leaf level of a Clustered Index is the data of the table.  Clustered indexes have a limited key size of 900 bytes.

NON-CLUSTERED INDEX- A Non-Clustered Index is an index placed on a row in a Table, the leaf level of a Non-Clustered Index contains the data that makes up the Non-Clustered Index, as well as a physical pointer to the leaf level of the Clustered Index.  You can have many Non-Clustered Indexes on a Clustered Index Table, or a Heap.  Non-Clustered Indexes have the same 900 byte limitation and are limited to 16 columns, you can get around this using Include Columns.

"So Balls", you say, "Indexes sound great, we should put them on EVERYTHING! Right?"

I'm Glad you asked Dear Reader, and the response, regarding OLTP systems, is NO.

BALANCING ACT

Every Index that you place on a table has a cost of additional overhead.  You want to use them to speed up your database, but if you're not careful you could grind it to a halt.  

If you have a table with a Clustered Index, when you insert and update pages, and GROW the data, you are increasing your over head, slightly and it's worth it, to make sure that your Root and Intermediate level pages have the correct values to point to your leaf pages.

When you add a Non-Clustered Index, every time you insert a value for that Index you write to your table AND your Non-Clustered Index.  So if you Add another, and another, and another, then suddenly every time you write 1 record to your table, you could have many, many, many writes to keep all of your Indexes updated.

One of the first things I look at when I get a call about a "performance" problem on a database is to look at the indexes.  I've had instances where I have found tables with more indexes than columns.


COVERING INDEXES

The final high level topic that I want to talk about is Covering Indexes.  A Covering Index is an index that satisfies all of the return requirements of a query.  You could take all of the fields that you want to return, limit 16, in a query and if you add them to a single Non-Clustered Index you can do so.

One way to make a covering index is to add each column to the Non-Clustered index, another is to add additional columns as Include Columns.  Include columns allow you to get around the 900 byte limitation of an index, because Include columns are not stored as key values.  There are some data types that are not allowed to participate in Included Columns, Text and Image datatypes.  You can have 1023 Included columns on an index.

I just want to point out thought that, 1023 should not be a goal.  Every value you add, will add additional overhead to your leaf level Non-Clustered index.  Be sure to take baselines before and after, in order to know if your Index changes are having a positive affect on your query performance.




Thanks,

Brad


No comments:

Post a Comment