Moer and Éric Moreau's web site

(Print this page)

Column Store Index

Published date: Saturday, November 24, 2012

On: Moer and Éric Moreau's web site

Original URL: http://www.emoreau.com/Entries/Articles/2012/11/Column-Store-Index.aspx

The ColumnStore index is a new of feature of Microsoft SQL Server 2012. If used correctly, it can greatly improve query performance in some specific scenarios.

Because of the nature of a data warehouse, this new feature is more useful with that kind of databases. Some people have seen improvements in speed up to 100 times. But due to ColumnStore limitations (more on this later) don’t even think to use it in OLTP databases.

Maybe the biggest advantage of a ColumnStore index is that because we often have duplicates in columns, this kind of index can aggressively compress the data and thus reducing costly disk I/O.

If you ever heard of Vertipaq, the ColumnStore index is using this technology (which belongs to them). Other words like xVelocity and Apollo are related to the ColumnStore index feature.

What you need

If you want to test the queries shown here related to ColumnStore index, of course you will need Microsoft SQL Server 2012. Don’t even try this on older versions, it just won’t work.

RowStore vs. ColumnStore

These are the 2 kinds of storage offered by Microsoft SQL Server 2012.

The row store is probably what you are used to. All the columns of a record are stored side-by-side on a single line in a page. There will be as many rows as the page can contain (which varies according to the record length). An example of a RowStore is shown in figure 1.

Figure 1: RowStore index

On the other hand, ColumnStore index will create pages based, as its name says, on columns rather than rows. Each individual column will be stored in different pages and a page.

Figure 2: ColumnStore index

Impact on performance

You now might wonder how the layout can impact on the performance so greatly. The reason is simple. Because your pages only contains a single columns, a page contains much more of that information you are working with. That means that you have much less disk I/O to retrieve the data you need. Secondly, because the data is more easily compressed, it takes less space in memory. That means that you can have more data in memory and memory operations are much faster than disk operations. These are known as the buffer cache hit ratio (the greater this value is a synonym of performance).

Reducing disk I/O by limiting the number of pages

In a RowStore index, if you run a query retrieving only some columns from the table shown in Figure 1 (SELECT C1, C2, C3 FROM T1), the pages processed by the queries will contains all the columns (including C4 to C10) even if we are not using them (I am not saying that they will be returned in the result set, but they will at some point be transferred to memory before being removed). If you have a very large table, this behavior will generates a lot of useless disk I/O. The same query against ColumnStore index will only handle pages containing C1 to C3. The data of C4 to C10 will never be processed.

Reducing disk I/O by compressing the data

For example, if your table contains addresses and your column C3 contains cities, chances are that you have multiple addresses with the same city. If C3 is stored on its own page, the redundancy of the cities across your records will result in a better compression (compared to a RowStore index).

When to use?

This ColumnStore index feature is a serious contender for performance improvement when you are building data warehouses, especially if your column values have redundancy (like the city).

A single table can have both RowStore index and ColumnStore index. So that means it is not an “all or nothing” solution. When you have both kind of indexes, the query optimizer will decide which one to use.

You should use ColumnStore index on large tables. In data warehouses, your fact tables usually get benefits. This not true with your dimension tables which normally contains lot less data. If your dimension table contains more than a million rows, only then you should consider a ColumnStore index.

When not to use?

If your database is used heavily by an OLTP application, ColumnStore index won’t work for you. Tables with a ColumnStore index are read-only.

Also, if your queries often return a few rows, using a ColumnStore index might with more disk I/O has the number of pages might be greater (since each column is stored in its own set of pages).

Limitations

I am not sure if it is the nature of the ColumnStore index or because it is its first version, but ColumnStore suffers from many limitations. Amongst these, we find:

Only one ColumnStore index can be created on a table. But this index can contains many columns (up to 1024 columns). Very often, it will contains all the columns of the table.
Another limitation is that the table is read-only while a ColumnStore index is being created. That means that you need to be very cautious when creating that kind of index on a large table.
Loading or updating a large number of rows of a table with a ColumnStore index might take more time than dropping the index, loading/updating the data, and recreate the index.
A ColumnStore can only be created on a table. You might be tempted to create on a view but you will lose your time trying.
There is no notion of clustered/non-clustered index for ColumnStore. They are always non-clustered.
Most simple data types are supported. The exceptions are decimal or numeric with precision > 18, datetimeoffset with precision > 2, binary, varbinary, image, text, ntext, varchar(max), nvarchar(max), cursor, hierarchyid, timestamp, uniqueidentifier, sqlvariant, xml.
ColumnStore is not compatible with filtered index, computed columns, sparse columns, unique attribute, replication, filestream, change tracking.
A ColumnStore cannot be a primary key or be used by a foreign key.
There is no ALTER syntax to modify a ColumnStore index. You need to drop it and recreate it.
Not available on SQL Azure yet.

One very important limitation is that tables exposing a ColumnStore index are read-only. If you try to run an INSERT, UPDATE, DELETE, or MERGE query, your execution will return an error message. You can disable the index for the time of data modification and reactivate it like this:

ALTER INDEX mycolumnstoreindex ON mytable DISABLE;
-- update the data -- 
ALTER INDEX mycolumnstoreindex ON mytable REBUILD;

Syntax of the ColumnStore index creation

The syntax is really similar to the RowStore index to which you are used too.

CREATE NONCLUSTERED COLUMNSTORE INDEX mycolumnstoreindex ON mytable (col1, col2, col3)

The order specified for the columns is not important as each one will get its own pages.

ColumnStore index creation using the wizard

Wizards are often useful. Creating a ColumnStore index is an example of a useful wizard. To use the wizard, you need to dig a couple of levels deep as shown in figure 3.

Figure 3: Starting the Wizard

Then the dialog shown in figure 4 will pop on the screen letting you set the properties of your new index.

Figure 4: The ColumnStore index dialog

When the dialog opens, it already contains an error (yellow line at the top). It reminds you that you need to select columns. To do so, just click the Add button and that will display another dialog with all the possible columns as shown in figure 5.

Figure 5: Selecting columns

We normally click the checkbox to the left of the Name label to select all the columns at once.

Preparing for performance comparison

For the sake of testing, I have created a test database containing a test database simulating a Client table in a data warehouse. This is my script that creates the table so that you see my fields:

CREATE TABLE [dbo].[Clients](
	[ID] [int] IDENTITY(1,1) NOT NULL,
	[LastName] [varchar](50) NULL,
	[FirstName] [varchar](50) NULL,
	[SSN] [varchar](11) NULL,
	[Title] [varchar](5) NULL,
	[EMail] [varchar](50) NULL,
	[Phone] [varchar](50) NULL,
	[AccountNumber] [varchar](50) NULL,
	[AddressLine] [varchar](50) NULL,
	[City] [varchar](50) NULL,
	[State] [varchar](50) NULL,
	[Country] [varchar](50) NULL,
	[Zip] [varchar](10) NULL,
 CONSTRAINT [PK_Clients] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)
)

I then used Red Gate SQL Data Generator to quickly generate 1,000,000 rows into the test table.

I also have created a second table (which I named ClientsColumnStore) with exactly the same definition. To ensure I have exactly the same data in both table, I used an INSERT INTO query to copy all the data from the Clients table to the ClientsColumnStore table.

Finally, using the wizard, I created a ColumnStore index on the ClientsColumnStore table containing all its columns.

Ready for comparison

We are now ready to test. I will use the same query on both table to compare the performance. Here are my 2 queries:

SELECT Country, Title, COUNT(*) FROM dbo.Clients GROUP BY Country, Title
SELECT Country, Title, COUNT(*) FROM dbo.ClientsColumnStore GROUP BY Country, Title

I decided to group on 2 fields to ensure that more than a single pages was used in the ColumnStore index.

The first results I compared was to run both queries at the same time showing the Actual Execution Plan (you can press CTRL-M to activate it). Figure 6 shows a clear winner. The RowStore took 97% of the total time (only 2 queries) to execute.

Figure 6: Execution plan result

Wow! It is clear that the ColumnStore index for that kind of operation is a clear winner.

Still in the Execution plan result, if you hover the “ColumnStore Index Scan” node, you will see a result like the one shown if figure 7 showing clearly that we are using the ColumnStore index feature.

Figure 7: Execution Plan details

I than wanted to check the number of reads. I turned off the Actual Execution Plan and activated some statistics with these queries:

SET STATISTICS TIME On
SET STATISTICS IO On

I reran exactly the same queries and now watched the Message tab which shows this result:

SQL Server parse and compile time: 
   CPU time = 5 ms, elapsed time = 5 ms.

(2070 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Clients'. Scan count 5, logical reads 19680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 969 ms,  elapsed time = 317 ms.
--------------------

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

(2070 row(s) affected)
Table 'ClientsColumnStore'. Scan count 4, logical reads 274, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 202 ms.

Here, we clearly see that the first query generated a total of 19,680 reads compared to 274 for the second query. This is why the second query (based on the ColumnStore index) is a clear winner.

And don’t forget we only have a million records. Your data warehouse probably contains way more!

Conclusion

ColumnStore index is a feature you need to seriously consider if you have a large data warehouse running on Microsoft SQL Server 2012. Your performance can be greatly improved if you use it properly.

You will definitely stay away from it if you are managing an OLTP database.

(Print this page)