The columnstore index, a feature in SQL Server’s Database Engine, received updates and a major addition in SQL Server 2014. Columnstore indexes debuted in SQL Server 2012 as a solution for slow data warehouse queries.
What is a columnstore index?
A columnstore index is a technique for storing, retrieving, and managing table data that is physically stored by column. Traditionally, data has been stored in a row-based format, most commonly, in clustered indexes and nonclustered indexes. This format can be slow to execute queries on large amounts of data. Storing data in a columnar format improves the performance of data warehouse queries by decreasing disk access and memory usage.
When you create a columnstore index, SQL Server first divides the table’s rows into rowgroups. Each rowgroup holds between 102,400 and 1,048,575 rows and consists of column segments, one for each column. SQL Server compresses each column segment within the rowgroup and places them in the columnstore. Each column segment is stored on a separate data page.
What is Batch Execution Mode?
Introduced in conjunction with columnstore indexes, batch execution mode utilizes parallel processing in modern CPUs. Batch processing allows queries to execute a batch at a time rather than a row at a time. SQL Server 2014 supports more query operators than the 2012 release, including all join types, union all, and scalar aggregation. SQL Server divides the rows into batches of about 900 rows. Batch mode results in an overall reduction in the number of CPU instructions. Rows in a batch share the instruction cost because a query executes the instructions for each batch rather than for each row.
How does changing the physical layout of stored data improve performance?
Four features of columnstore indexes improve query performance:
· Individually accessed columns
Columnstore indexes store columns individually, which allows queries to fetch a subset of the total columns into memory. For example, a query that requires three of 30 columns in a table would pull 10% of the total columns into memory, whereas the same query using row storage would pull 100% of the data into memory.
· Data compression
Generally, data within a column is more likely to repeat than data within a row; hence, data in columnar format compresses more effectively. Compression for each column segment depends on the values within the specific column segment. Certain data types lend themselves to better compression rates; integers, such as surrogate keys in fact tables, will be more repetitive than long varchars. Compressed columns reduce the bytes the number of bytes the CPU reads and moves in to memory.
· Batch execution mode
As I mentioned before, batch mode takes advantage of modern computers’ multicore processors and increased memory throughput. Batch mode is vector-based, meaning that each column is stored as a vector in a separate space of memory. Vector processing increases CPU-utilization, resulting in better parallelism and faster performance.
· Column segment elimination
In addition to only fetching the required columns, columnstore indexes can filter the column segments that are pulled into memory based on a value specified in the query. If a segment does not contain a certain value, SQL Server does not pull the segment’s data into memory. Segment elimination is another reduction in I/O and memory usage.
What is different about the columnstore index feature in SQL Server 2014?
SQL Server 2012 supports only nonclustered columnstore indexes. The nonclustered columnstore index has a major limitation; that is, when you add one to a clustered table or heap, the underlying table becomes read-only. To update the table, you must either drop the index before you make changes or update the table on the side using partition switching. SQL Server 2014 includes a solution to this limitation.
SQL Server 2014 introduced clustered columnstore indexes. These indexes are updateable; users can insert, update, and delete rows without dropping the columnstore index. Although clustered columnstore indexes overcome the main limitation of nonclustered columnstore indexes, they are not without their own restrictions. For instance, clustered columnstore indexes must be the only index on a table. All columns must be included in the index; therefore, all of the columns must be columnstore-supported data types. SQL Server 2014 supports most data types besides exotic types, such as, varchar(max), binary, and image. Microsoft’s developer network has a complete list of supported and unsupported data types.
Another feature added in 2014 is archival mode. In archival mode, SQL Server compresses data even more resulting in better utilization of memory. More compressed data costs more to query, so you should only use archival mode with old data.
How is the clustered columnstore index updateable?
As previously mentioned, a nonclustered columnstore index’s data is stored in a columnstore. Clustered columnstore indexes have a columnstore and a deltastore. The deltastore makes the index updateable. The deltastore in each clustered columnstore index is a type of rowstore containing the leftover rows after the table splits into rowgroups. Through insert and update operations, the deltastore accumulates rows until it is large enough to create a rowgroup and move into the columnstore.
The deltastore handles updates to a clustered columnstore index through insert, update, and delete operations. For the insert operation, rows are added to the deltastore. Deleted rows are marked in the delete bitmap associated with each column segment, and are not actually deleted until the table is rebuilt. Update operations act as both an insert and a delete. The updated row is inserted into the deltastore, and the previous instance of the row is marked for deletion. If the row to update or delete is in the deltastore, the row is updated or deleted the same as in a rowstore.
When should I use a columnstore index?
SQL Server developers designed columnstore indexes to enhance the performance of data warehouse queries. Columnstore indexes are most effective when used in scanning and aggregating large amounts of data and when joining multiple tables. Typically, columnstore indexes are added to large fact tables that join with multiple small dimension tables in a star-schema. Using batch processing in conjunction with a columnstore index greatly improves the speed of queries.
In addition to adding a columnstore index to fact tables, you can use a columnstore index if your entire data set will not fit into memory. With the improved compression in a columnstore, your data would better fit in memory.
Though columnstore indexes improve query performance, they are not beneficial in every situation. First, the performance advantages of columnstore indexes are less noticeable on small tables, those with less than one million rows. Additionally, a columnstore index may not fit your solution if you will be selecting all rows for a query. In this case, you are not benefiting from the selective column feature.
Lastly, row storage is typically better when querying for a single row or small range of rows that can be quickly located using a B-tree index because there is a cost required to piece the rows back together from a columnstore index.
How do I know if my query is using the columnstore index?
The execution plan shows each step of the query execution. If the part of the query is using a columnstore index, you will see the Columnstore Index Scan icon.
When you hover over the index scan, the Actual Execution Plan appears which displays three new properties: Actual Execution Mode, Estimated Execution Mode, and Storage. Storage will have the value ColumnStore or RowStore. When storage is ColumnStore, the execution mode can be batch or row. The screenshot below shows a columnstore index scan running in batch mode with 1044 batches containing a total of 483,184 rows.
SQL Server executes in batch mode only if parallelism is enabled and columnstore storage is used in the scan. If your scan is using columnstore storage and row execution mode, check to see that your max level of parallelism is set to zero (default) or a number greater than one. Parallelism of one means that only one process can run at a time; therefore, parallelism is not allowed.
Will I really see an improvement in my query execution?
Yes, I tested columnstore indexes running in both batch and row mode against clustered indexes, and found the improvements using a relatively small table.
I executed the following query on Microsoft’s AdventureWorks2012DW database to increase the number of rows in the dbo.FactInternetSales table from 60,398 to 966,368.
I ran this query to compare the execution times and cost of columnstore indexes running in batch execution code, columnstore indexes running in row execution mode, and clustered indexes (rowstore).
With just under one million rows in dbo.FactInternetSales, my query results demonstrate major differences in query performance when using a clustered index, a columnstore index running in row execution mode, and a columnstore index running in batch execution mode. The table below shows a few of the execution costs and an average execution time from the query. As expected, the nonclustered and clustered columnstore indexes returned similar costs and execution times, so the table displays just the results of the nonclustered columnstore index running in both batch and row execution mode.
As the table indicates, the estimated I/O cost and estimated operator cost significantly decreased from the clustered index to the columnstore index. Execution time improved from 4.3 seconds with the clustered index to 3.6 seconds with a columnstore index running in row execution mode; furthermore, the more noticeable improvements occurred between the columnstore index running in row execution mode versus batch execution mode. To run the query in batch mode, I increased the max parallelism from 1 to 2, which increased the speed of the query to less than one second.
|Clustered Index (RowStore)||Columnstore Index – Row Execution Mode||Columnstore Index – Batch Execution Mode|
|Execution Time||4.3 seconds||3.6 seconds||.5 seconds|
|Estimated I/O Cost||29.9||0.3||0.1|
|Estimated CPU cost||1.1||1.1||0.05|
|Estimated Operator Cost||31.0 (70%)||1.4 (9%)||0.2(8%)|
Columnstore indexes are a new technology for storing, retrieving, and managing data in columnar format that greatly improves performance of data warehouse queries. Next time you are working with a large data table, consider adding a columnstore index to decrease the I/O and memory usage.