Tuesday, July 8, 2014

Columnstore Table Analyzer

As I’ve discussed in some of my previous posts, there are quite a few data types that cannot be part of a Columnstore index. While there are fewer restrictions in SQL Server 2014, they still exist. I find myself constantly looking back at Books Online to make sure the columns in my tables don’t use any of those restricted data types. It would be much easier to know from day one which tables I need to redesign, or at least which columns I need to exclude from a non-clustered Columnstore index. This is why I have created the following script.

-- Find columns in user tables that cannot be included in a columnstore index.
-- These restrictions apply to both clustered and non-clustered columnstore indexes.
-- SQL Server 2014: http://msdn.microsoft.com/en-us/library/gg492153(v=sql.120).aspx
-- SQL Server 2012: http://msdn.microsoft.com/en-us/library/gg492153(v=sql.110).aspx


-- Get the version number of SQL Server
DECLARE @ServerVersion TINYINT = CONVERT(INT,SUBSTRING(CONVERT(VARCHAR,SERVERPROPERTY('ProductVersion')),1,(CHARINDEX('.',(CONVERT(VARCHAR,SERVERPROPERTY('ProductVersion'))))-1)))

IF @ServerVersion = 11
-- This section is only for SQL Server 2012
BEGIN
  SELECT 
     s.name AS 'SchemaName'
    ,o.name AS 'TableName'
    ,c.name AS 'ColumnName'
    ,'ColumnType' = CASE t.name
      WHEN 'decimal' THEN t.name + '(' + CONVERT(VARCHAR,c.precision) + ',' + CONVERT(VARCHAR,c.scale) + ')'
      WHEN 'numeric' THEN t.name + '(' + CONVERT(VARCHAR,c.precision) + ',' + CONVERT(VARCHAR,c.scale) + ')'
      WHEN 'varchar' THEN
        CASE c.max_length
          WHEN -1 THEN 'varchar(max)'
          ELSE 'varchar(' + CONVERT(VARCHAR,c.max_length) + ')'
        END
      WHEN 'nvarchar' THEN 
        CASE c.max_length
          WHEN -1 THEN 'nvarchar(max)'
          ELSE 'nvarchar(' + CONVERT(VARCHAR,c.max_length / 2) + ')' -- max_length is in bytes; nvarchar uses 2 bytes per character
        END
      WHEN 'datetimeoffset' THEN t.name + '(' + CONVERT(VARCHAR,c.scale) + ')'
      ELSE t.name
     END
    ,'ColumnAttribute' = CASE 
      WHEN (c.is_filestream = 1) THEN 'Filestream'
      WHEN (c.is_sparse = 1) THEN 'Sparse'
      ELSE ''
    END
  FROM sys.columns c 
  JOIN sys.objects o ON c.object_id = o.object_id
  JOIN sys.types t ON c.user_type_id = t.user_type_id
  JOIN sys.schemas s ON o.schema_id = s.schema_id
  WHERE o.is_ms_shipped <> 1
  -- These types cannot be part of a SQL Server 2012 columnstore index
  AND (
    t.name IN 
         ('binary'
      ,'varbinary'
      ,'ntext'
      ,'text'
      ,'image'
      ,'uniqueidentifier'
      ,'rowversion'
      ,'timestamp'
      ,'sql_variant'
      ,'hierarchyid'
      ,'geography'
      ,'geometry'
      ,'xml')
  OR (
      -- Decimal & numeric cannot have a precision over 18
      t.name IN ('decimal','numeric') 
      AND c.precision > 18)
  OR (
      -- Datetimeoffset cannot have a scale over 2
      t.name = 'datetimeoffset'
      AND c.scale > 2)
  OR (
      -- Varchar(max) and nvarchar(max)
      t.name IN ('varchar','nvarchar') 
      AND c.max_length = -1)
  OR (
      -- Filestream
      c.is_filestream = 1)
  OR (
      -- Sparse
      c.is_sparse = 1)
  )
  ORDER BY s.name,o.name,c.column_id
END
ELSE IF @ServerVersion = 12
-- This section is only for SQL Server 2014
BEGIN
  SELECT 
     s.name AS 'SchemaName'
    ,o.name AS 'TableName'
    ,c.name AS 'ColumnName'
    ,'ColumnType' = CASE t.name
      WHEN 'varchar' THEN
        CASE c.max_length
          WHEN -1 THEN 'varchar(max)'
          ELSE 'varchar(' + CONVERT(VARCHAR,c.max_length) + ')'
        END
      WHEN 'nvarchar' THEN 
        CASE c.max_length
          WHEN -1 THEN 'nvarchar(max)'
          ELSE 'nvarchar(' + CONVERT(VARCHAR,c.max_length / 2) + ')' -- max_length is in bytes; nvarchar uses 2 bytes per character
        END
      ELSE t.name
     END
    ,'ColumnAttribute' = CASE 
      WHEN (c.is_filestream = 1) THEN 'Filestream'
      WHEN (c.is_sparse = 1) THEN 'Sparse'
      ELSE ''
    END
  FROM sys.columns c 
  JOIN sys.objects o ON c.object_id = o.object_id
  JOIN sys.types t ON c.user_type_id = t.user_type_id
  JOIN sys.schemas s ON o.schema_id = s.schema_id
  WHERE o.is_ms_shipped <> 1
  -- These types cannot be part of a SQL Server 2014 columnstore index
  AND (
    t.name IN 
      ('ntext'
      ,'text'
      ,'image'
      ,'rowversion'
      ,'timestamp'
      ,'sql_variant'
      ,'hierarchyid'
      ,'geography'
      ,'geometry'
      ,'xml')
  OR (
      -- Varchar(max) and nvarchar(max)
      t.name IN ('varchar','nvarchar') 
      AND c.max_length = -1)
  OR (
      -- Filestream
      c.is_filestream = 1)
  OR (
      -- Sparse
      c.is_sparse = 1)
  )
  ORDER BY s.name,o.name,c.column_id
END
ELSE
BEGIN
  RAISERROR ('This script only works on SQL Server 2012 and SQL Server 2014.',16,1);
END
GO

Running this against your database will output an organized list of tables along with the column name and data type that cannot be used within a Columnstore index.
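If you want to see the script flag something before pointing it at a real database, one option is a small throwaway table that mixes allowed and restricted types. This is only a sketch: the table name dbo.ColumnstoreAnalyzerDemo and its columns are hypothetical and are not part of the script above.

```sql
-- Hypothetical demo table; the name and columns are assumptions for illustration only.
CREATE TABLE dbo.ColumnstoreAnalyzerDemo
(
     Id        INT              NOT NULL  -- not flagged by either branch
    ,RowGuid   UNIQUEIDENTIFIER NULL      -- in the script's 2012 list only
    ,Amount    DECIMAL(19,4)    NULL      -- precision > 18: checked by the 2012 branch only
    ,Notes     VARCHAR(MAX)     NULL      -- flagged by both branches
    ,Payload   XML              NULL      -- flagged by both branches
    ,SparseCol INT SPARSE       NULL      -- flagged by both branches (ColumnAttribute = 'Sparse')
);
GO

-- Run the analyzer script in the same database here, then clean up.
DROP TABLE dbo.ColumnstoreAnalyzerDemo;
GO
```

Going by the script's own WHERE clauses, the 2012 branch should list RowGuid, Amount, Notes, Payload, and SparseCol, while the 2014 branch should list only Notes, Payload, and SparseCol.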


The script can be used whether you plan to create a clustered or a non-clustered Columnstore index, since the data type restrictions apply to both. It can also be used to analyze databases in either SQL Server 2012 or 2014.

You can read more about the limitations and restrictions of Columnstore indexes in Books Online.

Tuesday, June 3, 2014

Columnstore Memory Grant Issue

In a previous post about non-clustered columnstore indexes, I mentioned the creation of an index is a very memory intensive operation. Sometimes the memory grant needed exceeds what is currently available on your server. So what do you do about it?

SQL Server requires a minimal amount of memory in order to create a columnstore index. This can be calculated as Memory Grant Request in MB = ((4.2 * number of columns in the columnstore index) + 68) * Degree of Parallelism + (number of string columns * 34). If there is not enough physical memory available to create the columnstore index, SQL Server will throw an error.
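As a quick way to plug numbers into that formula, here is a minimal sketch in T-SQL. The variable names are my own, and the sample inputs match the 26-column table with three string columns used later in this post, built on a 2-CPU server.

```sql
-- Inputs are assumptions for illustration; substitute your own counts.
DECLARE @ColumnCount       INT = 26;  -- columns in the columnstore index
DECLARE @StringColumnCount INT = 3;   -- how many of those are string columns
DECLARE @Dop               INT = 2;   -- degree of parallelism used for the build

SELECT ((4.2 * @ColumnCount) + 68) * @Dop
       + (@StringColumnCount * 34) AS EstimatedMemoryGrantMB;
GO
```

With these inputs the expression works out to 456.4, which is where the 456MB estimate below comes from. Treat it as a rough estimate only; the grant SQL Server actually requests can differ.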

The test server I’m using for the examples below has 2 CPUs and 4GB of memory.


The Max Degree of Parallelism is set to 0 and Max Server Memory is set to 4095MB.

EXEC sp_configure 'max degree of parallelism';
GO
EXEC sp_configure 'max server memory (MB)';
GO


I have enlarged a table, FactInternetSales, in the AdventureWorksDW2012 database using Kalen Delaney’s script. The table has 247 million rows and contains 26 columns; three of which are string columns. If we want to create a non-clustered columnstore index on the entire table, then using the formula above we could estimate a memory grant of 456MB would be needed to build the index.
To create the index we’ll use this query.

USE AdventureWorksDW2012;
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactInternetSalesBig ON dbo.FactInternetSalesBig
(
    ProductKey,
    OrderDateKey,
    DueDateKey,
    ShipDateKey,
    CustomerKey,
    PromotionKey,
    CurrencyKey,
    SalesTerritoryKey,
    SalesOrderNumber,
    SalesOrderLineNumber,
    RevisionNumber,
    OrderQuantity,
    UnitPrice,
    ExtendedAmount,
    UnitPriceDiscountPct,
    DiscountAmount,
    ProductStandardCost,
    TotalProductCost,
    SalesAmount,
    TaxAmt,
    Freight,
    CarrierTrackingNumber,
    CustomerPONumber,
    OrderDate,
    DueDate,
    ShipDate
);
GO

Once we execute that query, we can view the memory grant using this query.

SELECT
     dop
    ,requested_memory_kb
    ,required_memory_kb
    ,ideal_memory_kb
    ,granted_memory_kb
FROM sys.dm_exec_query_memory_grants
WHERE session_id = 55;
GO
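The session_id of 55 is specific to my test session. If you don't know which session is running the build, a variation like the following sketch can find it by statement text instead. The aliases and the LIKE pattern are my own choices, and the filter assumes no other running statement mentions a columnstore index.

```sql
-- Find memory grants for any running statement whose text mentions a columnstore index.
SELECT
     mg.session_id
    ,mg.dop
    ,mg.requested_memory_kb
    ,mg.required_memory_kb
    ,mg.ideal_memory_kb
    ,mg.granted_memory_kb
FROM sys.dm_exec_query_memory_grants mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) st
WHERE st.text LIKE '%COLUMNSTORE INDEX%'
    AND mg.session_id <> @@SPID;  -- exclude this monitoring query itself
GO
```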


The requested memory grant was actually 525MB, but it still wasn’t too far from our estimation. What would happen to the create index query if we change the max server memory to 1024MB? Let’s find out.

EXEC sp_configure 'max server memory (MB)',1024;
GO
RECONFIGURE;
GO

Oops, we got an error.

The statement has been terminated.
Msg 8658, Level 17, State 1, Line 1
Cannot start the columnstore index build because it requires at least 341424 KB, while the maximum memory grant is limited to 189696 KB per query in workload group 'default' (2) and resource pool 'default' (2). Retry after modifying columnstore index to contain fewer columns, or after increasing the maximum memory grant limit with Resource Governor.

This is telling us we need an absolute minimum of 341424 KB (roughly 333MB) of memory to create the columnstore index. That seems weird since we have 1024MB of memory configured for the server. Well, not really. The Resource Governor is always running in the background on every SQL Server, and by default every query executes inside the default workload group. That workload group will grant no more than 25% of available memory to any single query. This can be verified by running the following query.

SELECT
     name
    ,request_max_memory_grant_percent
FROM sys.dm_resource_governor_workload_groups
WHERE name = 'default';
GO
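To get a feel for what that percentage means in kilobytes, the workload group can be joined to its resource pool. This is a rough sketch: the ApproxMaxGrantKB column is my own approximation, and the limit SQL Server reports in error 8658 may not match it exactly.

```sql
-- Approximate the per-query grant ceiling for the default workload group.
SELECT
     wg.name AS WorkloadGroup
    ,rp.name AS ResourcePool
    ,wg.request_max_memory_grant_percent
    ,rp.max_memory_kb
    ,rp.max_memory_kb * wg.request_max_memory_grant_percent / 100 AS ApproxMaxGrantKB
FROM sys.dm_resource_governor_workload_groups wg
INNER JOIN sys.dm_resource_governor_resource_pools rp
    ON wg.pool_id = rp.pool_id
WHERE wg.name = 'default';
GO
```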


There are a couple of workarounds to this problem.

First, we can try to reduce the max degree of parallelism when creating the index. Using the formula above, a degree of parallelism of 1 would have an estimated memory grant of 279MB. That’s still more than our maximum allowable grant size, but let’s try it anyway. All we have to do is add the WITH (MAXDOP=1) option to the end of the create index statement.

USE AdventureWorksDW2012;
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactInternetSalesBig ON dbo.FactInternetSalesBig
(
    ProductKey,
    OrderDateKey,
    DueDateKey,
    ShipDateKey,
    CustomerKey,
    PromotionKey,
    CurrencyKey,
    SalesTerritoryKey,
    SalesOrderNumber,
    SalesOrderLineNumber,
    RevisionNumber,
    OrderQuantity,
    UnitPrice,
    ExtendedAmount,
    UnitPriceDiscountPct,
    DiscountAmount,
    ProductStandardCost,
    TotalProductCost,
    SalesAmount,
    TaxAmt,
    Freight,
    CarrierTrackingNumber,
    CustomerPONumber,
    OrderDate,
    DueDate,
    ShipDate
) WITH (MAXDOP=1);
GO

Oops, we got the same error again.

The statement has been terminated.
Msg 8658, Level 17, State 1, Line 1
Cannot start the columnstore index build because it requires at least 341424 KB, while the maximum memory grant is limited to 189696 KB per query in workload group 'default' (2) and resource pool 'default' (2). Retry after modifying columnstore index to contain fewer columns, or after increasing the maximum memory grant limit with Resource Governor.

Now it’s on to our next workaround: changing the Resource Governor settings. As I mentioned above, the Resource Governor limits each query to a memory grant of 25% of the total available. This can easily be changed by using the following query.

ALTER WORKLOAD GROUP [default] WITH (request_max_memory_grant_percent = 50);
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

The setting change is dynamic, but let’s double check it.

SELECT
     name
    ,request_max_memory_grant_percent
FROM sys.dm_resource_governor_workload_groups
WHERE name = 'default';
GO


Now let’s try to create the columnstore index again.

Ahh success! We can check the memory grant while it’s running.

SELECT
     dop
    ,requested_memory_kb
    ,required_memory_kb
    ,ideal_memory_kb
    ,granted_memory_kb
FROM sys.dm_exec_query_memory_grants
WHERE session_id = 55;
GO


Once the columnstore index has been created, you should change the default workload group back to its default of 25. That way it doesn’t adversely affect other queries running on your server.

ALTER WORKLOAD GROUP [default] WITH (request_max_memory_grant_percent = 25);
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO

Another workaround is to exclude some of the 26 columns from our create index statement. Using the formula above, if we removed the last six columns of the table (two of which are string columns) and kept MAXDOP at 1, we’d only need a 186MB memory grant. However, that may not be an option, because our user queries may need those columns to be part of the columnstore index to get the maximum performance for their queries.
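To put the estimates from this post side by side, the formula can be applied to each scenario in one query. This is a sketch under stated assumptions: the scenario labels are mine, and the 186MB figure only works out if the build runs at a degree of parallelism of 1 and only one string column remains after dropping the last six columns.

```sql
-- 189696 KB is the per-query limit quoted in the error message above.
DECLARE @LimitMB DECIMAL(10,1) = 189696 / 1024.0;

SELECT
     s.Scenario
    ,CAST(((4.2 * s.ColumnCount) + 68) * s.Dop
          + (s.StringColumnCount * 34) AS DECIMAL(10,1)) AS EstimatedMB
    ,@LimitMB AS LimitMB
FROM (VALUES
     ('All 26 columns, DOP 2',   26, 3, 2)
    ,('All 26 columns, DOP 1',   26, 3, 1)
    ,('First 20 columns, DOP 1', 20, 1, 1)
) AS s (Scenario, ColumnCount, StringColumnCount, Dop);
GO
```

The three estimates come out to 456.4, 279.2, and 186.0, matching the figures quoted in this post. Keep in mind they are estimates; the minimum SQL Server actually demanded in the error message was higher than the formula suggests.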

Finally, as a last workaround, you could add more memory to the server, but unless your server is a VM, that is not practical in most situations.

For more info on columnstore indexes, check out the Columnstore Index FAQ on Technet, Books Online, and my other blog posts.

Tuesday, May 20, 2014

Comparison of Columnstore Compression

SQL Server 2012 introduced non-clustered columnstore indexes, and SQL Server 2014 gave us clustered columnstore indexes. Both share the same technology for performance boosts, and they both share the same algorithms for compression. However, the compression will depend on the data you are storing.

SQL Server uses a mechanism of row groups, segments, encoding and compression to store the data for columnstore indexes. First the data is horizontally divided into row groups, where each group contains approximately 1 million rows. Next, the row groups are vertically partitioned into segments, where each column of the table is its own segment. Those segments are then encoded and compressed before being stored on disk. So if you create a columnstore index on a table that has five columns and 100,000 rows, it would result in one row group with five segments.

Compression of any type will definitely vary depending on the data that needs to be compressed. Columnstore indexes use RLE (run-length encoding) compression which works best when you have a lot of repetitive values. Just to see how this works, let’s compare the compression for data that is random to the compression for data that is highly redundant.

First, we’ll create the following tables that contain only a single column using the CHAR data type. Each table will store 1 million rows. One table will store random values for each row, and the other will store identical values for each row.

CREATE TABLE dbo.IdenticalCharacter(Col1 CHAR(50) NULL);
GO
CREATE TABLE dbo.RandomCharacter(Col1 CHAR(50) NULL);
GO

Now we'll populate the tables with artificial data.

DECLARE @Counter int = 0;
WHILE @Counter < 1000000
BEGIN
    INSERT INTO IdenticalCharacter VALUES (REPLICATE('a',50));
    SET @Counter = @Counter+1;
END
CHECKPOINT;
GO
DECLARE @Counter int = 0;
WHILE @Counter < 1000000
BEGIN
    INSERT INTO RandomCharacter
    SELECT CAST(NEWID() AS CHAR(50));
    SET @Counter = @Counter+1;
END
CHECKPOINT;
GO
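The row-by-row loops above are easy to follow but can take a while to finish. If you'd rather load the rows in a single statement, a set-based version along these lines should produce equivalent data. It is a sketch that assumes sys.all_objects holds at least 1,000 rows, so that the cross join yields at least a million; use it instead of the loops, not in addition to them.

```sql
-- Alternative to the WHILE loops above; run one approach or the other, not both.
INSERT INTO dbo.IdenticalCharacter (Col1)
SELECT TOP (1000000) REPLICATE('a',50)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;
GO

-- NEWID() is evaluated once per row, so each row gets a different value.
INSERT INTO dbo.RandomCharacter (Col1)
SELECT TOP (1000000) CAST(NEWID() AS CHAR(50))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;
GO
```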

Next, we’ll create a non-clustered columnstore index on each table.

CREATE NONCLUSTERED COLUMNSTORE INDEX csi1 ON dbo.IdenticalCharacter(Col1);
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX csi1 ON dbo.RandomCharacter(Col1);
GO
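Before adding up the sizes, it can help to look at the row groups and segments described earlier. The following sketch lists one row per segment for the two test tables; the column selection and ordering are my own choices.

```sql
-- One row per segment: which column it belongs to, how many rows it holds, and its size in bytes.
SELECT
     OBJECT_NAME(p.object_id) AS TableName
    ,css.column_id
    ,css.segment_id
    ,css.row_count
    ,css.on_disk_size
FROM sys.column_store_segments css
INNER JOIN sys.partitions p
    ON css.hobt_id = p.hobt_id
WHERE p.object_id IN (OBJECT_ID('dbo.IdenticalCharacter'), OBJECT_ID('dbo.RandomCharacter'))
ORDER BY TableName, css.column_id, css.segment_id;
GO
```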

Now we can calculate the size of each columnstore index using the script below. This is a modified version of the script from Books Online.

SELECT
     SchemaName
    ,TableName
    ,IndexName
    ,SUM(on_disk_size_MB) AS TotalSizeInMB
FROM
(
    SELECT
        OBJECT_SCHEMA_NAME(i.OBJECT_ID) AS SchemaName,
        OBJECT_NAME(i.OBJECT_ID ) AS TableName,
        i.name AS IndexName
        ,SUM(css.on_disk_size)/(1024.0*1024.0) AS on_disk_size_MB
    FROM sys.indexes i
    INNER JOIN sys.partitions p
        ON i.object_id = p.object_id
    INNER JOIN sys.column_store_segments css
        ON css.hobt_id = p.hobt_id
    WHERE i.type_desc = 'NONCLUSTERED COLUMNSTORE'
        AND i.index_id NOT IN (0,1)
        AND p.index_id NOT IN (0,1)
    GROUP BY OBJECT_SCHEMA_NAME(i.OBJECT_ID) ,OBJECT_NAME(i.OBJECT_ID ),i.name

    UNION ALL

    SELECT
         OBJECT_SCHEMA_NAME(i.OBJECT_ID) AS SchemaName
        ,OBJECT_NAME(i.OBJECT_ID ) AS TableName
        ,i.name AS IndexName
        ,SUM(csd.on_disk_size)/(1024.0*1024.0) AS on_disk_size_MB
    FROM sys.indexes i
    INNER JOIN sys.partitions p
        ON i.object_id = p.object_id
    INNER JOIN sys.column_store_dictionaries csd
        ON csd.hobt_id = p.hobt_id
    WHERE i.type_desc = 'NONCLUSTERED COLUMNSTORE'
        AND i.index_id NOT IN (0,1)
        AND p.index_id NOT IN (0,1)
    GROUP BY OBJECT_SCHEMA_NAME(i.OBJECT_ID) ,OBJECT_NAME(i.OBJECT_ID ),i.name
) AS SegmentsPlusDictionary
GROUP BY SchemaName,TableName,IndexName
ORDER BY SchemaName,TableName,IndexName;
GO


As you can see, the random data values did not compress nearly as much. 30MB of the total 41MB was just for the dictionaries, while only 11MB was needed for the segments. When you graph the data in Excel, it makes it a little bit easier to see the difference in size.


We can also compare a few other data types (integer and decimal), and we’ll see similar differences, but just not as drastic.

CREATE TABLE dbo.IdenticalInteger(Col1 INT NULL);
GO
CREATE TABLE dbo.RandomInteger(Col1 INT NULL);
GO
CREATE TABLE dbo.IdenticalDecimal(Col1 DECIMAL(18,8) NULL);
GO
CREATE TABLE dbo.RandomDecimal(Col1 DECIMAL(18,8) NULL);
GO
DECLARE @Counter int = 0;
WHILE @Counter < 1000000
BEGIN
    INSERT INTO IdenticalInteger VALUES (1);
    SET @Counter = @Counter+1;
END
CHECKPOINT;
GO
DECLARE @Counter int = 0;
WHILE @Counter < 1000000
BEGIN
    INSERT INTO RandomInteger VALUES (@Counter);
    SET @Counter = @Counter+1;
END
CHECKPOINT;
GO
DECLARE @Counter int = 0;
WHILE @Counter < 1000000
BEGIN
    INSERT INTO IdenticalDecimal VALUES (1.0);
    SET @Counter = @Counter+1;
END
CHECKPOINT;
GO
DECLARE @Counter int = 0;
WHILE @Counter < 1000000
BEGIN
    INSERT INTO RandomDecimal VALUES (CAST((RAND() * @Counter) AS DECIMAL(18,8)));
    SET @Counter = @Counter+1;
END
CHECKPOINT;
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX csi1 ON dbo.IdenticalInteger(Col1);
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX csi1 ON dbo.RandomInteger(Col1);
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX csi1 ON dbo.IdenticalDecimal(Col1);
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX csi1 ON dbo.RandomDecimal(Col1);
GO

Now run the same query from above to see the size difference of the indexes.


Here we see similar results as before; the highly redundant data gets compressed far more than the random data. Again, graphing the data makes it a little easier to visualize the difference.


The compression algorithm used for columnstore indexes is different from the row and page data compression originally released in SQL Server 2008; however, it’s interesting to compare the differences.

For this example, we’ll use actual data from a trading application. The fact table contains over 106 million rows, 26 columns which are mostly float data types, and is just over 21GB in size. Let’s compare the various types of compression: row, page, non-clustered columnstore index for SQL Server 2012, non-clustered columnstore index for SQL Server 2014, clustered columnstore index for SQL Server 2014, and lastly a clustered columnstore index with archival compression for SQL Server 2014.
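If you want a rough idea of what row and page compression would do to one of your own tables before rebuilding anything, sp_estimate_data_compression_savings can help. This is a sketch: dbo.FactTrades is a hypothetical table name, since the real table is not named here, and as far as I know the procedure in these versions only estimates NONE, ROW, and PAGE, not columnstore compression.

```sql
-- dbo.FactTrades is a hypothetical table name used for illustration only.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo'
    ,@object_name      = 'FactTrades'
    ,@index_id         = NULL      -- all indexes on the table
    ,@partition_number = NULL      -- all partitions
    ,@data_compression = 'ROW';
GO
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo'
    ,@object_name      = 'FactTrades'
    ,@index_id         = NULL
    ,@partition_number = NULL
    ,@data_compression = 'PAGE';
GO
```

For the columnstore variants, the only reliable measurement is to build the index and check its size with the script shown earlier.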


The graph above shows row and page compression already have compressed the data significantly, but the various columnstore indexes compress it even further. There are a couple of things to keep in mind. First, the non-clustered columnstore indexes are a separate copy from the base table, so the 7199MB 2012 columnstore index would be in addition to the 21GB of raw data for a total of nearly 29GB. The two clustered columnstore indexes only need 3.4GB and 2.6GB to store the entire dataset. That’s a 6X and 8X compression difference.

You may wonder why the 2012 non-clustered columnstore index is larger (7199MB) than just using ordinary row or page compression, but there really is nothing wrong. It’s just a good example showing the differences in the compression algorithms. Although this example doesn’t show it, there are some cases where the columnstore compressed data could be larger than the actual data.

As you can see, columnstore indexes have the potential to highly compress your data, but it really depends on the type of data you’re storing and the redundancy of the data values. The only way to know for sure is to test it with your own data.

For more info on columnstore indexes, check out the Columnstore Index FAQ on Technet, Books Online, and my other blog posts.

Tuesday, April 29, 2014

What is a Non-Clustered Columnstore Index?

First introduced in SQL Server 2012, the Columnstore index is a new in-memory feature that allows for the creation of indexes that are stored in a column-wise fashion. It is targeted for data warehouses and can improve query performance by 10 to 100x. A columnstore index stores data in a different type of index page as well as heavily compressing the data. It also introduces a new batch execution mode that greatly increases processing speed and reduces CPU utilization. The combination of the new storage type, the compressed data, and batch mode processing allows SQL Server to read less data and greatly improve query performance.

So how does this new index differ from traditional indexes? In all versions prior to 2012, SQL Server uses a B-tree structure (see below) for all indexes, including both clustered and non-clustered.


At the leaf level of the B-tree indexes, SQL Server stores the indexed data in a row-wise fashion. So a single index page will store all data from all columns for one or more rows. For example, if we create an index with three columns (C1, C2, and C3) for dbo.Table1, the internal structure would look something like this.


Index page 1 would store columns C1, C2, and C3 for rows R1, R2 and R3. Index page 2 would store columns C1, C2, and C3 for rows R4, R5 and R6, and so on and so on. This is referred to as a rowstore, because it stores the data in a row-wise fashion.

If we were to issue a simple query to select column C1 from Table1, then SQL Server would need to read all three index pages to get the values from all 9 rows. Each read would be a logical I/O, so a total of 3 logical I/O’s would be needed.

SELECT C1 FROM dbo.Table1;
GO


Now if we create a columnstore index on the same data, the logical storage of that data would look something like this.


Index page 1 would store column C1 for rows R1 through R9. Index page 2 would store column C2 for rows R1 through R9, and so on and so on. This is referred to as a columnstore index, because it stores the data in a column-wise fashion.

If we issued the same query, SQL Server could use the new columnstore index and would only need to read index page 1 to get all the values of column C1 for rows R1 through R9. This results in 1 logical I/O since only a single index page needs to be read.


For this simple query we have a 3x performance improvement. Now 1 I/O vs 3 I/O’s doesn’t really matter, but remember this feature is targeted at data warehouses that could have tables containing millions or even billions of rows. Once you use columnstore indexes on tables that large, the performance gains in I/O are much more noticeable.

The second advantage of using columnstore indexes is that the data stored within the index is heavily compressed. This compression algorithm is different from row or page compression or even backup compression, and it usually works best for character or numeric data that has a lot of repeating values. Having the data compressed means SQL Server needs to read fewer pages to return all of the data. Going back to our Table1 example from above, if the indexes were storing a lot more data, then the layout would really look more like this.


Because of that highly compressed data, the columnstore index might only need a few dozen index pages to store the same amount of data that would normally take a rowstore index several thousand.

The third way columnstore indexes help improve performance is through the use of batch mode processing. This new execution mode allows the CPU to process approximately 1000 rows at a time. It is more efficient for the CPU to process a batch of data instead of one row at a time. The result is faster query processing and lower CPU utilization. In the example query plan below, you can see SQL Server issued 581725 batches to process the ~247 million rows. That's an average of 425 rows processed per batch.
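If you want to check for batch mode yourself, one way is to capture the actual plan as XML and inspect the operators. This sketch assumes the columnstore index on dbo.FactInternetSales already exists and reuses the aggregate query from later in this post.

```sql
-- Return the actual execution plan as XML alongside the query results.
SET STATISTICS XML ON;
GO
SELECT
   dc.EnglishEducation AS EducationLevel
  ,COUNT(*) AS OrderQuantity
  ,CAST(SUM(fis.SalesAmount) AS DECIMAL(15,2)) AS SalesAmount
FROM dbo.FactInternetSales fis
  INNER JOIN dbo.DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
GROUP BY dc.EnglishEducation
ORDER BY dc.EnglishEducation;
GO
SET STATISTICS XML OFF;
GO
```

In the returned plan, operators that ran in batch mode should show an actual execution mode of Batch (the ActualExecutionMode attribute in the XML), along with the number of batches processed.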


Creating a columnstore index is as easy as creating any other non-clustered index. Just specify the name of the index, the table, and which columns are to be included. The best practice from Microsoft suggests adding all columns of the base table into the columnstore index. See the example below.

CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactInternetSales
  ON dbo.FactInternetSales
(
  ProductKey,
  OrderDateKey,
  DueDateKey,
  ShipDateKey,
  CustomerKey,
  PromotionKey,
  CurrencyKey,
  SalesTerritoryKey,
  SalesOrderNumber,
  SalesOrderLineNumber,
  RevisionNumber,
  OrderQuantity,
  UnitPrice,
  ExtendedAmount,
  UnitPriceDiscountPct,
  DiscountAmount,
  ProductStandardCost,
  TotalProductCost,
  SalesAmount,
  TaxAmt,
  Freight,
  CarrierTrackingNumber,
  CustomerPONumber,
  OrderDate,
  DueDate,
  ShipDate
);
GO

All of this is really great news for SQL Server lovers; however, as with all new features, there are usually limitations and columnstore indexes are no exception. Below is a list of some of those restrictions.
  1. The base table is READ-ONLY.
  2. The columnstore index cannot be altered; only dropped and recreated.
  3. The columnstore index cannot be created as a clustered index.
  4. Certain data types cannot be used.
  5. Cannot include a sparse column.
  6. Cannot include a column with Filestream data.
  7. The index cannot act as a primary or foreign key.
  8. Cannot be combined with replication, change tracking, or change data capture.
  9. Uses the buffer pool, so columnstore index pages can be flushed from memory.
  10. Enterprise Edition only
I’m sure the first item on the list got your attention. After a columnstore index is created, the base table and all of its data are in a read-only state. Any attempt to change data within the base table (dbo.Table1) will result in an error.

INSERT dbo.Table1 VALUES ('Oscar', 'The new DBA', 'Charlotte, NC');
GO

Msg 35330, Level 15, State 1, Line 1
INSERT statement failed because data cannot be updated in a table with a columnstore index. Consider disabling the columnstore index before issuing the INSERT statement, then rebuilding the columnstore index after INSERT is complete.

This may seem like a huge issue, but in reality it’s not that much of a problem. Keep in mind this feature is targeted at data warehouses that modify data infrequently. Additionally, there are several workarounds for updating or changing data; for example, dropping the columnstore index, making the change, and then recreating the columnstore index. I’ll cover these workarounds in more detail in a future blog post.
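The error message itself hints at a closely related workaround: disable the index, change the data, then rebuild. A minimal sketch, assuming the columnstore index on dbo.Table1 is named ColumnStoreIndex:

```sql
-- Disable the columnstore index so the base table becomes writable again.
ALTER INDEX ColumnStoreIndex ON dbo.Table1 DISABLE;
GO
INSERT dbo.Table1 VALUES ('Oscar', 'The new DBA', 'Charlotte, NC');
GO
-- Rebuilding re-enables the index and recompresses all of the data.
ALTER INDEX ColumnStoreIndex ON dbo.Table1 REBUILD;
GO
```

Keep in mind the rebuild processes the whole table, so on a large fact table this is better suited to batched loads than to frequent small changes.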

In my opinion, the limitation of the data types is a bigger issue than the data being in a read-only state; that’s because there is no workaround. In essence, this will affect the design of your database. All of these data types are not allowed for use within a columnstore index.
  1. binary and varbinary
  2. ntext , text, and image
  3. varchar(max) and nvarchar(max)
  4. uniqueidentifier
  5. rowversion (and timestamp)
  6. sql_variant
  7. decimal (and numeric) with precision greater than 18 digits
  8. datetimeoffset with scale greater than 2
  9. CLR types (hierarchyid and spatial types)
  10. xml
If you have an existing data warehouse where you want to use this feature and the tables are littered with uniqueidentifier or XML data types, then your only choice is to redesign the database. That’s usually the last thing a development team wants to hear just so they can implement a new database feature. If you try to add one of these columns to a columnstore index anyway, SQL Server will throw an error about the restricted data type.

Msg 35343, Level 16, State 1, Line 1
CREATE INDEX statement failed. Column 'IDNumber' has a data type that cannot participate in a columnstore index. Omit column 'IDNumber'.

You can still create a columnstore index on the table by excluding the restricted column; however, if queries need data from that column then it could be much more expensive to use the columnstore index than a traditional rowstore index. The following example excluded the IDNumber column from the columnstore index, so the query processor needed to do a row lookup to get that missing data.

SELECT * FROM dbo.Table1;
SELECT * FROM dbo.Table1 WITH (INDEX(ColumnStoreIndex));
GO


As you can see, the forced columnstore plan did not fare well against the table scan. Its query cost is nearly twice as high, and the table scan needed only 1 logical read compared to 38 for the columnstore index.

Another issue you might run into is the amount of available physical memory on the server. SQL Server requires a minimal amount of memory in order to create a columnstore index. This can be calculated as Memory Grant Request in MB = ((4.2 * number of columns in the columnstore index) + 68) * Degree of Parallelism + (number of string columns * 34). If there is not enough physical memory available to create the columnstore index, SQL Server will throw an error.

The statement has been terminated.
Msg 8658, Level 17, State 1, Line 2
Cannot start the columnstore index build because it requires at least 345520 KB, while the maximum memory grant is limited to 84008 KB per query in workload group 'default' (2) and resource pool 'default' (2). Retry after modifying columnstore index to contain fewer columns, or after increasing the maximum memory grant limit with Resource Governor.

Microsoft did an excellent job of creating a very detailed error message that even includes a few suggestions to avoid this error. One of them is to alter the Resource Governor to allow larger memory grants. However, if that is not an option, then you can use the MAXDOP option to reduce the degree of parallelism when creating the columnstore index. Reducing or even removing parallelism reduces the memory grant requirements. I’ll cover these memory workarounds in more detail in a future blog post.

UPDATE: Read my post on Columnstore Memory Grant Issue to learn how to avoid this issue.

Let’s look at some examples that involve a lot more data.

Using the FactInternetSales from the AdventureWorksDW2012 database, I have expanded it to nearly 250 million rows. You can get the code to expand the table from Kalen Delaney's blog. We can run the following query to determine the order quantity and total sales grouped by education level.

SELECT
   dc.EnglishEducation AS EducationLevel
  ,COUNT(*) AS OrderQuantity
  ,CAST(SUM(fis.SalesAmount) AS DECIMAL(15,2)) AS SalesAmount
FROM dbo.FactInternetSales fis
  INNER JOIN dbo.DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
GROUP BY dc.EnglishEducation
ORDER BY dc.EnglishEducation
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);
GO

This query uses the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX hint, which tells the query optimizer to ignore any columnstore index that might be on this table. In order for SQL Server to process this query, it will need the value of every row for the CustomerKey and SalesAmount columns from FactInternetSales. The traditional rowstore index will require SQL Server to read every page from the table to get those values. What we get is a query that requires 5934694 logical reads and over 28 minutes of runtime to complete. Run the same query again, but this time remove the hint and allow it to use the columnstore index.

SELECT
   dc.EnglishEducation AS EducationLevel
  ,COUNT(*) AS OrderQuantity
  ,CAST(SUM(fis.SalesAmount) AS DECIMAL(15,2)) AS SalesAmount
FROM dbo.FactInternetSales fis
  INNER JOIN dbo.DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
GROUP BY dc.EnglishEducation
ORDER BY dc.EnglishEducation;
GO

What you’ll notice is an almost instantaneous response from SQL Server. SQL Server only needed to do 6822 logical reads from FactInternetSales, and the runtime was reduced to about 4 seconds. That’s over 800x fewer IO’s and about 420x faster runtime.
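If you'd like to reproduce these measurements on your own system, the logical reads and elapsed time can be captured with the STATISTICS IO and STATISTICS TIME session settings. A minimal sketch wrapped around the same query; your numbers will depend on your hardware and data.

```sql
-- Report logical reads per table and CPU/elapsed time on the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT
   dc.EnglishEducation AS EducationLevel
  ,COUNT(*) AS OrderQuantity
  ,CAST(SUM(fis.SalesAmount) AS DECIMAL(15,2)) AS SalesAmount
FROM dbo.FactInternetSales fis
  INNER JOIN dbo.DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
GROUP BY dc.EnglishEducation
ORDER BY dc.EnglishEducation;
GO
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```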

Workloads that do a lot of table or index scans, such as the example above, will benefit most from using a columnstore index. However, singleton lookups will not perform as well. We can use the same query but this time we'll specify order numbers starting with a certain value. We'll also use an index hint WITH (INDEX(csi_FactInternetSales)) to force the query optimizer to use the columnstore index.

SELECT
   dc.EnglishEducation AS EducationLevel
  ,COUNT(*) AS OrderQuantity
  ,CAST(SUM(fis.SalesAmount) AS DECIMAL(15,2)) AS SalesAmount
FROM dbo.FactInternetSales fis WITH (INDEX(csi_FactInternetSales))
  INNER JOIN dbo.DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
WHERE fis.SalesOrderNumber LIKE 'SO437%'
GROUP BY dc.EnglishEducation
ORDER BY dc.EnglishEducation;
GO

Forcing the query optimizer to use the columnstore index was a bad idea in this case. The columnstore index was used, but it had to scan the entire set of data to look for values starting with 'SO437'. This resulted in over 1.7 million logical reads and about 68 seconds before returning the results. Now remove the index hint and run the query again.

SELECT
   dc.EnglishEducation AS EducationLevel
  ,COUNT(*) AS OrderQuantity
  ,CAST(SUM(fis.SalesAmount) AS DECIMAL(15,2)) AS SalesAmount
FROM dbo.FactInternetSales fis
  INNER JOIN dbo.DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
WHERE fis.SalesOrderNumber LIKE 'SO437%'
GROUP BY dc.EnglishEducation
ORDER BY dc.EnglishEducation;
GO

We still had to do 11254 logical reads on the rowstore index for the table, but we were able to get the results in less than a second.
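To see which indexes the optimizer had to choose from in these tests, you can list them from the catalog views. This is a small sketch that assumes the table lives in the dbo schema, as in the queries above; the used page count gives a rough idea of how much each index would cost to scan.

```sql
-- List every index on the fact table with its type and size in pages.
SELECT
   i.name AS IndexName
  ,i.type_desc AS IndexType
  ,SUM(ps.used_page_count) AS UsedPages
FROM sys.indexes i
  INNER JOIN sys.dm_db_partition_stats ps
    ON ps.object_id = i.object_id
   AND ps.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('dbo.FactInternetSales')
GROUP BY i.index_id, i.name, i.type_desc
ORDER BY i.index_id;
GO
```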

As you can see, if your workload meets the requirements, then the benefits of using columnstore indexes far outweigh their disadvantages. I highly recommend lots of testing if you intend to implement columnstore indexes. It takes careful planning for loading or modifying data, designing your tables to account for the data type restrictions, and knowing how to write your queries to get the most performance from the columnstore indexes.

For more info on columnstore indexes, check out the Columnstore Index FAQ on TechNet, Books Online, and my other blog posts.

Tuesday, April 8, 2014

The system_health Extended Event Session

When I first started poking around in SQL Server 2012, I noticed an extended event session called “system_health” was created by default. It took me a few months before I really dug into the session details to see what it was capturing. But once I did, I was pretty amazed.

The system_health session starts up by default and collects various performance points that can be used to help troubleshoot issues on the server. To look at the details of it, just right-click on the event session and choose Properties. This will open the Extended Events UI. Select the Events page, and from there you will see each of the events that will be captured. You can see more of the details of each event by clicking the Configure button.

For example, select the error_reported event and then click Configure. This will show you the details of which columns will be captured for the event and whether there are any predicates defined.


On the Global Fields tab, you’ll see the actions being performed by the session. These are additional collection items that are captured when the event fires. In this case, the session is getting the callstack, database_id, session_id, sql_text, and the tsql_stack fields.
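The same list of actions can be pulled from the catalog views instead of the UI. This is a minimal sketch, assuming the session still has its default name of system_health.

```sql
-- Show the actions (global fields) attached to the error_reported event.
SELECT
   e.name AS EventName
  ,a.package AS ActionPackage
  ,a.name AS ActionName
FROM sys.server_event_sessions s
  INNER JOIN sys.server_event_session_events e
    ON e.event_session_id = s.event_session_id
  INNER JOIN sys.server_event_session_actions a
    ON a.event_session_id = e.event_session_id
   AND a.event_id = e.event_id
WHERE s.name = 'system_health'
  AND e.name = 'error_reported'
ORDER BY a.name;
GO
```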


On the filter tab, you’ll see each of the predicates that are defined. This event will fire for error numbers: 17803, 701, 802, 8645, 8651, 8657 and 8902, as well as for any severity greater than or equal to 20. Most of these errors are related to out of memory issues, but the severity 20+ errors are all critical system issues.
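If you would rather read the filter as text, the predicate is stored with the event definition. This is a small sketch against the catalog views, again assuming the default session name.

```sql
-- Return the predicate (filter) defined for the error_reported event.
SELECT
   e.name AS EventName
  ,e.predicate AS EventPredicate
FROM sys.server_event_sessions s
  INNER JOIN sys.server_event_session_events e
    ON e.event_session_id = s.event_session_id
WHERE s.name = 'system_health'
  AND e.name = 'error_reported';
GO
```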


On the Event Fields tab, you’ll see each of the fields that are automatically returned by the selected event, such as the error_number, severity, etc. The fields on this list will change based on which event is selected, because each event is designed to capture different information.
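The event fields are also exposed through a DMV, which is handy when you want to compare several events at once. This is a sketch for the error_reported event; swap in another event name to see its fields.

```sql
-- List the data fields that the error_reported event returns by default.
SELECT
   c.name AS FieldName
  ,c.type_name AS DataType
  ,c.description AS FieldDescription
FROM sys.dm_xe_object_columns c
WHERE c.object_name = 'error_reported'
  AND c.column_type = 'data'
ORDER BY c.column_id;
GO
```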


Now click on the Data Storage page. From here you will see each of the targets defined for the session. Targets are nothing more than places the session should store the data it collects. The system_health session has two targets defined: the ring_buffer and an event_file. The ring_buffer target is an in-memory target designed to only hold the collected data for a short period of time. The event_file target is a file located on disk that holds the collected data.
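Because the ring_buffer lives in memory, its contents can be read straight from the DMVs while the session is running. This is a minimal sketch that returns the whole buffer as one XML document, which you can click on in Management Studio to browse.

```sql
-- Read the current contents of the in-memory ring_buffer target.
SELECT
   CAST(t.target_data AS XML) AS RingBufferData
FROM sys.dm_xe_sessions s
  INNER JOIN sys.dm_xe_session_targets t
    ON t.event_session_address = s.address
WHERE s.name = 'system_health'
  AND t.target_name = 'ring_buffer';
GO
```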


The event_file will be used to retain the collected data for a longer period of time; however, it’s not indefinite. The default file size is only 5MB and the session only keeps 4 rollover files. This means the files could be overwritten within a few short days.
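The target settings can be checked with a query as well. This is a sketch against the catalog views; as far as I know, sys.server_event_session_fields only lists options that were explicitly set in the session definition, so a target with no rows in that view is running on its defaults.

```sql
-- Show each target for the session along with any explicitly set options.
SELECT
   t.name AS TargetName
  ,f.name AS SettingName
  ,f.value AS SettingValue
FROM sys.server_event_sessions s
  INNER JOIN sys.server_event_session_targets t
    ON t.event_session_id = s.event_session_id
  LEFT JOIN sys.server_event_session_fields f
    ON f.event_session_id = t.event_session_id
   AND f.object_id = t.target_id
WHERE s.name = 'system_health'
ORDER BY t.name, f.name;
GO
```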

Click on the Advanced page. This section defines some of the global properties of the event session. For example, the event retention mode is set to “Single event loss”. This means if the event buffers fill up before SQL Server can write the events to the targets, it will drop a single event instead of causing significant performance degradation on the server. You can also define the maximum dispatch latency, which forces events to be flushed from memory to the targets at a defined interval. The maximum memory size defines how much memory is used to store the events before flushing them to the targets. Keep in mind that these two properties work in tandem, meaning the events will be flushed to the targets when either the max dispatch latency or the max memory size is reached, whichever comes first.
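These session-level properties are stored in sys.server_event_sessions, so they can be checked without opening the UI. In this sketch, max_dispatch_latency is reported in milliseconds and max_memory in kilobytes.

```sql
-- Show the global properties of the system_health session.
SELECT
   s.name AS SessionName
  ,s.event_retention_mode_desc AS EventRetentionMode
  ,s.max_dispatch_latency AS MaxDispatchLatencyMs
  ,s.max_memory AS MaxMemoryKB
  ,s.startup_state AS StartsWithServer
FROM sys.server_event_sessions s
WHERE s.name = 'system_health';
GO
```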


Now that we’ve covered what is being monitored, let’s look at how we can use the collected data.
Let’s say we encounter some data corruption within a database. From the error log, we can see the error message, database ID, and the file and page numbers.

SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x419b8f57; actual: 0x419bea57). It occurred during a read of page (1:230) in database ID 9 at offset 0x000000001cc000 in file 'D:\MSSQL11.TEST\MSSQL\DATA\CorruptDB.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

To view the captured event session data, we can either right-click on the session and click “Watch Live Data” or we can open the event file, which is located in the Log folder under the install directory of the instance. The event_file target will always append _0_ and then some long number to the end of each file name. The _0_ has no significance, but the number following it is the number of milliseconds since January 1, 1600. This is done to help guarantee filename uniqueness. Once you have opened the file, scroll down to the timestamp of the error and you should see an event called error_reported. Select that event and then you’ll see the event details in the lower window.
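The event file can also be read with T-SQL by using sys.fn_xe_file_target_read_file. This is a sketch under a couple of assumptions: the relative file pattern 'system_health*.xel' is expected to resolve against the instance's Log folder, so replace it with the full path if it returns nothing on your server, and the column aliases are just names I picked for illustration.

```sql
-- Read error_reported events from the system_health event files
-- and shred a few fields and actions out of the event XML.
WITH HealthEvents AS
(
  SELECT CAST(f.event_data AS XML) AS EventXml
  FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL) f
  WHERE f.object_name = 'error_reported'
)
SELECT
   EventXml.value('(/event/@timestamp)[1]', 'DATETIME2') AS EventTimeUtc
  ,EventXml.value('(/event/data[@name="error_number"]/value)[1]', 'INT') AS ErrorNumber
  ,EventXml.value('(/event/data[@name="severity"]/value)[1]', 'INT') AS Severity
  ,EventXml.value('(/event/data[@name="message"]/value)[1]', 'NVARCHAR(MAX)') AS ErrorMessage
  ,EventXml.value('(/event/action[@name="session_id"]/value)[1]', 'INT') AS SessionId
  ,EventXml.value('(/event/action[@name="sql_text"]/value)[1]', 'NVARCHAR(MAX)') AS SqlText
FROM HealthEvents
ORDER BY EventTimeUtc;
GO
```

The event timestamp is stored in UTC, so keep that in mind when matching it against the local time shown in the error log.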


One of the big advantages you’ll notice is the additional data being captured, such as the session_id, sql_text, and tsql_stack. These are some of the actions that were defined (above) for the session so that SQL Server collected this additional data at the time of the error. While this example may not require you to know the TSQL that was running, this extra info will be very helpful if you run into one of the out-of-memory errors.

As exciting as the system_health session is, there are a couple of issues I’d like to point out surrounding this session. Anyone with permissions can easily alter or delete the system_health session, whether by mistake or on purpose. If this does happen, the entire session can be recreated using the u_tables.sql script. The script is located in the Install folder, which is in the default instance folder for SQL Server.


Just open it and scroll down to the bottom of the script and you will see the CREATE EVENT SESSION. Run that section and it will recreate the system_health event session. Reading through that script is also a good way of seeing what each event will capture. The Microsoft product team did a really good job of documenting this script.
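Before running the script, it is worth checking whether the session is actually missing or has just been stopped. This is a quick sketch: no rows back means the session was dropped and needs to be recreated, while a row with a state of Stopped only needs to be started again.

```sql
-- Check whether system_health still exists and whether it is running.
SELECT
   s.name AS SessionName
  ,s.startup_state AS StartsWithServer
  ,CASE WHEN r.name IS NULL THEN 'Stopped' ELSE 'Running' END AS CurrentState
FROM sys.server_event_sessions s
  LEFT JOIN sys.dm_xe_sessions r
    ON r.name = s.name
WHERE s.name = 'system_health';
GO

-- If the session exists but is stopped, uncomment the next line to start it.
-- ALTER EVENT SESSION [system_health] ON SERVER STATE = START;
```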

The other issue is the settings for the event_file target. The max file size is only 5MB and the session only keeps 4 rollover files. In order for the files to retain data for a longer period of time, I would suggest changing the max file size to something like 25MB and keeping 10 rollover files.
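One way to make that change with T-SQL is to drop the event_file target and add it back with the larger values. Treat this as a sketch rather than the definitive method: the filename of system_health.xel is my assumption about the default definition, so confirm it against the CREATE EVENT SESSION statement in u_tables.sql, and test it on a non-production instance first.

```sql
-- Remove the existing event_file target from the session.
ALTER EVENT SESSION [system_health] ON SERVER
  DROP TARGET package0.event_file;
GO

-- Add it back with a 25MB file size and 10 rollover files.
-- The filename below is assumed to match the default definition.
ALTER EVENT SESSION [system_health] ON SERVER
  ADD TARGET package0.event_file
  (
    SET filename = N'system_health.xel'
       ,max_file_size = (25)
       ,max_rollover_files = (10)
  );
GO
```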


If you’re looking for more info on extended events, be sure to check out Jonathan Kehayias’s blog series “An XEvent a Day”. It’s packed full of goodies about extended events.