Friday, September 17, 2010

Unicode Data Compression In SQL Server 2008 R2

SQL Server 2008 R2 was released a few months back and one of the features I found interesting was its ability to compress Unicode data. In this post, I will be introducing the various compression options available in SQL Server and towards the end I will emphasize a sample analysis used to estimate the efficiency of Unicode Data Compression and the Compression-Ratio improvements of SQL Server 2008 R2 over SQL Server 2008.

In my next post, I will emphasize on the actual algorithm used by SQL Server to achieve this compression and will provide the Java and .NET implementations of the algorithm.

Compression Techniques in SQL Server

In computer science, compression is the process of encoding information in fewer bits than an un-encoded representation would use. The compression techniques available in SQL Server can be broadly categorized into two types depending on the way they are architected - Data Compression and Backup Compression.

Data compression occurs at runtime, the data is stored in a compressed form to reduce the disk-space occupied by a database. On the other hand, backup compression occurs only at the time of a backup and uses a proprietary compression technique. Backup compression can be used on a database that has already undergone data compression, but the savings might not be significant.

Data compression is again of two types - Row level Data Compression and Page level Data Compression. Row level compression primarily turns fixed-length data-types into variable data-types, thereby saving space. It also ignores zero and null values saving additional space. Because of this, more number of rows can be accommodated in a single data page. Page level compression initially performs Row Level compression and adds two additional compression features - Prefix and Dictionary Compression. As evident, page level compression offers better space saving than row level compression.

Though compression can provide significant space saving, it can also cause severe performance issues if misused. For further reading on compression, refer "An Introduction to Data Compression in SQL Server 2008".

As the name suggests, Unicode Data Compression comes under Data Compression, and to be more specific it's a part of Row-level Compression.

Sample Analysis

Microsoft had promising stats on the Unicode Data Compression in SQL Server 2008 R2, going up to a 50% space savings on a few character sets like Hindi, German, etc. So I decided to give it a try myself.

Being from India, I decided to test the compression ratios for Hindi text. I created a randomizer in C# (.NET) to generate random text from a few Hindi phrases obtained from Linguanaut. The program generates 1.5 million random Hindi strings and writes them into a temporary file which is Bulk Inserted into a table.

To check the improvement of SQL Server 2008 R2 over SQL Server 2008 in terms of Data Compression, two separate instances of SQL Server were established on the same system configuration (Intel Core 2 Quad and 4 GB RAM). Both the instances had the same schemas for the databases and the tables. The Randomizer and the Schemas + Bulk Insert scripts are attached below.

A major drawback of Unicode Data Compression in SQL Server 2008 R2 was that it couldn't be applied on columns of the data type NTEXT and NVARCHAR(MAX) and to highlight this we used two different tables, one using NTEXT and another using NVARCHAR(250).

Here is a quick reference table of the compression-ratios obtained in the analysis -

SQL Server 2008
SQL Server 2008 R2

From the above table, we can observe the compression-ratio for Unicode Data in SQL Server 2008 R2 is around 57% (nearly the space saving mentioned by Microsoft). However in all the other cases, we can observe that the saving savings is almost negligible. For space savings of other character sets refer "Unicode Compression (MSDN)".

Get the Visual Studio Solution of the Randomizer and the Database Scripts here.

SQL Server 2008 R2 Screen Shots

SQL Server 2008 Screen Shots