Sunday, September 26, 2010

Standard Compression Scheme For Unicode (SCSU) - Java & .NET Implementations

In the previous article, I introduced the compression techniques available in SQL Server and highlighted the Unicode Data Compression feature of SQL Server 2008 R2. This post will cover the algorithm used by SQL Server for Unicode compression.

The Java and .NET (C#) implementations of the algorithm have been attached to the post. They have been built as Class Libraries to support reusability.

Standard Compression Scheme for Unicode

As evident from the post title, the algorithm used in Unicode Data Compression is the Standard Compression Scheme for Unicode (SCSU). SCSU is a technical standard for reducing the number of bytes needed to represent Unicode text. It encodes a sequence of Unicode characters as a compressed stream of bytes. It is independent of the character encoding scheme of Unicode and can be used for UTF-8, UTF-16 and UTF-32.

I am not going to explain the entire SCSU algorithm here, but you can get its specification from the Unicode Consortium. The SCSU algorithm processes input text in terms of its Unicode code points. A very important aspect of SCSU is that the same sequence of compressed bytes always represents the same sequence of characters. However, the reverse isn't true; there are multiple ways of compressing a given character sequence.
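To make the "multiple ways" point concrete, here is a minimal sketch of a decoder for a tiny subset of SCSU: only the initial single-byte mode (where window 0 covers U+0080 to U+00FF) and the SQU "quote Unicode" tag (0x0E). The class and method names are my own for illustration and are not part of the attached libraries; every other tag is simply rejected.

```java
final class ScsuSubsetDecoder {
    private static final int SQU = 0x0E; // "quote Unicode": next two bytes are one big-endian UTF-16 code unit

    /** Decodes the small SCSU subset described above; anything else is rejected. */
    static String decode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            int b = data[i++] & 0xFF; // Java bytes are signed, so mask to 0..255
            if (b == SQU) {
                if (i + 1 >= data.length) {
                    throw new IllegalArgumentException("truncated SQU sequence");
                }
                int hi = data[i++] & 0xFF;
                int lo = data[i++] & 0xFF;
                out.append((char) ((hi << 8) | lo));
            } else if (b >= 0x20 || b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D) {
                // 0x20..0x7F is ASCII; 0x80..0xFF maps through the default window 0,
                // whose offset is U+0080, so the code point equals the byte value.
                out.append((char) b);
            } else {
                throw new UnsupportedOperationException("tag 0x" + Integer.toHexString(b) + " not handled in this sketch");
            }
        }
        return out.toString();
    }
}
```

With this sketch, the two-byte stream D6 6C and the four-byte stream 0E 00 D6 6C both decode to the same text (Ö followed by l), so a given byte stream has one meaning, but a given text has more than one valid encoding.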

Applications/Organizations using SCSU -

  • Symbian OS uses SCSU to serialize strings
  • The first draft of SCSU was released by Reuters, a news service and former financial market data provider. Reuters is believed to use SCSU internally
  • As already mentioned, SQL Server 2008 R2 uses SCSU to compress Unicode text

To be honest, SCSU has not been a major success. There are very few applications which need to compress Unicode Data using a special compression scheme. In certain situations, especially when the text contains characters from multiple character sets, the compressed text can end up being larger in size than the uncompressed one.

SQL Server and SCSU

SQL Server stores data in the compressed format only if it occupies less space than the original data. Moreover, there must be at least three consecutive characters from the same code page for the algorithm to be triggered.

A major issue with the implementation is determining whether the stored text is in compressed or uncompressed format. To resolve this issue, SQL Server makes sure that the compressed data contains an odd number of bytes and adds special case characters whenever required.
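The odd-length trick works because uncompressed UTF-16 text always occupies an even number of bytes, so an odd length can only mean compressed data. Below is a rough sketch of that idea together with the "store only if smaller" rule. It is not SQL Server's actual code: the class and method names are hypothetical, and the sketch assumes the caller has already made the compressed form odd-length, since the exact padding SQL Server adds is not covered here.

```java
final class StoredFormSketch {
    /** Uncompressed UTF-16 is always an even number of bytes, so odd length implies compressed. */
    static boolean looksCompressed(byte[] stored) {
        return (stored.length & 1) == 1;
    }

    /**
     * Picks the form to store: the compressed bytes only when they are strictly smaller.
     * Assumption: 'compressed' has already been padded to an odd length by the caller.
     */
    static byte[] chooseStoredForm(byte[] utf16, byte[] compressed) {
        if ((compressed.length & 1) == 0) {
            throw new IllegalArgumentException("compressed form must already be odd-length");
        }
        return compressed.length < utf16.length ? compressed : utf16;
    }
}
```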

The SQL Server implementation details are from a blog post by Peter Scharlock, a SQL Server Senior Program Manager.

SCSU Implementation

Though the SCSU specification is pretty comprehensive and a sample Java implementation is available from the Unicode Consortium, that implementation isn't reusable as it is built as a Console App.

Using the specification and the sample Java implementation, I built a similar implementation as a Class Library to encourage reuse. The implementation is available in two languages - Java and C#.

The Java implementation is made available as a NetBeans project and the .NET implementation is made available as a Visual Studio solution. The implementations come along with a sample Front End which uses the corresponding Class Library.

If you are planning to modify the source code of the implementations, please keep these points in mind -

  • The .NET implementation differs from the Java implementation in one basic respect: byte is signed in Java and unsigned in .NET. sbyte is available in .NET, but byte is more convenient to work with
  • Both implementations have been tested with the UTF-16 Little Endian encoding scheme. Since Java treats Unicode characters as Big Endian by default, a few tweaks have been implemented
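The two points above can be seen in a few lines of Java. This is only an illustrative sketch with names of my own choosing, not code taken from the attached libraries; it uses the standard `StandardCharsets.UTF_16LE` and `UTF_16BE` charsets.

```java
import java.nio.charset.StandardCharsets;

final class ByteNotes {
    /** Java's byte is signed (-128..127); masking yields the 0..255 value a .NET byte would hold. */
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    /** UTF-16 Little Endian: low byte first. */
    static byte[] utf16le(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    /** UTF-16 Big Endian: high byte first. */
    static byte[] utf16be(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }
}
```

For example, `(byte) 0xD6` is -42 in Java but `toUnsigned` returns 214, and the letter "A" becomes the bytes 41 00 in Little Endian and 00 41 in Big Endian.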

Example

To check the integrity of the SCSU implementations, I took a sample text file containing German, Russian and Japanese text (the same text available at the SCSU specifications site) and verified that the compressed output was as expected. The file was compressed from 274 bytes to about 199 bytes.

The source code of the Java and .NET implementations can be found here. If you find any bugs, please let me know and I will make the necessary modifications.

Friday, September 17, 2010

Unicode Data Compression In SQL Server 2008 R2

SQL Server 2008 R2 was released a few months back and one of the features I found interesting was its ability to compress Unicode data. In this post, I will introduce the various compression options available in SQL Server and, towards the end, present a sample analysis that estimates the efficiency of Unicode Data Compression and the compression-ratio improvements of SQL Server 2008 R2 over SQL Server 2008.

In my next post, I will focus on the actual algorithm used by SQL Server to achieve this compression and will provide the Java and .NET implementations of the algorithm.

Compression Techniques in SQL Server

In computer science, compression is the process of encoding information in fewer bits than an un-encoded representation would use. The compression techniques available in SQL Server can be broadly categorized into two types depending on the way they are architected - Data Compression and Backup Compression.

Data compression occurs at runtime: the data is stored in a compressed form to reduce the disk space occupied by a database. On the other hand, backup compression occurs only at the time of a backup and uses a proprietary compression technique. Backup compression can be used on a database that has already undergone data compression, but the savings might not be significant.

Data compression is again of two types - Row level Data Compression and Page level Data Compression. Row level compression primarily turns fixed-length data types into variable-length data types, thereby saving space. It also avoids storing zero and null values, saving additional space. Because of this, more rows can be accommodated in a single data page. Page level compression first performs Row level compression and then adds two additional compression features - Prefix and Dictionary Compression. As a result, page level compression offers better space savings than row level compression.

Though compression can provide significant space savings, it can also cause severe performance issues if misused. For further reading on compression, refer to "An Introduction to Data Compression in SQL Server 2008".

As the name suggests, Unicode Data Compression comes under Data Compression, and to be more specific it's a part of Row-level Compression.

Sample Analysis

Microsoft had promising stats on the Unicode Data Compression in SQL Server 2008 R2, going up to a 50% space savings on a few character sets like Hindi, German, etc. So I decided to give it a try myself.
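A figure close to 50% is plausible for Hindi because every Devanagari character takes two bytes in UTF-16, while SCSU can encode it in a single byte once a window has been placed over the Devanagari block (U+0900 to U+097F). The sketch below shows plain SCSU as described in the specification, not SQL Server's implementation, and the class name is my own. It defines window 0 with the SD0 tag (0x18) and handles only Devanagari and ASCII characters.

```java
import java.io.ByteArrayOutputStream;

final class DevanagariScsuSketch {
    private static final int SD0 = 0x18;            // tag: define dynamic window 0
    private static final int WINDOW_BASE = 0x0900;  // start of the Devanagari block

    /** Encodes Devanagari and ASCII text using a single SCSU window; other input is rejected. */
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(SD0);
        out.write(WINDOW_BASE / 0x80); // window offset index 0x12, since 0x12 * 0x80 = 0x0900
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= WINDOW_BASE && c < WINDOW_BASE + 0x80) {
                out.write(0x80 + (c - WINDOW_BASE)); // one byte per Devanagari character
            } else if (c >= 0x20 && c < 0x80) {
                out.write(c);                        // ASCII passes through unchanged
            } else {
                throw new IllegalArgumentException("character outside this sketch's range: U+"
                        + Integer.toHexString(c).toUpperCase());
            }
        }
        return out.toByteArray();
    }
}
```

For the six-character word नमस्ते this produces 8 bytes (2 bytes of window setup plus 6 bytes of text) against 12 bytes in UTF-16. The fixed 2-byte overhead matters less as the string grows, so the saving approaches 50% for longer text.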

Being from India, I decided to test the compression ratios for Hindi text. I created a randomizer in C# (.NET) to generate random text from a few Hindi phrases obtained from Linguanaut. The program generates 1.5 million random Hindi strings and writes them into a temporary file which is Bulk Inserted into a table.

To check the improvement of SQL Server 2008 R2 over SQL Server 2008 in terms of Data Compression, two separate instances of SQL Server were established on the same system configuration (Intel Core 2 Quad and 4 GB RAM). Both the instances had the same schemas for the databases and the tables. The Randomizer and the Schemas + Bulk Insert scripts are attached below.

A major drawback of Unicode Data Compression in SQL Server 2008 R2 is that it can't be applied to columns of the data types NTEXT and NVARCHAR(MAX). To highlight this, I used two different tables, one using NTEXT and another using NVARCHAR(250).

Here is a quick reference table of the compression-ratios obtained in the analysis -

Data type        SQL Server 2008    SQL Server 2008 R2
NTEXT            98.24%             98.25%
NVARCHAR(250)    95.68%             57.78%

From the above table, we can observe that the compression ratio (compressed size relative to the original size) for Unicode data in the NVARCHAR(250) column on SQL Server 2008 R2 is around 57%, i.e. a space saving of roughly 42%, which approaches the savings quoted by Microsoft. In all the other cases, the space savings are almost negligible. For the space savings of other character sets, refer to "Unicode Compression (MSDN)".
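Since compression ratio and space saving are easy to mix up, here is a small helper that converts between them. It assumes the ratios in the table are compressed size divided by original size, which is consistent with the NTEXT rows showing almost no benefit at about 98%. The class and method names are hypothetical.

```java
final class CompressionMath {
    /** Compression ratio as a percentage: compressed size relative to the original size. */
    static double compressionRatioPercent(long originalBytes, long compressedBytes) {
        return 100.0 * compressedBytes / originalBytes;
    }

    /** Space saving is whatever remains after the compressed data is accounted for. */
    static double spaceSavingPercent(double compressionRatioPercent) {
        return 100.0 - compressionRatioPercent;
    }
}
```

Under that assumption, a ratio of 57.78% corresponds to a space saving of about 42.22%, while a ratio of 98.24% corresponds to a saving of only about 1.76%.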

Get the Visual Studio Solution of the Randomizer and the Database Scripts here.

SQL Server 2008 R2 Screen Shots

SQL Server 2008 Screen Shots