Sunday, September 26, 2010

Standard Compression Scheme For Unicode (SCSU) - Java & .NET Implementations

In the previous article, I introduced the compression techniques available in SQL Server and highlighted the Unicode Data Compression feature of SQL Server 2008 R2. This post will cover the algorithm used by SQL Server for Unicode compression.

The Java and .NET (C#) implementations of the algorithm have been attached to the post. They have been built as Class Libraries to support reusability.

Standard Compression Scheme for Unicode

As evident from the post title, the algorithm used in Unicode Data Compression is the Standard Compression Scheme for Unicode (SCSU). SCSU is a technical standard for reducing the number of bytes needed to represent Unicode text. It encodes a sequence of Unicode characters as a compressed stream of bytes. It is independent of the character encoding scheme of Unicode and can be used for UTF-8, UTF-16 and UTF-32.

I am not going to explain the entire SCSU algorithm here but you can get its specifications from the Unicode Consortium. The SCSU algorithm processes input text in terms of their Unicode code points. A very important aspect of SCSU is that if the compressed data consists of the same sequence of bytes, it represents the same sequence of characters. However, the reserve isn’t true; there are multiple ways of compressing a character sequence.

Applications/Organizations using SCSU -

  • Symbian OS uses SCSU to serialize strings
  • The first draft of SCSU was released by Reuters, a news service and former financial market data provider. Reuters is believed to use SCSU internally
  • As already mentioned, SQL Server 2008 R2 uses SCSU to compress Unicode text

To be honest, SCSU has not been a major success. There are very few applications which need to compress Unicode Data using a special compression scheme. In certain situations, especially when the text contains characters from multiple character sets, the compressed text can end up being larger in size than the uncompressed one.

SQL Server and SCSU

SQL Server stores data in the compressed format only if it occupies lesser space than the original data. Moreover there must be at least three consecutive characters from the same code page for the algorithm to be triggered.

A major issue with the implementation is determining whether the stored text is in compressed or uncompressed format. To resolve this issue, SQL Server makes sure that the compressed data contains an odd number of bytes and adds special case characters whenever required.

The SQL Server implementation details are from a blog post by Peter Scharlock, a SQL Server Senior Program Manager.

SCSU Implementation

Though the specifications of SCSU are pretty compressive and a sample Java implementation is available at the Unicode Consortium, the implementation isn’t reusable as it is built as a Console App.

Using the specification and the sample Java implementation, I built a similar implementation as a Class Library to encourage reuse. The implementation is available in two languages - Java and C#.

The Java implementation is made available as a NetBeans project and the .NET implementation is made available as a Visual Studio solution. The implementations come along with a sample Front End which uses the corresponding Class Library.

If you are planning to modify the source code of the implementations, please keep these points in mind -

  • The .NET implementation differs from the Java implementation on a basic fact that the byte in Java is signed and the byte in .NET is unsigned. sbyte is available in .NET but using byte is more comfortable
  • Both the implementations have been tested for UTF-16 Little Endian encoding schemes. Since the default behavior of a Unicode character in Java is Big Endian, a few tweaks have been implemented

Example

To check the integrity of the SCSU implementations, a sample text file contains text from German, Russian and Japanese (the same text available at the SCSU specifications site) is taken and verified if the compressed text is as expected. The size of the file was compressed from 274 bytes to about 199 bytes.

The source code of the Java and .NET implementations can be found here. If you find any bugs, please let me know and I will make the necessary modifications.

7 comments:

Gideon Vos said...

Hi, the attached code sample .rar only includes the Java implementation, could you include the .NET library code as well please?

Great job BTW.

Mark S. Rasmussen said...

Excellent work. Under what license is the code made available?

I checked the code but I can't seem to find any license descriptions in there.

Unknown said...

Gideon, the .NET source code is available in the same zip file.

Unknown said...

Mark, I haven't bound the code to any open source license. You can use it straight away.

I would be glad if you place a link to this blog-post in your project and a link to your project in the comments section :)

Doug Ewell said...

Nice work. I noticed some opportunities for improvement in compression performance. I've written a SCSU compressor in C and described the process in Unicode Technical Note #14. Are you interested in discussing this offline?

Doug Ewell said...

Found a bug when encoding a sample using multiple blocks in the non-BMP space (Cuneiform). When the encoder cycles back to window 0, the test in PositionWindow() for position < 0x100 passes, which means a window gets set in the BMP to hold (temp & 0xFFFF), not what is needed. You should test for DynamicOffset[window] < 0x10000 instead of position < 0x100.

navya said...

I have read your blog its very attractive and impressive. I like it your blog.

.Net Training in Chennai | .Net Online Training | Dot Net Training in Chennai

Dot Net Online Training | LINQ Online Training