The Java and .NET (C#) implementations of the algorithm have been attached to the post. They have been built as Class Libraries to support reusability.
Standard Compression Scheme for Unicode
As evident from the post title, the algorithm used in Unicode Data Compression is the Standard Compression Scheme for Unicode (SCSU). SCSU is a technical standard for reducing the number of bytes needed to represent Unicode text. It encodes a sequence of Unicode characters as a compressed stream of bytes. It is independent of the character encoding scheme of Unicode and can be used for UTF-8, UTF-16 and UTF-32.
I am not going to explain the entire SCSU algorithm here but you can get its specifications from the Unicode Consortium. The SCSU algorithm processes input text in terms of their Unicode code points. A very important aspect of SCSU is that if the compressed data consists of the same sequence of bytes, it represents the same sequence of characters. However, the reserve isn’t true; there are multiple ways of compressing a character sequence.
Applications/Organizations using SCSU –
- Symbian OS uses SCSU to serialize strings.
- The first draft of SCSU was released by Reuters, a news service and former financial market data provider. Reuters is believed to use SCSU internally.
- As already mentioned, SQL Server 2008 R2 uses SCSU to compress Unicode text.
SQL Server and SCSU
SQL Server stores data in the compressed format only if it occupies lesser space than the original data. Moreover there must be at least three consecutive characters from the same code page for the algorithm to be triggered.
A major issue with the implementation is determining whether the stored text is in compressed or uncompressed format. To resolve this issue, SQL Server makes sure that the compressed data contains an odd number of bytes and adds special case characters whenever required.
The SQL Server implementation details are from a blog post by Peter Scharlock, a SQL Server Senior Program manager.
Though the specifications of SCSU are pretty compressive and a sample Java implementation is available at the Unicode Consortium, the implementation isn’t reusable as it is built as a Console App.
Using the specification and the sample Java implementation, I built a similar implementation as a Class Library to encourage reuse. The implementation is available in two languages – Java and C#.
The Java implementation is made available as a NetBeans project and the .NET implementation is made available as a Visual Studio solution. The implementations come along with a sample Front End which uses the corresponding Class Library.
- The .NET implementation differs from the Java implementation on a basic fact that the byte in Java is signed and the byte in .NET is unsigned. sbyte is available in .NET but using byte is more comfortable.
- Both the implementations have been tested for UTF-16 Little Endian encoding schemes. Since the default behavior of a Unicode character in Java is Big Endian, a few tweaks have been implemented.
To check the integrity of the SCSU implementations, a sample text file contains text from German, Russian and Japanese (the same text available at the SCSU specifications site) is taken and verified if the compressed text is as expected. The size of the file was compressed from 274 bytes to about 199 bytes.