Current version

v1.10.4 (stable)

Navigation

Main page
Archived news
Downloads
Documentation
   Capture
   Compiling
   Processing
   Crashes
Features
Filters
Plugin SDK
Knowledge base
Contact info
Forum
 
Other projects
   Altirra

Search

Archives

01 Dec - 31 Dec 2013
01 Oct - 31 Oct 2013
01 Aug - 31 Aug 2013
01 May - 31 May 2013
01 Mar - 31 Mar 2013
01 Feb - 29 Feb 2013
01 Dec - 31 Dec 2012
01 Nov - 30 Nov 2012
01 Oct - 31 Oct 2012
01 Sep - 30 Sep 2012
01 Aug - 31 Aug 2012
01 June - 30 June 2012
01 May - 31 May 2012
01 Apr - 30 Apr 2012
01 Dec - 31 Dec 2011
01 Nov - 30 Nov 2011
01 Oct - 31 Oct 2011
01 Sep - 30 Sep 2011
01 Aug - 31 Aug 2011
01 Jul - 31 Jul 2011
01 June - 30 June 2011
01 May - 31 May 2011
01 Apr - 30 Apr 2011
01 Mar - 31 Mar 2011
01 Feb - 29 Feb 2011
01 Jan - 31 Jan 2011
01 Dec - 31 Dec 2010
01 Nov - 30 Nov 2010
01 Oct - 31 Oct 2010
01 Sep - 30 Sep 2010
01 Aug - 31 Aug 2010
01 Jul - 31 Jul 2010
01 June - 30 June 2010
01 May - 31 May 2010
01 Apr - 30 Apr 2010
01 Mar - 31 Mar 2010
01 Feb - 29 Feb 2010
01 Jan - 31 Jan 2010
01 Dec - 31 Dec 2009
01 Nov - 30 Nov 2009
01 Oct - 31 Oct 2009
01 Sep - 30 Sep 2009
01 Aug - 31 Aug 2009
01 Jul - 31 Jul 2009
01 June - 30 June 2009
01 May - 31 May 2009
01 Apr - 30 Apr 2009
01 Mar - 31 Mar 2009
01 Feb - 29 Feb 2009
01 Jan - 31 Jan 2009
01 Dec - 31 Dec 2008
01 Nov - 30 Nov 2008
01 Oct - 31 Oct 2008
01 Sep - 30 Sep 2008
01 Aug - 31 Aug 2008
01 Jul - 31 Jul 2008
01 June - 30 June 2008
01 May - 31 May 2008
01 Apr - 30 Apr 2008
01 Mar - 31 Mar 2008
01 Feb - 29 Feb 2008
01 Jan - 31 Jan 2008
01 Dec - 31 Dec 2007
01 Nov - 30 Nov 2007
01 Oct - 31 Oct 2007
01 Sep - 30 Sep 2007
01 Aug - 31 Aug 2007
01 Jul - 31 Jul 2007
01 June - 30 June 2007
01 May - 31 May 2007
01 Apr - 30 Apr 2007
01 Mar - 31 Mar 2007
01 Feb - 29 Feb 2007
01 Jan - 31 Jan 2007
01 Dec - 31 Dec 2006
01 Nov - 30 Nov 2006
01 Oct - 31 Oct 2006
01 Sep - 30 Sep 2006
01 Aug - 31 Aug 2006
01 Jul - 31 Jul 2006
01 June - 30 June 2006
01 May - 31 May 2006
01 Apr - 30 Apr 2006
01 Mar - 31 Mar 2006
01 Feb - 29 Feb 2006
01 Jan - 31 Jan 2006
01 Dec - 31 Dec 2005
01 Nov - 30 Nov 2005
01 Oct - 31 Oct 2005
01 Sep - 30 Sep 2005
01 Aug - 31 Aug 2005
01 Jul - 31 Jul 2005
01 June - 30 June 2005
01 May - 31 May 2005
01 Apr - 30 Apr 2005
01 Mar - 31 Mar 2005
01 Feb - 29 Feb 2005
01 Jan - 31 Jan 2005
01 Dec - 31 Dec 2004
01 Nov - 30 Nov 2004
01 Oct - 31 Oct 2004
01 Sep - 30 Sep 2004
01 Aug - 31 Aug 2004

Stuff

Powered by Pivot  
XML: RSS feed 
XML: Atom feed 

§ And I thought my implementation of Deflate was bad, part 2

A few years ago I posted about poor compression ratios from the .NET Framework's System.IO.Compression library, more specifically files getting larger when going through compression using the .NET Framework 2.0 library. Fast forward to the more recent .NET Framework 4.0, where the changelog includes the following:

Compression Improvements

The compression algorithms for the System.IO.Compression.DeflateStream and System.IO.Compression.GZipStream classes have improved so that data that is already compressed is no longer inflated. This results in much better compression ratios.

Uh huh. Let's test that, shall we? :)

We'll use a revised version of the C# test program I used the last time, one which feeds an entire buffer to the compressor. This is necessary since GZipStream is unfortunately sensitive to write block sizes:

using System;
using System.IO;
using System.IO.Compression;
namespace VCZipTest {
class Program {
static void Main(string[] args) {
  using (FileStream src = new FileStream(args[0], FileMode.Open)) {
    using (FileStream dst = new FileStream(args[1], FileMode.Create)) {
      using (GZipStream def = new GZipStream(dst, CompressionMode.Compress, true)) {
        byte[] buf = new byte[(int)src.Length];
        int act = src.Read(buf, 0, buf.Length);
        def.Write(buf, 0, act);
        def.Flush();
      }
          System.Console.WriteLine("{0} ({1} bytes) -> {2} ({3} bytes)", args[0],
               src.Length, args[1], dst.Length);
    }
  }
}
}
}

Some background

First, I should probably explain a bit about the gzip/Deflate algorithm and clear up some misconceptions.

The compression algorithm itself is called Deflate and is described by RFC 1951. Deflate compressed data can be wrapped in either a minimalistic gzip stream described by RFC 1952, or in the more well known .ZIP format introduced and made popular by PKWARE's PKZIP tool, which includes multiple compressed files along with directory information. Of concern here is only the implementation of the Deflate algorithm itself, the method by which the raw bits are compressed.

Deflate consists primarily of two layers of compression. The first is LZ77 style sliding window compression, where repeated strings of bytes are replaced by references to the previous instance. What is not so obvious is that this layer is also capable of compressing runs of identical byte values or repeated patterns of byte values through self-overlapping runs. (This fact is rediscovered each time a C programmer sees the forward byte copy loop in the decoder and tries to "fix" it with memmove or "optimize" it with memcpy.) The second layer uses Huffman encoding to compress based on the probability of codes, which include both literal bytes and copy length codes. Thus, Deflate is able to compress not only based on patterns, but also based on some byte values occurring more often than others or not occurring at all.

It is provably true that no lossless compression algorithm can consistently compress all incoming streams, and some streams have to enlarge for others to shrink. However, the amount of enlargement required to satisfy this requirement is much lower than the 50%+ enlargement seen in using the .NET Framework 2.0 version of System.IO.Compression. One way to solve this problem is simply to add a single byte for a compression flag. I've seen discussions where people said that the .NET's DeflateStream/GZipStream is unable to take advantage of this because the compression flag is stored in the .ZIP container and not in the compressed stream itself, which is all that the .NET code handles. This is incorrect. While the .ZIP container does have a compressed/stored flag in the header for each file, the compressed Deflate stream itself is also composed of blocks such that each block can also be encoded as stored or compressed. This has a cost of a three byte header per stored block of up to 64K each, which still results in an upper bound much lower than a 150% ratio.

The other big misconception I've read is that .NET's implementation is hampered by supporting a stream instead of seeing the whole file. The main problem with this explanation is that stream compression doesn't preclude optimization on a per block basis; it simply requires that enough buffering be used to accommodate a block at a time, which is what zlib does. What would preclude per-block optimization is performing compression under such tight latency constraints that bits need to be shoved out the door almost as soon as they arrive and can be compressed, such as with a modem. Even then, it's possible to mitigate the issue by dynamically monitoring the compression ratio and switching encoding modes as needed.

The easy test

Let's start with a simple test, 100 null bytes. That should be easy for just about any compressor. First, .NET 2.0:

testnulls.bin (100 bytes) -> testnulls.prev (124 bytes)

Well, expectations were low for that one. This includes the small gzip stream header, which is about 10 bytes, but that's still bigger than if a stored block had been used (~113 bytes). Let's try .NET 4.0:

testnulls.bin (100 bytes) -> testnulls.gz (128 bytes)

That's... not promising. Remember, this is just about the easiest case possible for any compression algorithm. gzip -9 is able to get this down to 38 bytes, and its header is slightly bigger due to inclusion of a filename. What happened? Well, let's dump the .NET 2.0 output first:

New block: dynamic Huffman trees
Code 0: 14 bits
Codes 1-6: 14 bits
Code 7: 14 bits
Code 8: 14 bits
Code 9: 12 bits
Code 10: 6 bits
Code 11: 14 bits
Codes 12-17: 14 bits
Codes 18-23: 14 bits
Codes 24-29: 14 bits
...
Dynamic block header took 790 bits
Literal <00>
Reference <1, 99>: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EOB

To review, a Deflate stream is composed of a series of blocks, where each block can have a different encoding mode. This stream consists of a single block in "dynamic" mode, where Huffman trees are included that are tailored for the data in the block. The main tree contains codes for both literal values -- encodings for single byte values -- as well as an end of block (EOB) code and length codes that indicate repeated sections to copy. After the trees comes the actual data, which consists of a mix of literal codes and references to data to copy, and ending with EOB.

In this case, the 2.0 encoder was correctly able to encode a run for the nulls, but it spent a whopping 790 bits on encoding a full Huffman tree that it didn't need. Not only that, but the tree is completely mismatched for this data: the only codes present are a literal 00, a single copy length code, and an end of block (EOB) code. 14 bits for a null byte in a file consisting of nothing but that definitely isn't a good choice, nor is allocating nodes for all of the other bytes that never occur. We'll get back to this in a second.

Now, the 4.0 encoder... why did it produce a bigger file than the 2.0 encoder?

Dynamic block header took 790 bits
Literal <00>
Reference <1, 99>: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EOB
New block: stored mode (0 bytes)

Strange, the 4.0 encoder added an extra zero-byte stored block at the end of the file. A "stored" block is just raw data encoded as straight bytes, but a zero-byte stored block is useless. I can't think of why you'd do this in an encoder.

How about gzip?

New block: static Huffman trees
Literal <00>
Literal <00>
Reference <1, 98>: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EOB

There is a third block mode in Deflate that I didn't mention earlier, static tree mode. In this mode, compression is used, but the encoding trees aren't specified; instead, the decoder assumes a known and mostly flat static tree that allocates 7-9 bits to literal codes, which a slight bias toward ranges used by ASCII text. The gzip encoder used this mode to avoid having to spend bits on encoding the trees, which likely would have been much less than the 790 bits used by GZipStream, but still more than would have been saved by encoding a more targeted set of trees. The two literal codes are slightly odd here; I don't see why gzip couldn't have just encoded a single literal like the .NET encoders.

A 6-bit test

The trees encoded by the .NET encoder were certainly strange and not very appropriate. Let's do another test, which is to encode a 1K block of 6-bit pseudorandom data produced by this C++ program:

#include <stdio.h>

int main() {
    FILE *f = fopen("rand6.bin", "wb");
    unsigned seed = 1;

    for(int i=0; i<1024; ++i) {
        putc(seed & 0x3f, f);
        seed = (seed << 8) + (((seed >> (28 - 8)) ^ (seed >> (31 - 8))) & 0xff);
    }
}

With only 6 of 8 bits pertinent, and with the pseudorandom data mostly suppressing the pattern-loving sliding window compression, we should expect a compression ratio of about 75%. Reverse order this time, gzip first:

New block: dynamic Huffman trees
Code 0: 5 bits
Code 1: 4 bits
Code 2: 6 bits
Codes 3-7: 6 bits
Code 8: 5 bits
...
Code 273: 3 bits
Code 274: 5 bits
Code 275: 3 bits
Code 276: 5 bits
Code 277: 3 bits
Dynamic block header took 293 bits
Literal <01>
Literal <00>
Literal <00>
Literal <00>
Literal <12>
Literal <00>
Literal <00>
Literal <01>
Literal <04>
Reference <7, 3>: 00 00 12
...

Total output including 20 byte stream header, 791 bytes. The LFSR I used to generate the data isn't that great and so gzip was able to find some redundancy to exploit, but that was balanced by the bits it had to spend on encoding the trees, which show the expected allocations of around 6 bits/value. This is the kind of behavior you'll see when a compressor gets mostly incompressible data, like an MP3 file or already compressed data: the Huffman trees are mostly balanced, but the sliding window compression starts finding tiny sections of repeated data here and there.

Next, .NET Framework 4.0:

rand6.bin (1024 bytes) -> rand6.gz (1052 bytes)

New block: stored mode (1024 bytes)
New block: stored mode (0 bytes)

Again, GZipStream doesn't do as good of a job as gzip, but at least this time it uses a stored block and doesn't make everything much worse. Now, .NET Framework 2.0:

New block: dynamic Huffman trees
Code 0: 14 bits
Codes 1-6: 14 bits
Code 7: 14 bits
Code 8: 14 bits
Code 9: 12 bits
Code 10: 6 bits
Code 11: 14 bits
Codes 12-17: 14 bits
...
Dynamic block header took 790 bits
Literal <01>
Literal <00>
Literal <00>
Literal <00>
Literal <12>
Literal <00>
Literal <00>
Literal <01>
Literal <04>
Reference <7, 3>: 00 00 12

Uh oh... this tree seems familiar. Let's do another test, and compress the source code to the test program itself (.NET 4.0):

..netzip.cs (682 bytes) -> foo.gz (448 bytes)

New block: dynamic Huffman trees
Code 0: 14 bits
Codes 1-6: 14 bits
Code 7: 14 bits
Code 8: 14 bits
Code 9: 12 bits
Code 10: 6 bits
Code 11: 14 bits
Codes 12-17: 14 bits
Codes 18-23: 14 bits
Codes 24-29: 14 bits
...
Dynamic block header took 790 bits

Very different data, same trees.

As I noted the last time in the comments, the .NET Framework 2.0 compressor appeared to be using a predefined Huffman tree that was tailored for text, allocating 14 bits for most control codes and 6 bits for spaces, instead of generating an optimal tree based on actual probabilities. This does not appear to have changed in 4.0. No matter what data I gave the encoder, it always used this tree whenever it emitted dynamic blocks. Doing so simplifies and speeds up the encoder, but the downside is that it produces suboptimal results on both text and binary data. For text, coding space is wasted on lots of bytes that don't actually appear, but have to be in the tree in case they do; for binary data, it's seriously bad because the most common bytes in binary data suffer terrible expansion (8 bits to 14 bits). Unfortunately, this does not appear to have been fixed in the 4.0 version of compressor, which means that the Huffman portion of the Deflate algorithm is largely ineffective and sometimes counterproductive.

I find this design decision odd. By not generating an optimal set of Huffman trees, the compressor is able to emit codes almost as soon as they are produced and there is no need to buffer the output data. However, in terms of performance, the encoder already needs to handle bit packing in order to emit with the static trees, so the additional time needed is just that to measure and create optimal trees. That's just a histogram and a quick tree node merging algorithm, which doesn't take long. I suspect that most .NET programmers would not mind a little bit of delay and an additional 64K of memory to optimize the trees properly.

Buffer size issues

Another oddity of GZipStream is its sensitivity to buffer sizes. With the test program above, where the entire buffer is fed to the encoder, the output consists of a series of nearly 4K blocks:

New block: stored mode (4090 bytes)
New block: stored mode (4090 bytes)
New block: stored mode (4090 bytes)
...

I have no idea why 4090 bytes, since the overhead for a stored block is three bytes, and a stored block can be up to 65535 bytes. That's suboptimal, but a very minor transgression. Writing blocks of 4096 bytes, however, produces worse output:

New block: stored mode (4090 bytes)
New block: stored mode (6 bytes)
New block: stored mode (4090 bytes)
New block: stored mode (6 bytes)
...

What appears to be happening is that the compressor doesn't use its buffer as a sliding window, but rather keeps resetting it. This fragments the output into more blocks and also has the undesirable property of making the compression ratio dependent upon the buffer size used. Furthermore, occasionally the compressor will do really odd things like this:

New block: stored mode (4090 bytes)
New block: dynamic Huffman trees
Code 0: 14 bits
Codes 1-6: 14 bits
Code 7: 14 bits
Code 8: 14 bits
Code 9: 12 bits
...
Code 313: 5 bits
Code 314: 5 bits
Code 315: 5 bits
Dynamic block header took 790 bits
Literal <b4>
Literal <34>
Literal <65>
Literal <30>
Literal <6b>
Literal <4e>
EOB

It appears in this case that the compressibility check failed; the encoder emitted a stored block of 4090 bytes, and then used a dynamic block for the rest... using 106 bytes to encode 6 bytes. This pattern occurs several times throughout the file.

Conclusion

Given all of this, the conclusion we can draw is that GZipStream and DeflateStream are not significantly improved in .NET Framework 4.0. The compressor now has checks to prevent major expansion of data, but you can expect compression ratios across the board to be significantly worse than standard implementations like zlib. As such, I would highly suggest using another library for compression needs.

Comments

Comments posted:


I find it hard to believe Microsoft can't implement zlib in C#. Or wrap the C code in C++/CLI to call it from C#. They have the guy who wrote the cab compression software. Was this outsourced or something? why hasn't somebody else implemented zlib in C# in the last 10 years, if the Microsoft implementation is that bad?

Z.T. - 29 03 11 - 21:49


I think you were too generous to Microsoft by assuming the decision to use a fixed, but non-standard Huffman tree to be a conscious one. This confluence of bad decisions just points to somebody with zero experience being tasked to develop compression. Even something as simple as LZ77 or LZSS is very easy to get wrong, or slow, or both.

Given that stuff like 7zip is generally available, I don't see why anybody would use Microsoft's implementation.

Stefan - 30 03 11 - 01:22


@z.t. http://www.componentace.com/zlib_.NET.ht.. is a C# implementation of zlib.

Paul - 30 03 11 - 02:01


zlib is probably the single most widely deployed, heavily tested library in the world. Writing a new deflate implementation from scratch was patently absurd.

Glenn Maynard - 30 03 11 - 02:57


"Writing a new deflate implementation from scratch was patently absurd". Was not because of licensing reasons. The implementation is still horrible but at least it is correct. SharpZipLib has problems with >2gb files for example. In general quality of microsoft libs is very good which makes it even more surprising how ridiculous this one is.

tobi - 30 03 11 - 07:31


zlib is permissively licensed, and used in thousands of programs (probably closer to hundreds of thousands, including every user of libpng). If there are any "licensing reasons", they're the bogus contrivances of lawyers desperate to look like they're worth continued employment.

Glenn Maynard - 30 03 11 - 09:44


I'm pretty sure whoever wrote the original encoder did have experience in compression, because the LZ77 portion of the encoder is written competently. It's the stream interface and Huffman coding parts that aren't. One possibility is that the original encoder might have been written in C under some tight constraint like HTTP header compression in kernel mode, and it was inappropriately ported by someone who wasn't as savvy. The biggest transgression here isn't the performance of the encoder -- it's that it was published in System.IO.Compression without any hint of the specialized nature.

As for not using zlib, keep in mind that there many be many reasons why a zlib-derived module could not be used. One of them is if the platform needs to be licenseable to parties that will not accept any third party components.

Phaeron - 30 03 11 - 16:07


I've had good luck with Ionic.Zip, which is supposed to be based on a port of Zlib. http://dotnetzip.codeplex.com/

Matt - 30 03 11 - 18:53


Stored blocks have a 5 byte header, not 3 (see section 3.2.4). Well, technically the first byte is actually two bits plus alignment, but in any case the expansion for incompressible data would converge to 65540/65535. However zlib splits blocks earlier (8K I think - might depend on settings).

It'd be interesting to check the speed and also whether it supports lazy matching (discard current match and code one byte as literal in order to have a longer match later on).

Den - 30 03 11 - 23:47


@den: ignoring the CTYPE bits (2 bits), any any filler bits to get to an octet alignment, the stored block header is 4 octets.

As for why someone would have a zero-length stored block: sync flush. It would force octet alignment for the following data and flush any remaining bits in the encoder, allowing them to be sent over a pipe or socket. Why someone would do this at the end of a file, I don't know. Perhaps MS intended to use those classes for actual streaming of data, which would account for the sync flush at the end.

When I was playing around with deflate, I also noticed the strange behavior of zlib outputting two literal nulls before switching to references.

Coderjoe - 31 03 11 - 09:57


I also was ignoring the BFINAL bit. (and it is BTYPE not CTYPE. memory fail.)

Coderjoe - 31 03 11 - 10:26


@den:
Good catch, I had forgotten about the inverted length bytes.

I don't have VS2008 and I'm too lazy to try to set up source server support to grab the source code, so the best I was able to do was browse the IL with ildasm. Best I can tell, GZipStream/DeflateStream are hardcoded to either a 4K or 8K max window, but do have support for lazy matches. Speed-wise I don't think it's as good as zlib, but it's not fair comparing a C# compressor against a native C+asm based one.

Phaeron - 31 03 11 - 15:36


Why don't you use reflector?

Gabest - 10 04 11 - 08:52


In .Net 4.5, zlib is used. So it will finally be actually improved. http://msdn.microsoft.com/en-us/magazine..

codekaizen - 05 06 12 - 06:13


> In .Net 4.5, zlib is used.
But, apart of omitting the two-bytes and adler crc (which might be justified, as long as we are speaking of a deflated stream and not a real zlib stream), they do not let set the real numeric compression level (why, o why that idiocy of an "user friendly" enum for that, and with 3 values? the compression level is standard) and, more important for me, they don't let chose the "strategy". If they are using zlib, that should have taken 30 minutes (5 for coding, the rest for test and doc).

hernan (link) - 08 12 12 - 13:40

Comment form


Please keep comments on-topic for this entry. If you have unrelated comments about VirtualDub, the forum is a better place to post them.
Name:  
Remember personal info?

Email (Optional):
Your email address is only revealed to the blog owner and is not shown to the public.
URL (Optional):
Comment: /

An authentication dialog may appear when you click Post Comment. Simply type in "post" as the user and "now" as the password. I have had to do this to stop automated comment spam.



Small print: All html tags except <b> and <i> will be removed from your comment. You can make links by just typing the url or mail-address.