A not so good way to decode Huffman codes
Having made a working Huffyuv decoder, I took a shot at making it faster than the 2.1.1 decoder... and failed.
I guess I'll present the algorithm here as a curiosity. Perhaps someone can suggest a variant that's actually worthwhile.
Huffyuv encodes its main bitstream as a series of interleaved Huffman-encoded codes, the order being a repeating Y-Cb-Y-Cr for the YUY2 mode, and the output being a series of byte values or byte deltas, which then possibly go through the predictor. There are thus three variable-length decoders involved. The traditional way to handle this is to break each code table down into a series of lookup tables and select the right table according to the next N bits in the stream, either by a branch tree of comparisons against the window value or by some faster way of looking up the ranges. In particular, the Bit Scan Forward (BSF) and Bit Scan Reverse (BSR) instructions on x86 are attractive for this.
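For reference, the comparison-based approach looks something like this in C. This is a minimal sketch assuming a canonical code; the limit/offset/symbol tables are hypothetical stand-ins built from the code lengths, not anything from Huffyuv itself. The linear scan over lengths is where a branch tree or a BSR-based shortcut would go.

    #include <stdint.h>

    /* Hypothetical tables for a canonical code, derived from the code
       lengths: limit[len] is the largest left-justified window value
       whose code is at most len bits long, and offset[len] maps a
       len-bit code value to an index into symbol[]. */
    extern const uint32_t limit[33];
    extern const int32_t  offset[33];
    extern const uint8_t  symbol[256];

    /* Decode one code from a 32-bit window holding the next bits of
       the stream, MSB-first; returns the number of bits consumed. */
    static int decode_one(uint32_t window, uint8_t *out)
    {
        int len = 1;
        while (window > limit[len])    /* the scan a BSR can shortcut */
            ++len;
        *out = symbol[offset[len] + (int32_t)(window >> (32 - len))];
        return len;
    }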
I decided to try a different tack, which was to build a state machine out of all three of the tables simultaneously, with the form: (current state, input byte) -> (next state, advance_input, advance_output, output_byte). There are a few advantages to doing this: no bit manipulations are needed on the input stream, since it is always consumed a byte at a time, and no branches at all are required beyond the loop's own. All three decoders live in the same state machine, so in the end, the decoding loop looks like this:
    decloop:
        movzx   edx, byte ptr [ecx]     ;read next byte
        mov     eax, [eax + edx*4]      ;read next state
        mov     [ebx], al               ;write output byte
        and     eax, 0ffffff00h         ;remove output byte
        add     eax, eax                ;shift out input_advance and compute next_state*2
        adc     ecx, 0                  ;advance input pointer
        add     eax, eax                ;shift out output_advance and compute next_state*4
        adc     ebx, 0                  ;advance output pointer
        add     eax, edi                ;compute next state address
        cmp     ebx, ebp                ;check for end of decoding
        jne     decloop                 ;if not, decode more bytes
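In C, that loop is roughly equivalent to the following, with the entry layout inferred from the shifts above: bit 31 is the input advance, bit 30 the output advance, bits 29-8 the next state, and bits 7-0 the output byte.

    #include <stdint.h>

    /* table[] holds 256 entries per state, indexed by the input byte. */
    static void decode(const uint32_t *table, const uint8_t *src,
                       uint8_t *dst, uint8_t *dst_end)
    {
        uint32_t state = 0;
        while (dst != dst_end) {
            uint32_t e = table[state * 256 + *src];
            *dst = (uint8_t)e;           /* written speculatively; only
                                            sticks if the pointer advances */
            src += e >> 31;              /* advance input? */
            dst += (e >> 30) & 1;        /* advance output? */
            state = (e >> 8) & 0x3FFFFF; /* next state index */
        }
    }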
Now, one of the annoying issues with trying to optimize a Huffman decoder, or a decoder for any sort of variable-length prefix code for that matter, is that it's an inherently serial procedure. You can't begin decoding the second code in parallel until you know where the first one ends. (Okay, you could if you had a parallel decoder for every bit position, but that's typically impractical.) That means the decoding speed is determined by the length of the critical dependency path, which in this case is 9 instructions (movzx, mov, mov, and, add, add, add, cmp, jne). I suspect I could trim that down a little bit if I hardcoded the base of the table and rearranged the code somewhat, but it turns out to be irrelevant. For the standard Huffyuv 2.1.1 YUY2 tables, the state machine turns out to be 1.8MB in size, and that means the decoding routine spends most of its time in cache misses on the instruction that fetches the state entry, the second instruction. Rats.
In the end, it did work, but it was around 20-30% slower than a SHLD+BSR based decoder, at least on a T9300. That doesn't count the bitstream swizzle and prediction passes, but those take almost no time in comparison. It might be more lucrative for a smaller variable-length code, or for one where the minimum code length is 8 bits, so that the conditional input advance could be dropped.
In general, it seems pretty hard to beat a SHLD+BSR decoder. This is particularly the case on a PPro-based architecture where BSR is very fast, two clocks for PPro/P2/P3/PM/Core/Core2, and one clock for enhanced Core 2 (!). The P4s seem a bit screwed in general, because while BSR is slow, so's practically everything else. Athlons are a bit weird -- they have slow BSR/BSF like the original Pentium and slow scalar<->vector register moves, but they're fast at scalar ops. That probably explains why I saw a branch tree instead of BSR the last time someone sent me a crash dump in Huffyuv 2.2.0's decoder....
I'm tempted to try putting a first-level table check on the decoder to see if that helps. The way this works is that you pick a number of bits, k, and determine how many codes are of length k or less. You encode those directly in a table of size 2^k as (symbol, length) pairs, and everything else goes to more tables or an alternate decoder. Ideally, the effectiveness of this table is determined simply by how many entries are encoded directly in it: with k=11, for instance, if 1800 of the 2048 entries can use the fast path, then you'll hit the fast path about 88% of the time. In practice, this can vary depending on how well the distribution of the encoded data matches the distribution used to create the prefix code tree; they may not always match well when the tree is static, as is the case in vanilla Huffyuv. It's also questionable in this case because the advantage over the BSR is slight, and the need for a fallback from the table adds an asymmetric but poorly predicted branch.
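A sketch of what that might look like in C, with k fixed at 11 to match the numbers above and decode_slow() as a hypothetical stand-in for whatever handles the longer codes:

    #include <stdint.h>

    #define K 11                          /* first-level table index width */

    typedef struct {
        uint8_t sym;                      /* decoded byte */
        uint8_t len;                      /* code length; 0 = fall back */
    } FastEntry;

    static FastEntry fast[1 << K];

    /* Fill every K-bit index that begins with this code's bits
       (requires len <= K). */
    static void add_code(uint32_t code, int len, uint8_t sym)
    {
        uint32_t first = code << (K - len);
        uint32_t count = 1u << (K - len);
        for (uint32_t i = 0; i < count; ++i) {
            fast[first + i].sym = sym;
            fast[first + i].len = (uint8_t)len;
        }
    }

    int decode_slow(uint32_t window, uint8_t *out);  /* hypothetical fallback */

    static int decode_fast(uint32_t window, uint8_t *out)
    {
        FastEntry e = fast[window >> (32 - K)];
        if (e.len) {                      /* fast path */
            *out = e.sym;
            return e.len;                 /* bits consumed */
        }
        return decode_slow(window, out);  /* codes longer than K bits */
    }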
As a final note, most of these optimizations are only possible when Huffman codes are placed in the bitstream starting from the MSB of each word, as Huffyuv does, and as many standard encodings do, including JPEG and MPEG. Deflate's a pain in the butt to decode in comparison because it places codes starting at the LSB, which means you can't easily treat the bitstream as a huge binary fraction and use numeric comparisons, as the methods I've described above do.
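The usual workaround, and zlib's inflate does essentially this, is to build the lookup table with bit-reversed keys so the low bits of the window can index it directly. A sketch, with K and FastEntry as in the earlier fast-table example:

    #include <stdint.h>

    #define K 11
    typedef struct { uint8_t sym, len; } FastEntry;
    static FastEntry fast[1 << K];

    /* Reverse the low 'len' bits of 'code'. */
    static uint32_t bit_reverse(uint32_t code, int len)
    {
        uint32_t r = 0;
        while (len--) {
            r = (r << 1) | (code & 1);
            code >>= 1;
        }
        return r;
    }

    /* LSB-first fill: the code sits bit-reversed in the LOW bits of
       the window and the bits above it belong to later codes, so a
       code of length len owns every index congruent to its reversed
       value mod 2^len. Lookup is then fast[window & ((1u << K) - 1)]. */
    static void add_code_lsb(uint32_t code, int len, uint8_t sym)
    {
        for (uint32_t i = bit_reverse(code, len); i < (1u << K); i += 1u << len) {
            fast[i].sym = sym;
            fast[i].len = (uint8_t)len;
        }
    }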