§ A not so good way to decode Huffman codes

Having made a working Huffyuv decoder, I took a shot at making it faster than the 2.1.1 decoder... and failed.

I guess I'll present the algorithm here as a curiosity. Perhaps someone can suggest a variant that's actually worthwhile.

Huffyuv encodes its main bitstream as a series of interleaved Huffman codes, with the order being a repeating Y-Cb-Y-Cr for the YUY2 mode, and the output being a series of byte values or byte deltas, which then possibly go through the predictor. There are thus three variable-length decoders involved. The traditional way to handle this is to break each code table down into a series of subtables and select the right one according to the next N bits in the stream, either by a branch tree of comparisons against the window value, or by some faster way of looking up the ranges. In particular, the Bit Scan Forward (BSF) and Bit Scan Reverse (BSR) instructions on x86 are attractive for this.
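For illustration, here is what the comparison-based approach looks like on a toy four-symbol code (A=0, B=10, C=110, D=111) rather than Huffyuv's real tables. The window holds the next 32 bits of the stream, left-justified, so that plain unsigned comparisons against per-length range limits pick out the code length; a BSR-based decoder replaces the compare chain with a bit scan, but the underlying table layout is the same idea.

```c
#include <stdint.h>

/* Decode one symbol from a left-justified 32-bit window.
   Toy code (not Huffyuv's actual tables): A=0, B=10, C=110, D=111.
   Because codes are packed MSB-first and assigned canonically, unsigned
   comparisons against the range limits directly select the length. */
static int decode_one(uint32_t window, int *len)
{
    if (window < 0x80000000u) { *len = 1; return 'A'; }  /* 0xxxxxxx... */
    if (window < 0xC0000000u) { *len = 2; return 'B'; }  /* 10xxxxxx... */
    if (window < 0xE0000000u) { *len = 3; return 'C'; }  /* 110xxxxx... */
    *len = 3; return 'D';                                /* 111xxxxx... */
}
```

The caller then shifts `len` bits out of the window (SHLD territory) and refills it before decoding the next symbol.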

I decided to try a different tack, which was to build a state machine out of all three tables simultaneously, with the form: (current state, input byte) -> (next state, advance_input, advance_output, output_byte). There are a few advantages to doing this: no bit manipulations are needed on the input stream, since it is always consumed a byte at a time, and no branches at all are required in the decoding logic itself. All three decoders live in the same state machine, so in the end, the decoding loop looks like this:

movzx edx, byte ptr [ecx] ;read next byte
mov   eax, [eax + edx*4]  ;read next state
mov   [ebx], al           ;write output byte
and   eax, 0ffffff00h     ;remove output byte
add   eax, eax            ;shift out input_advance and compute next_state*2
adc   ecx, 0              ;advance input pointer
add   eax, eax            ;shift out output_advance and compute next_state*4
adc   ebx, 0              ;advance output pointer
add   eax, edi            ;compute next state address
cmp   ebx, ebp            ;check for end of decoding
jne   decloop             ;if not, decode more bytes
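In C, the same loop might look like the following. This is my reading of the entry layout implied by the assembly (bit 31 = advance_input, bit 30 = advance_output, bits 8-29 = entry offset of the next state, bits 0-7 = output byte), so treat it as a sketch rather than the shipped code:

```c
#include <stdint.h>

/* One 32-bit table entry per (state, input byte) pair; each state owns
   256 consecutive entries. Inferred layout:
     bit 31     advance_input   (consume the current input byte)
     bit 30     advance_output  (the output byte just written is final)
     bits 8-29  entry offset of the next state (state number * 256)
     bits 0-7   output byte (written speculatively every step)       */
static void sm_decode(const uint32_t *table, const uint8_t *src,
                      uint8_t *dst, const uint8_t *dst_end)
{
    const uint32_t *state = table;               /* start in state 0 */
    while (dst != dst_end) {
        uint32_t e = state[*src];                /* one load per step */
        *dst = (uint8_t)e;                       /* speculative write */
        src += e >> 31;                          /* maybe advance input */
        dst += (e >> 30) & 1;                    /* maybe advance output */
        state = table + ((e >> 8) & 0x3FFFFFu);  /* jump to next state */
    }
}
```

A degenerate one-state machine that passes bytes through unchanged is enough to exercise the loop: every entry sets both advance bits and echoes the input byte.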

Now, one of the annoying issues with trying to optimize a Huffman decoder, or a decoder for any sort of variable-length prefix code for that matter, is that it's an inherently serial procedure. You can't begin decoding the second code in parallel until you know where the first one ends. (Okay, you could if you had a parallel decoder for every bit position, but that's typically impractical.) That means the decoding speed is determined by the length of the critical dependency path, which in this case is 9 instructions (movzx, mov, mov, and, add, add, add, cmp, jne). I suspect I could trim that down a little bit if I hardcoded the base of the table and rearranged the code somewhat, but it turns out to be irrelevant. For the standard Huffyuv 2.1.1 YUY2 tables, the state machine turns out to be 1.8MB in size, and that means the decoding routine spends most of its time in cache misses on the instruction that fetches the state entry, the second instruction. Rats.

In the end, it did work, but was around 20-30% slower than a SHLD+BSR based decoder, at least on a T9300. That doesn't count the bitstream swizzle and prediction passes, but those take almost no time in comparison. It might be more lucrative for a smaller sized variable length code or one where the minimum code length is 8 bits and the conditional input advance could be dropped.

In general, it seems pretty hard to beat a SHLD+BSR decoder. This is particularly the case on a PPro-based architecture where BSR is very fast, two clocks for PPro/P2/P3/PM/Core/Core2, and one clock for enhanced Core 2 (!). The P4s seem a bit screwed in general, because while BSR is slow, so's practically everything else. Athlons are a bit weird -- they have slow BSR/BSF like the original Pentium and slow scalar<->vector register moves, but they're fast at scalar ops. That probably explains why I saw a branch tree instead of BSR the last time someone sent me a crash dump in Huffyuv 2.2.0's decoder....

I'm tempted to try putting a first-level table check on the decoder to see if that helps. The way this works is that you pick a number of bits, k, and determine how many codes are of length k or less. You encode those directly in a table of size 2^k as (code, length) pairs, and everything else goes to more tables or an alternate decoder. Ideally, the effectiveness of this table is determined simply by how many entries are encoded directly in it; e.g. if 1800/2048 entries can use the fast path, then you'll hit the fast path about 88% of the time. In practice, this can vary depending on how well the distribution of the encoded data matches the distribution used to create the prefix code tree; they may not always match well when the tree is static, as is the case in vanilla Huffyuv. It's also questionable in this case because the advantage over the BSR is slight, and the need for a fallback from the table requires adding an asymmetric and potentially poorly predicted branch.
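As a sketch of that first-level table on the same kind of toy code (again, not Huffyuv's actual tables): with k = 3 and the code A=0, B=10, C=110, D=111, every code fits within k bits, so there is no slow path at all; in a real decoder, the entries for longer codes would instead point at the fallback.

```c
#include <stdint.h>

typedef struct { uint8_t sym, len; } Entry;

/* Direct 2^k lookup table, k = 3: index with the next 3 bits of the
   stream; every index whose prefix matches a code maps to that code. */
static void build_table(Entry table[8])
{
    for (int i = 0; i < 8; i++) {
        if ((i >> 2) == 0)       table[i] = (Entry){ 'A', 1 };  /* 0xx */
        else if ((i >> 1) == 2)  table[i] = (Entry){ 'B', 2 };  /* 10x */
        else if (i == 6)         table[i] = (Entry){ 'C', 3 };  /* 110 */
        else                     table[i] = (Entry){ 'D', 3 };  /* 111 */
    }
}

/* Decode 'count' symbols from an MSB-first bitstream. */
static void decode(const Entry table[8], const uint8_t *src,
                   char *out, int count)
{
    uint32_t window = 0;
    int bits = 0;
    while (count--) {
        while (bits < 3) { window = (window << 8) | *src++; bits += 8; }
        Entry e = table[(window >> (bits - 3)) & 7];  /* peek 3 bits */
        *out++ = (char)e.sym;
        bits -= e.len;                                /* consume the code */
    }
}
```

Decoding "ABCD" this way consumes the bit string 0 10 110 111, i.e. the bytes 0x5B 0x80 after zero-padding.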

As a final note, most of these optimizations are only possible when Huffman codes are placed in the bitstream starting from the MSB of each word, as Huffyuv does, and as many standard encodings do, including JPEG and MPEG. Deflate's a pain in the butt to decode in comparison because it places codes starting at the LSB, which means you can't easily treat the bitstream as a huge binary fraction and use numeric comparisons, as the methods I've described above do.
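The difference in packing order is easiest to see side by side. This sketch peeks n code bits from a refill window each way; only the MSB-first form keeps codes in numeric order, which is what the comparison and BSR tricks above rely on.

```c
#include <stdint.h>

/* MSB-first (Huffyuv, JPEG, MPEG): the next code sits in the top valid
   bits, so the window behaves like a binary fraction and unsigned
   comparisons against code values work directly. */
static uint32_t peek_msb(uint32_t window, int bits_valid, int n)
{
    return (window >> (bits_valid - n)) & ((1u << n) - 1);
}

/* LSB-first (Deflate): the next code sits in the low bits and is
   isolated with an AND; its bits also arrive in reversed order, so
   numeric comparisons against the code values no longer apply. */
static uint32_t peek_lsb(uint32_t window, int n)
{
    return window & ((1u << n) - 1);
}
```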

Comments



This is going off on a tangent, but I'm currently trying to get my head around Deflate, and since you've mentioned the part of it that bugs me the most - do you know why they picked the LSB order? I mean, all I see it doing is introducing logical shifts in my (granted, trivial scenario) code. A waste of cycles with no benefit.

keije - 05 06 08 - 15:39


No, I don't. Given that Deflate was invented in the 16-bit DOS days, I wouldn't be surprised if it had something to do with code size constraints. I've heard that PKZIP didn't always generate optimal trees because it used the Shannon-Fano algorithm (splitting) instead of Huffman (merging); it's possible that it also processed the bitstream a bit at a time, and in that vein it may have been more convenient to be able to isolate the LSB with an AND operation.

Phaeron - 06 06 08 - 01:52


Why not make it a two-pass algorithm? First pass could mark beginnings and the second pass could then decode in parallel. Would be much faster IMO.

Igor Levicki (link) - 07 06 08 - 11:39


You'd spend more time shuffling data around than simply just doing the bitstream decoding entirely on one thread. You already have to do enough table lookups to get the bit length of a code, so it's a no-brainer to also stick the code next to the bit length in the same table. You could then put the post-bitstream-decoding work on another thread, but the problem is that in Huffyuv bitstream decoding is practically all it does.

Phaeron - 07 06 08 - 14:38


Crazy idea time - since you are adding your own implementation to vdub, you could tweak the bitstream and make it a derivative codec that would play nicer :)

keije - 07 06 08 - 18:00


I didn't necessarily mean another thread when I said in parallel; I meant prepare in a first pass and decode using SIMD in a second pass if possible, using integer shuffles (PALIGNR, etc.). That is bound to become even faster in the next CPU generation, since Nehalem will be able to perform two 128-bit shuffles in a single clock.

Igor Levicki (link) - 07 06 08 - 20:43


PALIGNR requires an immediate argument, so it's difficult to use in anything variable like a bitstream decoder. Another problem is that you need to get data between the MMX/XMM and scalar registers in order to handle table indexing, which is really slow on P4 -- although I noticed that it's fast on anything PPro/PM/Core derived. I'm not convinced that there's much room to optimize here given that on a Core 2 you already have fast scalar operations throughout. You're talking about less than a dozen ops that are already single clock or nearly single clock, and with GPRs you can index directly off the result, whereas with XMM you've got to move it to GPRs first. I had thought about keeping the bit table entirely in XMM registers, but when you count out the ops it just doesn't come out ahead. You're basically competing against something like: MOV, MOV, SHLD, OR, BSR, MOV, SHR, MOV, OR, ADD, ADC. There's very little parallelism you could get from that, and none that I see that the CPU itself wouldn't already find.

Phaeron - 07 06 08 - 22:49


Now I am curious... Could you point me to the relevant source code so I can take a look?

Igor Levicki (link) - 08 06 08 - 11:34


Disregard the above comment, I already found it myself.

Igor Levicki (link) - 08 06 08 - 11:42


Avery, is there any equivalent C code for that assembler? I would love to see how a modern compiler would optimize it.

Igor Levicki (link) - 08 06 08 - 15:04


Yes, actually:
http://forums.virtualdub.org/index.php?a..

It's in src/Meia/source/decode_huffyuv.cpp.

Visual C++ doesn't generate very good code for it, mainly because its code generator blows with 64-bit types. While working on it, I ended up filing no less than three bugs just based on 64-bit shifts alone (346077, 346078, and 348683). I did manage to get it to generate decent enough code that it's usable.

Phaeron - 08 06 08 - 15:58


Don't start me about compiler bugs...

I checked the code. I see it uses variable-length codes. The Intel Performance Primitives library has optimized functions for VLC coding/decoding -- you might want to evaluate those. If you point me to the line of that source file which is the hot spot in your actual performance measurements, I could perhaps look into it a little bit more when I have time.

Igor Levicki (link) - 22 06 08 - 01:31
