## Weighted averaging in SSE (part 2)

Last time I talked about a faster way to do parallel averaging between 8-bit components in SSE with unequal weights, specifically the [1 7]/8 case. This time, I'll improve on the [1 7] case and also show how to do the [3 5] case.

To recap, the idea centers around using the SSE pavgb instruction to stay within bytes by rounding off LSBs correctly each time we introduce a new intermediate result with an extra significant bit. Previously, I did this by converting a subtraction to an addition and using a correction factor. It turns out that there's another way to do this that doesn't require a correction factor and requires fewer constants, which means more temp registers available.

Recall that for the [1 7] case, we are trying to compute the expression:

r = round((a*7 + b)/8) = (a*7 + b + 4) >> 3

Given 8-bit input values, this gives an 11-bit intermediate result which must be rounded. Previously, I converted round((a*7 + b)/8) to (a + round((b-a)/8)), but it turns out that the first form is better, because it can be decomposed as follows:

round((a*7+b)/8)
= (a*7 + b + 4) >> 3
= (a*4 + a*2 + a + b + 4) >> 3
= (((((a + b) >> 1) + a) >> 1) + a + 1) >> 1
= average_round_up(average_round_down(average_round_down(a, b), a), a)

average_round_up() is, of course, simply pavgb. average_round_down() is a bit trickier. Ordinarily you could do it with an add and a shift, but here the add would give a 9-bit result and we'd immediately lose the top bit. There is, unfortunately, no variant of pavgb that rounds down, since the instruction was introduced to speed up MPEG motion compensation, which specifies rounding up on 0.5. What we can do, however, is use a sort of mutant of DeMorgan's Theorem: complement both the inputs and the output of pavgb, which converts its natural round-up behavior into round-down:

average_round_down(a, b) = ~average_round_up(~a, ~b) = ~pavgb(~a, ~b)

This might not seem advantageous since it adds three additional operations, but if we fold it into the last expression, some of the complements cancel and there are some common subexpressions:

average_round_up(average_round_down(average_round_down(a, b), a), a)
= pavgb(average_round_down(~pavgb(~a, ~b), a), a)
= pavgb(~pavgb(~~pavgb(~a, ~b), ~a), a)
= pavgb(~pavgb(pavgb(~a, ~b), ~a), a)

Now we can write the SSE form:

```
pcmpeqb mm7, mm7
movd    mm0, a
movd    mm1, b
movq    mm2, mm0
pxor    mm0, mm7
pxor    mm1, mm7
pavgb   mm1, mm0
pavgb   mm1, mm0
pxor    mm1, mm7
pavgb   mm1, mm2
```

Benchmarks show that this is faster than the previous method, but not by much (<10%). However, it only requires one constant register and three temporaries, so it is much easier to unroll this by 2x on x86 and 4x on x64 for a slight gain. And, of course, it's trivial to widen to 128-bit SSE2 operations.

Now, what about the [3 5] case? Well, it's basically the same thing, but the order of additions is a bit different:

round((a*5 + b*3)/8)
= ((a*5 + b*3) + 4) >> 3
= (a*4 + a + b*2 + b + 4) >> 3
= (((((a + b) >> 1) + b) >> 1) + a + 1) >> 1
= average_round_up(average_round_down(average_round_down(a, b), b), a)

Therefore, we can do the [3 5] case just by switching around which input is dominant in the second average:

```
pcmpeqb mm7, mm7
movd    mm0, a
movd    mm1, b
movq    mm2, mm0
pxor    mm0, mm7
pxor    mm1, mm7
pavgb   mm0, mm1
pavgb   mm0, mm1
pxor    mm0, mm7
pavgb   mm0, mm2
```

In fact, if you think about it a little bit, this method can actually be generalized to any weighted sum where the factors add up to a power of two. It doesn't work if we have to add more than two values at any bit position, but because the weights add to a power of two, this never happens. Therefore, this method can be adapted to other useful weightings for chroma subsampling conversion, notably the 1/4, 3/4 case.

Neat trick, that ~pavgb(~a, ~b). So the 1/4, 3/4 case would be this:

round((a*3 + b*1)/4)
= ((a*3 + b*1) + 2) >> 2
= (a*2 + a + b + 2) >> 2
= (((a + b) >> 1) + a + 1) >> 1
= average_round_up(average_round_down(a, b), a)

```
pcmpeqb mm7, mm7
movq    mm0, a
movq    mm1, b
movq    mm2, mm0
pxor    mm0, mm7
pxor    mm1, mm7
pavgb   mm0, mm1
pxor    mm0, mm7
pavgb   mm0, mm2
```

(Any chance of repairing my post in the previous thread and deleting the duplicate?)

IanB - 30 08 08 - 18:00

Yup, you got it. I went ahead and fixed your comment, too.

Now, what I haven't covered is how to speed up the _horizontal_ cases, which are also needed for chroma sampling conversion. That's because I haven't gotten around to it yet.

What I'm doing right now is ramping up a new blitter library in VirtualDub known as "uberblit." It already shipped in 1.8.x as the engine behind the YCbCr-capable resize filter, but I'm trying to swap it in as the main library so I can support a lot more formats, like interlaced 4:2:0 and 32F (the existing engine is not scalable). Problem is, the uberblit library is currently a lot slower than the main blitter on some format combinations due to multipassing, so I've been working on speeding it up. Right now the benchmarks show it faster than the main blitter for 109/144 combinations and at least 90% as fast for 132/144 combinations. A lot of the slow paths that are left are for conversions that aren't very interesting, like Y8 to RGB555.

The other trick is how to expose the new formats without breaking existing code, which could be annoying. Currently the main blitter uses a single format code, whereas uberblit uses colorspace/sampling/format bitfields, which invalidates table lookups.

Phaeron - 31 08 08 - 16:37

Oh, how I love your assembler ingeniousness. I would never have come up with this on my own. So let's see if I can understand and learn from it.

-SPM-Mad - 01 09 08 - 15:26

The pavgb round-down trick is pretty neat! It's better than pavgb(a, b) - ((a ^ b) & 1). Somewhat related to this: how do you use pavgb to calculate a signed average of two vectors containing packed signed bytes?

pshufb - 02 09 08 - 03:01

You can rebias a signed value to unsigned space by XORing with the sign bit (0x80 for bytes), so the trick I use here to round down also works for converting the unsigned average into a signed average. You can also combine the two by XORing with 0x7F in order to do a signed average with rounding down.

Note that in cases where an arithmetic right shift of the right size is available, it may be better to do the carry save trick I mentioned a while back: (a + b) >> 1 = (a & b) + ((a ^ b) >> 1) or (a + b + 1) >> 1 = (a | b) - ((a ^ b) >> 1). It's your only real option if you need to do averaging with just MMX, where you don't have pavgb.

Phaeron - 02 09 08 - 03:15