¶Weighted averaging in SSE (part 2)
Last time I talked about a faster way to do parallel averaging between 8-bit components in SSE with unequal weights, specifically the [1 7]/8 case. This time, I'll improve on the [1 7] case and also show how to do the [3 5] case.
To recap, the idea centers on using the SSE pavgb instruction to stay within bytes by correctly rounding off LSBs each time we introduce a new intermediate result with an extra significant bit. Previously, I did this by converting a subtraction to an addition and using a correction factor. It turns out that there's another way to do this that doesn't require a correction factor and needs fewer constants, which leaves more temporary registers available.
Recall that for the [1 7] case, we are trying to compute the expression:
r = round((a*7 + b)/8) = (a*7 + b + 4) >> 3
Given 8-bit input values, this gives an 11-bit intermediate result, which must be rounded. Previously, I converted round((a*7 + b)/8) to (a + round((b-a)/8)), but it turns out that the first form is better. The reason is that it can be decomposed as follows:
round((a*7+b)/8)
= (a*7 + b + 4) >> 3
= (a*4 + a*2 + a + b + 4) >> 3
= (((((a + b) >> 1) + a) >> 1) + a + 1) >> 1
= average_round_up(average_round_down(average_round_down(a, b), a), a)
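If the bit-dropping in the intermediate averages seems suspicious, a brute-force scalar check over all 8-bit input pairs will confirm that the decomposition is exact. Here's a small sketch in C; avg_dn() and avg_up() are just illustrative names for the round-down and round-up averages above, not part of the actual routine.

#include <stdio.h>

/* Illustrative scalar versions of the two averaging primitives. */
static unsigned avg_dn(unsigned a, unsigned b) { return (a + b) >> 1; }      /* round down */
static unsigned avg_up(unsigned a, unsigned b) { return (a + b + 1) >> 1; }  /* round up   */

int main(void) {
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            unsigned expect = (a*7 + b + 4) >> 3;
            unsigned got    = avg_up(avg_dn(avg_dn(a, b), a), a);
            if (expect != got)
                printf("mismatch: a=%u b=%u expect=%u got=%u\n", a, b, expect, got);
        }
    }
    return 0;
}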
average_round_up() is, of course, simply pavgb. average_round_down() is a bit trickier. Ordinarily you could do this with an add and a shift, but we can't do that here because the add gives us a 9-bit result and we'd immediately lose one of the bits. There is, unfortunately, no variant of pavgb that rounds down, since it was introduced to speed up MPEG motion compensation, which specifies rounding up on 0.5. What we can do, however, is use a sort of mutant of DeMorgan's Theorem and complement both the inputs and the output of pavgb, which converts its natural round-up behavior to round-down:
average_round_down(a, b) = ~average_round_up(~a, ~b) = ~pavgb(~a, ~b)
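In SSE2 intrinsic terms, the identity looks something like the sketch below; the helper name is mine, and _mm_avg_epu8() is the intrinsic form of pavgb:

#include <emmintrin.h>

/* Round-down byte average built from pavgb by complementing both inputs
   and the output: ~pavgb(~a, ~b).  Helper name is purely illustrative. */
static __m128i avg_round_down_epu8(__m128i a, __m128i b) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    __m128i na = _mm_xor_si128(a, ones);
    __m128i nb = _mm_xor_si128(b, ones);
    return _mm_xor_si128(_mm_avg_epu8(na, nb), ones);
}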
This might not seem advantageous since it adds three additional operations, but if we fold it into the last expression, some of the complements cancel and there are some common subexpressions:
average_round_up(average_round_down(average_round_down(a, b), a), a)
= pavgb(average_round_down(~pavgb(~a, ~b), a), a)
= pavgb(~pavgb(~~pavgb(~a, ~b), ~a), a)
= pavgb(~pavgb(pavgb(~a, ~b), ~a), a)
Now we can write the SSE form:
pcmpeqb mm7, mm7    ;mm7 = all ones (complement mask)
movd mm0, a         ;mm0 = a
movd mm1, b         ;mm1 = b
movq mm2, mm0       ;mm2 = a (saved for the final average)
pxor mm0, mm7       ;mm0 = ~a
pxor mm1, mm7       ;mm1 = ~b
pavgb mm1, mm0      ;mm1 = pavgb(~a, ~b)
pavgb mm1, mm0      ;mm1 = pavgb(pavgb(~a, ~b), ~a)
pxor mm1, mm7       ;mm1 = average_round_down(average_round_down(a, b), a)
pavgb mm1, mm2      ;mm1 = result
Benchmarks show that this is faster than the previous method, but not by much (<10%). However, it only requires one constant register and three temporaries, so it is much easier to unroll this by 2x on x86 and 4x on x64 for a slight gain. And, of course, it's trivial to widen to 128-bit SSE2 operations.
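For what it's worth, a 128-bit SSE2 version with intrinsics might look like the sketch below; the function name and the choice to keep everything in registers are my own, not part of the original routine:

#include <emmintrin.h>

/* round((a*7 + b)/8) per byte, 16 pixels at a time.  Sketch only. */
static __m128i weighted_avg_7_1(__m128i a, __m128i b) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    __m128i na = _mm_xor_si128(a, ones);   /* ~a */
    __m128i nb = _mm_xor_si128(b, ones);   /* ~b */
    __m128i t  = _mm_avg_epu8(na, nb);     /* pavgb(~a, ~b) */
    t = _mm_avg_epu8(t, na);               /* pavgb(pavgb(~a, ~b), ~a) */
    t = _mm_xor_si128(t, ones);            /* = avg_down(avg_down(a, b), a) */
    return _mm_avg_epu8(t, a);             /* final round-up average against a */
}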
Now, what about the [3 5] case? Well, it's basically the same thing, but the order of additions is a bit different:
round((a*5+b*3)/8)
= ((a*5 + b*3) + 4) >> 3
= (a*4 + a + b*2 + b + 4) >> 3
= (((((a + b) >> 1) + b) >> 1) + a + 1) >> 1
= average_round_up(average_round_down(average_round_down(a, b), b), a)
Therefore, we can do the [3 5] case just by switching around which input is dominant in the second average:
pcmpeqb mm7, mm7    ;mm7 = all ones (complement mask)
movd mm0, a         ;mm0 = a
movd mm1, b         ;mm1 = b
movq mm2, mm0       ;mm2 = a (saved for the final average)
pxor mm0, mm7       ;mm0 = ~a
pxor mm1, mm7       ;mm1 = ~b
pavgb mm0, mm1      ;mm0 = pavgb(~a, ~b)
pavgb mm0, mm1      ;mm0 = pavgb(pavgb(~a, ~b), ~b)
pxor mm0, mm7       ;mm0 = average_round_down(average_round_down(a, b), b)
pavgb mm0, mm2      ;mm0 = result
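An SSE2 intrinsics rendering of this variant differs from the [1 7] sketch above in just one operand, with the complemented b feeding the second average (again, the name is purely illustrative):

#include <emmintrin.h>

/* round((a*5 + b*3)/8) per byte.  Sketch only. */
static __m128i weighted_avg_5_3(__m128i a, __m128i b) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    __m128i na = _mm_xor_si128(a, ones);
    __m128i nb = _mm_xor_si128(b, ones);
    __m128i t  = _mm_avg_epu8(na, nb);     /* pavgb(~a, ~b) */
    t = _mm_avg_epu8(t, nb);               /* second average folds in ~b instead of ~a */
    t = _mm_xor_si128(t, ones);            /* = avg_down(avg_down(a, b), b) */
    return _mm_avg_epu8(t, a);             /* final round-up average against a */
}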
In fact, with a little thought, this method can be generalized to any weighted sum whose factors add up to a power of two. It doesn't work if we have to add more than two values at any bit position, but because the weights add to a power of two, this never happens. Therefore, this method can be adapted to other useful weightings for chroma subsampling conversion, notably the [1 3]/4 case, sketched below.
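As an example of that generalization, the [1 3]/4 weighting works out to round((a + b*3)/4) = average_round_up(average_round_down(a, b), b), which is the same style of decomposition with one fewer averaging step. A minimal intrinsics sketch, with a function name of my own choosing:

#include <emmintrin.h>

/* round((a + b*3)/4) per byte: one round-down average, then a round-up
   average that folds in b again.  Sketch only. */
static __m128i weighted_avg_1_3(__m128i a, __m128i b) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    __m128i t = _mm_avg_epu8(_mm_xor_si128(a, ones),
                             _mm_xor_si128(b, ones));  /* pavgb(~a, ~b) */
    t = _mm_xor_si128(t, ones);                        /* = avg_down(a, b) */
    return _mm_avg_epu8(t, b);                         /* round-up average against b */
}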