¶Weighted averaging in SSE
I hereby declare today stupid assembly tricks day.
I'm working on optimizing some conversion code, and one of the tasks I need to do is quickly resample an 8-bit chroma plane from a 4:1:0 image up to 4:4:4, essentially removing the subsampling. This is a 4x enlargement in both horizontal and vertical directions. For the vertical direction, the scanlines of the 4:4:4 image fall between the chroma scanlines of the 4:1:0 image such that I need four filter phases:
- 7/8, 1/8
- 5/8, 3/8
- 3/8, 5/8
- 1/8, 7/8
This can be handled with a generic routine, but the question here is whether this can be done more efficiently with specialized routines. Looking at the above, the list is symmetrical, so we can immediately discard two of the entries: the bottom two filters can be implemented in terms of the top two filters by swapping the scanline inputs. That leaves two filters, a [7 1]/8 and a [5 3]/8.
Now, I'm dealing with 8-bit components here, and the most straightforward way of doing this is a good old fashioned weighted sum. In MMX, this can be done as follows:
movd mm0, [ecx]
movd mm1, [edx]
punpcklbw mm0, mm7
punpcklbw mm1, mm7
pmullw mm0, weight1
pmullw mm1, weight2
paddw mm0, mm1
paddw mm0, round
psraw mm0, FRACTIONAL_BITS
packuswb mm0, mm0
movd [ebx], mm0
This computes four samples in 11 instructions. It can easily be generalized to SSE2 simply by doubling the register widths. This is the baseline to beat.
So how do we speed this up?
The weighted sum itself is already decent. In SSE2, you can sometimes remove one of the additions by interleaving components and using PMADDWD instead of PMULLW x 2 + PADDW, but you then have an extra pack step from dword to word, and so I generally only do that if I need extra precision. Unpacking and packing is painful because it needs additional register space and it stresses the pack/shift unit, which is in short supply compared to addition units. No, in order to beat this significantly, we need to avoid the unpack / pack cycle, and that means working directly on bytes. Unfortunately, working directly on bytes in MMX/SSE has a few problems:
- There are no psrlb, psllb, or psrab instructions, thus extra masking operations are required.
- pcmpgtb only works on signed bytes.
- There are no multiply operations that work on bytes until SSSE3 (pmaddubsw).
- The SSE pavgb instruction only works on unsigned bytes.
If you're confused about the last point, the reason why I bring it up has to do with precision issues in intermediate computations. In particular, if you add or subtract two full-range bytes (8 bits), you gain an extra significant bit and the result is 9 bits. Unless you use a widening operation like pmaddwd or pmaddubsw, you're going to lose a bit either on the high end (paddb/psubb), or on the low end (pavgb). Now, the trick here is that you can lose an LSB on one intermediate result and still come out with a correctly rounded answer, but lose a bit from two intermediate values, and you've usually lost accuracy. The trick I usually use is to convert (a*(1-x) + b*x) to (a + (b-a)*x), but (b-a) is a 9-bit signed result and that's kind of hard to deal with.
Fortunately, there's a way around this, at least for the [7 1] case. I've found that it helps to think more like a hardware engineer in cases like this: you need to concentrate on what the operations actually do and not what they're intended to be used for.
First, rewrite round((a*7 + b)/8) to (a + ((b-a+4) >> 3)). Now, we'd like to use pavgb here, but pavgb computes an addition ((a+b+1)>>1) and we need a subtraction. What we can do, however, is change addition into subtraction by an algorithm for 8-bit negation: -x == ~x + 1 - 0x100. Therefore, we can rewrite as follows:
(b - a + 4) >> 3
= (b + ~a + 5 - 0x100) >> 3
= ((b + ~a + 5) >> 3) - 0x20
Next, we decompose the operation into individual bit shifts:
((b + ~a + 5) >> 3) - 0x20
= ((((b + ~a + 1) >> 1) + 2) >> 2) - 0x20
= (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20
This probably looks a bit crazy, but there is a reason for the madness, and that is that we can repeatedly convert ((a + b + 1) >> 1) to pavgb(a, b):
(((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20
= (((pavgb(b, ~a) >> 1) + 1) >> 1) - 0x20
= pavgb(pavgb(b, ~a) >> 1, 0) - 0x20
The result, translated to assembly:
movq mm0, [ecx]
movq mm1, [ebx]
movq mm2, mm0
pxor mm0, xFFFFFFFFFFFFFFFF
pavgb mm0, mm1
pand mm0, xFEFEFEFEFEFEFEFE
psrlq mm0, 1
pavgb mm0, x0000000000000000
paddb mm2, xE0E0E0E0E0E0E0E0
paddb mm0, mm2
movq [edx], mm0
Note that while this is also 11 instructions, it processes 8 samples at a time instead of 4, and many of the instructions have lower latencies and fewer execution port restrictions than the unpack/mul/add/pack solution. Quick benchmarking shows the byte-oriented solution to be 40% faster, although admittedly for a really good comparison I'd need to convert both routines to SSE2 and double the number of pixels handled per loop. I think I can eliminate one of the constants by moving the complement and replacing pand + psrlq with another pavgb, but I haven't looked into that yet.
I haven't figured out how to handle the [5 3] case yet. It looks tougher due to the factor of three, but there might still be a way to structure the evaluation order to get all the rounding right. The question is whether or not it'll be faster than the straightforward solutions.