v1.10.4 (stable)

Main page
Archived news
Documentation
Capture
Compiling
Processing
Crashes
Features
Filters
Plugin SDK
Knowledge base
Contact info

Other projects
Altirra

Blog Archive

## ¶Weighted averaging in SSE

I hereby declare today stupid assembly tricks day.

I'm working on optimizing some conversion code, and one of the tasks I need to do is quickly resample an 8-bit chroma plane from a 4:1:0 image up to 4:4:4, essentially removing the subsampling. This is a 4x enlargement in both horizontal and vertical directions. For the vertical direction, the scanlines of the 4:4:4 image fall between the chroma scanlines of the 4:1:0 image such that I need four filter phases:

• 7/8, 1/8
• 5/8, 3/8
• 3/8, 5/8
• 1/8, 7/8

This can be handled with a generic routine, but the question here is whether this can be done more efficiently with specialized routines. Looking at the above, the list is symmetrical, so we can immediately discard two of the entries: the bottom two filters can be implemented in terms of the top two filters by swapping the scanline inputs. That leaves two filters, a [7 1]/8 and a  [5 3]/8.

Now, I'm dealing with 8-bit components here, and the most straightforward way of doing this is a good old fashioned weighted sum. In MMX, this can be done as follows:

`movd mm0, [ecx]movd mm1, [edx]punpcklbw mm0, mm7punpcklbw mm1, mm7pmullw mm0, weight1pmullw mm1, weight2paddw mm0, mm1paddw mm0, roundpsraw mm0, FRACTIONAL_BITSpackuswb mm0, mm0movd [ebx], mm0`

This computes four samples in 11 instructions. It can easily be generalized to SSE2 simply by doubling the register widths. This is the baseline to beat.

So how do we speed this up?

The weighted sum itself is already decent. In SSE2, you can sometimes remove one of the additions by interleaving components and using PMADDWD instead of PMULLW x 2 + PADDW, but you then have an extra pack step from dword to word, and so I generally only do that if I need extra precision. Unpacking and packing is painful because it needs additional register space and it stresses the pack/shift unit, which is in short supply compared to addition units. No, in order to beat this significantly, we need to avoid the unpack / pack cycle, and that means working directly on bytes. Unfortunately, working directly on bytes in MMX/SSE has a few problems:

• There are no psrlb, psllb, or psrab instructions, thus extra masking operations are required.
• pcmpgtb only works on signed bytes.
• There are no multiply operations that work on bytes until SSSE3 (pmaddubsw).
• The SSE pavgb instruction only works on unsigned bytes.

If you're confused about the last point, the reason why I bring it up has to do with precision issues in intermediate computations. In particular, if you add or subtract two full-range bytes (8 bits), you gain an extra significant bit and the result is 9 bits. Unless you use a widening operation like pmaddwd or pmaddubsw, you're going to lose a bit either on the high end (paddb/psubb), or on the low end (pavgb). Now, the trick here is that you can lose an LSB on one intermediate result and still come out with a correctly rounded answer, but lose a bit from two intermediate values, and you've usually lost accuracy. The trick I usually use is to convert (a*(1-x) + b*x) to (a + (b-a)*x), but (b-a) is a 9-bit signed result and that's kind of hard to deal with.

Fortunately, there's a way around this, at least for the [7 1] case. I've found that it helps to think more like a hardware engineer in cases like this: you need to concentrate on what the operations actually do and not what they're intended to be used for.

First, rewrite round((a*7 + b)/8) to (a + ((b-a+4) >> 3)). Now, we'd like to use pavgb here, but pavgb computes an addition ((a+b+1)>>1) and we need a subtraction. What we can do, however, is change addition into subtraction by an algorithm for 8-bit negation: -x == ~x + 1 - 0x100. Therefore, we can rewrite as follows:

`  (b - a + 4) >> 3= (b + ~a + 5 - 0x100) >> 3= ((b + ~a + 5) >> 3) - 0x20`

Next, we decompose the operation into individual bit shifts:

`  ((b + ~a + 5) >> 3) - 0x20= ((((b + ~a + 1) >> 1) + 2) >> 2) - 0x20= (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20`

This probably looks a bit crazy, but there is a reason for the madness, and that is that we can repeatedly convert ((a + b + 1) >> 1) to pavgb(a, b):

`  (((((b + ~a + 1) >> 1) >> 1) + 1) >> 1) - 0x20= (((pavgb(b, ~a) >> 1) + 1) >> 1) - 0x20= pavgb(pavgb(b, ~a) >> 1, 0) - 0x20`

The result, translated to assembly:

`movq  mm0, [ecx]movq  mm1, [ebx]movq  mm2, mm0pxor  mm0, xFFFFFFFFFFFFFFFFpavgb mm0, mm1pand  mm0, xFEFEFEFEFEFEFEFEpsrlq mm0, 1pavgb mm0, x0000000000000000paddb mm2, xE0E0E0E0E0E0E0E0paddb mm0, mm2movq  [edx], mm0`

Note that while this is also 11 instructions, it processes 8 samples at a time instead of 4, and many of the instructions have lower latencies and fewer execution port restrictions than the unpack/mul/add/pack solution. Quick benchmarking shows the byte-oriented solution to be 40% faster, although admittedly for a really good comparison I'd need to convert both routines to SSE2 and double the number of pixels handled per loop. I think I can eliminate one of the constants by moving the complement and replacing pand + psrlq with another pavgb, but I haven't looked into that yet.

I haven't figured out how to handle the [5 3] case yet. It looks tougher due to the factor of three, but there might still be a way to structure the evaluation order to get all the rounding right. The question is whether or not it'll be faster than the straightforward solutions.