§ ¶SIMD averaging without SIMD hardware
If you need to average two sets of bitfields using larger word operations, there is a trick derived from a hardware algorithm for constructing adders that can come in handy. The basic comes from the design of a carry-save adder, which splits apart carry and sum signals to allow multiple values to be added with only one carry propagation pass at the end:
a+b = ((a & b) << 1) + (a ^ b)
(& is the C operator for bitwise AND, ^ is bitwise exclusive OR, and << is a left shift.)
To convert this into an average, propagate through the necessary shift:
(a+b) >> 1 = (a & b) + ((a ^ b) >> 1)
...and to apply this to bitfields, just mask off the least significant bits from each bitfield sum to avoid cross-field contamination:
(pel0+pel1) >> 1 = (pel0 & pel1) + (((pel0 ^ pel1) & 0xfefefefe)>>1)
This can be useful even if you have SIMD hardware that has built-in averaging. For instance, SSE has support for unsigned byte averages, but the above algorithm can be used to average four 565 pixels at a time by using the mask 0xf7def7def7def7de.
Note that the above algorithm always rounds down, which isn't always appropriate; MPEG motion prediction, for instance, requires rounding up. There is a trick that can be used to average up without using any additional operations.
The trick is to recognize that the addition can also be formulated using inclusive OR and subtraction:
a+b = ((a & b) << 1) + (a^b) = ((a|b) << 1) - (a^b)
Shifting out the LSB from the exclusive OR result then has the effect of rounding up instead of rounding down, giving:
(a+b+1) >> 1 = (a|b) - ((a^b)>>1)
(pel0+pel1+1)>>1 = (pel0 | pel1) - (((pel0 ^ pel1) & 0xfefefefe)>>1)
Comments
Comments posted:
cool!
in the last line, you mean:
(pel0+pel1+1)>>1 = (pel0 | pel1) ***MINUS*** (((pel0 ^ pel1) & 0xfefefefe)>>1)
User - 06 07 06 - 04:45
How about compiling something like HAKMEM 2... :)
[maven] (link) - 06 07 06 - 06:57
@User:
Whoops, fixed. Thanks!
Phaeron - 06 07 06 - 23:34
Any smart-math way of doing an 8bit alpha blend (8bit source, 8bit destination, 8bit alpha)?
Blight - 07 07 06 - 01:04
Isn't it quite senseless to program without doing it 'the SIMD way'?
I mean, basicly every processor in existence today has some form of SIMD instruction set.
/Zaro
Zaro - 07 07 06 - 04:52
@Zaro, cleaverness like this is probably even more appropriate in SIMD processing. Think harder about the 565 pixel example Avery gave.
@Blight, I take it you, like me, want an improved way to do
Dst += (Src - Dst) * Alpha / 255;
which in regular code could be implemented quickish as
Dst += LookupTable[(Src-Dst)
IanB - 07 07 06 - 22:33
Doh!, Smartarse blog parser! Take 2
*Dst += LookupTable[(Src-Dst) ._lshift_. 8 | alpha]; // 17 bit table*
Which doesn't play well in SIMD.
Thoughts anybody?
IanB - 07 07 06 - 22:42
@Zaro:
You're assuming that the SIMD instruction set natively supports the datatypes you are working with. MMX and SSE are rather rigid in this regard, with some support for unsigned bytes, a lot of support for signed words, and very sketchy support for everything else. Conversion to and from signed words tends to put a lot of pressure on the shift unit, so to the extent that you can find alternate ways to work with odd bitfields, the better performance you can get. I hear that Altivec is a lot nicer with regard to data type and operation orthogonality, but it still isn't going to support every combination of packed data type that you will encounter in the wild.
Phaeron - 08 07 06 - 00:07