¶SIMD averaging without SIMD hardware
If you need to average two sets of bitfields using larger word operations, there is a trick derived from a hardware algorithm for constructing adders that can come in handy. The basic comes from the design of a carry-save adder, which splits apart carry and sum signals to allow multiple values to be added with only one carry propagation pass at the end:
a+b = ((a & b) << 1) + (a ^ b)
(& is the C operator for bitwise AND, ^ is bitwise exclusive OR, and << is a left shift.)
To convert this into an average, propagate through the necessary shift:
(a+b) >> 1 = (a & b) + ((a ^ b) >> 1)
...and to apply this to bitfields, just mask off the least significant bits from each bitfield sum to avoid cross-field contamination:
(pel0+pel1) >> 1 = (pel0 & pel1) + (((pel0 ^ pel1) & 0xfefefefe)>>1)
This can be useful even if you have SIMD hardware that has built-in averaging. For instance, SSE has support for unsigned byte averages, but the above algorithm can be used to average four 565 pixels at a time by using the mask 0xf7def7def7def7de.
Note that the above algorithm always rounds down, which isn't always appropriate; MPEG motion prediction, for instance, requires rounding up. There is a trick that can be used to average up without using any additional operations.
The trick is to recognize that the addition can also be formulated using inclusive OR and subtraction:
a+b = ((a & b) << 1) + (a^b) = ((a|b) << 1) - (a^b)
Shifting out the LSB from the exclusive OR result then has the effect of rounding up instead of rounding down, giving:
(a+b+1) >> 1 = (a|b) - ((a^b)>>1)
(pel0+pel1+1)>>1 = (pel0 | pel1) - (((pel0 ^ pel1) & 0xfefefefe)>>1)