Compiler intrinsics, revisited
I received an email recently from a member of the Microsoft Visual C++ compiler team who is working on the AMD64 compiler, regarding my comments about intrinsics support in the VC++ compiler. Given my past feedback on this blog and in the MSDN Product Feedback Center about the quality of the intrinsics in VC++, one of two possibilities came to mind:
- I had mortally offended the Visual C++ compiler team and had received a notice to appear in Redmond for a formal challenge to the death; or
- They wanted to inform me of significant improvements made to the compiler in the Visual Studio .NET 2005 "Whidbey" public beta.
Fortunately, the team member turned out to be a nice guy and informed me that intrinsics support had indeed been improved in Whidbey.
To review, compiler intrinsics are pseudo-functions that expose CPU functionality that doesn't fit well into C/C++ constructs. Simple operations like add and subtract map nicely to + and -, but four-way-packed-multiply-signed-and-add-pairs doesn't. Instead, the compiler exposes a __m64 type and a _m_pmaddwd() pseudo-function that you can use. In theory, you get the power and speed of specialized CPU primitives, with some of the portability benefits of using C/C++ over straight assembly language. The problem in the past was that Visual Studio .NET 2003 and earlier generated poor code for these primitives that was either incorrect or slower than what could be written straight in assembly language with moderate effort.
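To make that concrete, here's a minimal sketch of what such code looks like (my own illustrative example, not from any shipping routine): a four-element dot product of 16-bit values built on the pmaddwd intrinsic mentioned above.

#include <mmintrin.h>

// Sketch: 4-element dot product of packed 16-bit values via pmaddwd.
int dot4(__m64 a, __m64 b) {
    __m64 sums = _m_pmaddwd(a, b);                      // two 32-bit partial sums
    sums = _mm_add_pi32(sums, _mm_srli_si64(sums, 32)); // fold the high sum into the low one
    int result = _mm_cvtsi64_si32(sums);
    _mm_empty();                                        // clear MMX state before returning
    return result;
}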
The good news
Here's the routine using SSE2 intrinsics that I used to punish the compiler last time I wrote about this problem:
#include <emmintrin.h>

unsigned premultiply_alpha(unsigned px) {
    __m128i px8      = _mm_cvtsi32_si128(px);                           // pixel into low 32 bits
    __m128i px16     = _mm_unpacklo_epi8(px8, _mm_setzero_si128());     // expand bytes to words
    __m128i alpha    = _mm_shufflelo_epi16(px16, 0xff);                 // broadcast alpha to all four channels
    __m128i result16 = _mm_srli_epi16(_mm_mullo_epi16(px16, alpha), 8); // multiply and scale back down
    return _mm_cvtsi128_si32(_mm_packus_epi16(result16, result16));     // pack back to bytes
}
Here's what Visual Studio .NET 2003 generates for this function:
pxor xmm0, xmm0
movdqa xmm1, xmm0
movd xmm0, ecx
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw xmm0, xmm1
psrlw xmm0, 8
movdqa xmm1, xmm0
packuswb xmm1, xmm0
movd eax, xmm1
ret
Note the unnecessary movdqa instructions; these are expensive on Pentium 4, where each one adds 6 clocks to your dependency chain.
Here's what Visual Studio .NET 2005 generates for this function:
pxor xmm1, xmm1
movd xmm0, ecx
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw xmm0, xmm1
psrlw xmm0, 8
packuswb xmm0, xmm0
movd eax, xmm0
ret
That's actually not too bad. Fairly good, even.
The bad news
The last time I did this test, I posted the following result for Visual Studio .NET 2003:
push ebp
mov ebp, esp
pxor xmm0, xmm0
movdqa xmm1, xmm0
movd xmm0, DWORD PTR _px$[ebp]
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw xmm0, xmm1
psrlw xmm0, 8
movdqa xmm1, xmm0
packuswb xmm1, xmm0
and esp, -16
movd eax, xmm1
mov esp, ebp
pop ebp
ret 0
The reason for the discrepancy is that I cheated in the tests above by using the /Gr compiler switch to force the __fastcall calling convention. Part of the problem with the VC++ intrinsics is that they have a habit of forcing an aligned stack frame whenever stack parameters are accessed in a function that uses intrinsics, even when nothing actually requires the alignment. This is unfortunate, as it slows down the prolog/epilog and eats an additional register. Sadly, this is not fixed in Whidbey, although it is a moot point on AMD64, where the stack is always 16-byte aligned. Using the __fastcall convention can fix this on x86 if all parameters can be passed in registers, but that isn't possible if you have more than 8 bytes of parameters.
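For a single function, the per-function equivalent of /Gr is just the __fastcall keyword on the declaration (a sketch of the workaround; the tests above used the global switch):

// With __fastcall, px arrives in ECX rather than on the stack, so there is
// no stack parameter access to trigger the forced aligned frame.
unsigned __fastcall premultiply_alpha(unsigned px);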
The other bad news is that the MMX intrinsics still produce awful code, although this is only pertinent to x86, since the AMD64 compiler doesn't support MMX instructions at all. At least the bugs that allowed MMX code to move past floating-point or EMMS instructions have been fixed:
pxor mm1, mm1
movd mm0, ecx
punpcklbw mm0, mm1
movq mm1, mm0
movq mm2, mm0
punpckhwd mm1, mm2
movq mm2, mm1
punpckhwd mm2, mm1
pmullw mm0, mm2
psrlw mm0, 8
movq mm1, mm0
packuswb mm1, mm0
movd eax, mm1
emms
ret
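For reference, here's roughly what an MMX intrinsics version of the routine looks like (my reconstruction of a plausible source, not necessarily the exact code behind the listing above); note the pair of unpacks needed just to splat alpha, since plain MMX has no pshufw:

#include <mmintrin.h>

unsigned premultiply_alpha_mmx(unsigned px) {
    __m64 px8    = _mm_cvtsi32_si64(px);
    __m64 px16   = _mm_unpacklo_pi8(px8, _mm_setzero_si64());   // expand bytes to words
    __m64 hi     = _mm_unpackhi_pi16(px16, px16);               // [r, r, a, a]
    __m64 alpha  = _mm_unpackhi_pi16(hi, hi);                   // [a, a, a, a]
    __m64 scaled = _mm_srli_pi16(_mm_mullo_pi16(px16, alpha), 8);
    unsigned result = (unsigned)_mm_cvtsi64_si32(_mm_packs_pu16(scaled, scaled));
    _mm_empty();                                                // emms
    return result;
}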
Conclusions
The aligned stack frame is a bummer for codelet libraries, but it isn't as big a deal if you can isolate intrinsics code into big, long-duration functions that aren't under critical register pressure. The improvements to SSE2 code generation make intrinsics more attractive in Whidbey, but since AMD64 isn't widespread yet and SSE2 is only supported on the Pentium 4, Pentium M, and Athlon 64, they're still unusable for mainstream code on x86. They're also rather difficult to read compared to assembly code. I still don't think I'd end up using them even after Whidbey ships, because it would make my x86 and AMD64 code bases diverge further without much gain.
Another problem is that although all SSE2 instructions are available through intrinsics, and many non-vector intrinsics have been added in Whidbey, there are still a large number of tricks that can only be done directly in assembly language, many of which involve extended-precision arithmetic and the carry flag. The one that I use all the time is the split 32:32 fixed-point accumulator, where two 32-bit registers hold the integer and fractional parts of a value. This is very frequently required in scaling and interpolation routines. The advantage is that you can get to the integer portion very quickly. In x86:
add ebx, esi                    ; add the fractional parts
adc ecx, edi                    ; add the integer parts, plus the carry out of the fraction
mov eax, dword ptr [ecx*4]      ; integer part is immediately usable as an index
In AMD64 you can sometimes get away with half the registers if you only need a 32-bit result, by swapping the low and high halves and wrapping the carry around:
add rbx, rcx                    ; fraction in the high half, integer in the low half
adc rbx, 0                      ; carry out of the fraction wraps around into the integer part
mov [rdx], ebx                  ; low 32 bits hold the integer result
Compiler intrinsics don't let you do this.
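For comparison, the closest portable equivalent keeps the whole accumulator in one 64-bit integer, which costs an extra shift per sample just to reach the integer part (a sketch of a simple point-sampling scaler, assuming C99/C++ and stdint.h):

#include <stdint.h>

// Sketch: 32:32 fixed-point stepping in plain C. The integer part lives in
// the high 32 bits, so every iteration pays a shift that the add/adc version
// above gets for free.
void scale_row(uint32_t *dst, const uint32_t *src, int n,
               uint64_t accum, uint64_t step) {
    for (int i = 0; i < n; ++i) {
        dst[i] = src[accum >> 32];  // shift down to recover the integer index
        accum += step;
    }
}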
Another problem I run into often in routines that are MMX or SSE2 heavy is a critical shortage of general purpose registers, usually for scanline pointers, fixed-point accumulators, and counters. The way I get around this on x86 is to make use of the Structured Exception Handling (SEH) chain to temporarily hold the stack pointer, freeing it for use as an eighth general purpose register:
push 0                          ; dummy handler field for a fake SEH record
push fs:dword ptr [0]           ; link to the previous record in the chain
mov fs:dword ptr [0], esp       ; chain head now points into the stack
...                             ; esp is free for use as a general purpose register
mov esp, fs:dword ptr [0]       ; recover the stack pointer from the chain head
pop fs:dword ptr [0]            ; restore the previous chain head
pop eax                         ; discard the dummy handler field
...and then be really careful not to cause an exception while within the block.
This allows a routine to use all eight registers and still be reentrant. It's probably unreasonable to expect a compiler to generate code like this, though.