Compiler intrinsics, revisited
I received an email recently from a member of the Microsoft Visual C++ compiler team who is working on the AMD64 compiler, regarding my comments about intrinsics support in the VC++ compiler. Given my past feedback on this blog and in the MSDN Product Feedback Center about the quality of the intrinsics in VC++, one of two possibilities came to mind:
- I had mortally offended the Visual C++ compiler team and had received a notice to appear in Redmond for a formal challenge to the death; or
- They wanted to inform me of significant improvements made to the compiler in the Visual Studio .NET 2005 "Whidbey" public beta.
Fortunately, the team member turned out to be a nice guy and informed me that intrinsics support had indeed been improved in Whidbey.
To review, compiler intrinsics are pseudo-functions that expose CPU functionality that doesn't fit well into C/C++ constructs. Simple operations like add and subtract map nicely to + and -, but four-way-packed-multiply-signed-and-add-pairs doesn't. Instead, the compiler exposes a __m64 type and a _m_pmaddwd() pseudo-function that you can use. In theory, you get the power and speed of specialized CPU primitives, with some of the portability benefits of using C/C++ over straight assembly language. The problem in the past was that Visual Studio .NET 2003 and earlier generated poor code for these primitives that was either incorrect or slower than what could be written straight in assembly language with moderate effort.
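To make that concrete, here's a minimal sketch of what such code looks like (my own illustrative example, not from any shipping routine): a four-element dot product of 16-bit values built on the pmaddwd intrinsic mentioned above.

#include <mmintrin.h>

// Sketch: 4-element dot product of packed 16-bit values via pmaddwd.
int dot4(__m64 a, __m64 b) {
    __m64 sums = _m_pmaddwd(a, b);                      // two 32-bit partial sums
    sums = _mm_add_pi32(sums, _mm_srli_si64(sums, 32)); // fold the high sum into the low one
    int result = _mm_cvtsi64_si32(sums);
    _mm_empty();                                        // clear MMX state before returning
    return result;
}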
The good news
Here's the routine using SSE2 intrinsics that I used to punish the compiler last time I wrote about this problem:
#include <emmintrin.h>

unsigned premultiply_alpha(unsigned px) {
    __m128i px8      = _mm_cvtsi32_si128(px);                           // pixel into low 32 bits
    __m128i px16     = _mm_unpacklo_epi8(px8, _mm_setzero_si128());     // expand bytes to words
    __m128i alpha    = _mm_shufflelo_epi16(px16, 0xff);                 // broadcast alpha to all four channels
    __m128i result16 = _mm_srli_epi16(_mm_mullo_epi16(px16, alpha), 8); // multiply and scale back down
    return _mm_cvtsi128_si32(_mm_packus_epi16(result16, result16));     // pack back to bytes
}
Here's what Visual Studio .NET 2003 generates for this function:
pxor xmm0, xmm0
movdqa xmm1, xmm0
movd xmm0, ecx
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw xmm0, xmm1
psrlw xmm0, 8
movdqa xmm1, xmm0
packuswb xmm1, xmm0
movd eax, xmm1
ret
Note the unnecessary movdqa instructions; these are expensive on Pentium 4, where each one adds 6 clocks to your dependency chain.
Here's what Visual Studio .NET 2005 generates for this function:
pxor xmm1, xmm1
movd xmm0, ecx
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw xmm0, xmm1
psrlw xmm0, 8
packuswb xmm0, xmm0
movd eax, xmm0
ret
That's actually not too bad. Fairly good, even.
The bad news
The last time I did this test, I posted the following result for Visual Studio .NET 2003:
push ebp
mov ebp, esp
pxor xmm0, xmm0
movdqa xmm1, xmm0
movd xmm0, DWORD PTR _px$[ebp]
punpcklbw xmm0, xmm1
pshuflw xmm1, xmm0, 255
pmullw xmm0, xmm1
psrlw xmm0, 8
movdqa xmm1, xmm0
packuswb xmm1, xmm0
and esp, -16
movd eax, xmm1
mov esp, ebp
pop ebp
ret 0
The reason for the discrepancy is that I cheated in the tests above by using the /Gr compiler switch to force the __fastcall calling convention. Part of the problem with the VC++ intrinsics is that they have a habit of forcing an aligned stack frame whenever stack parameters are accessed in a function that uses intrinsics, even when nothing actually requires the alignment. This is unfortunate, as it slows down the prolog/epilog and eats an additional register. Sadly, this is not fixed in Whidbey, although it is a moot point on AMD64, where the stack is always 16-byte aligned. Using the __fastcall convention can fix this on x86 if all parameters can be passed in registers, but that isn't possible if you have more than 8 bytes of parameters.
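For a single function, the per-function equivalent of /Gr is just the __fastcall keyword on the declaration (a sketch of the workaround; the tests above used the global switch):

// With __fastcall, px arrives in ECX rather than on the stack, so there is
// no stack parameter access to trigger the forced aligned frame.
unsigned __fastcall premultiply_alpha(unsigned px);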
The other bad news is that the MMX intrinsics still produce awful code, although this is only pertinent to x86, since the AMD64 compiler doesn't support MMX instructions at all. At least the bugs that allowed MMX code to move past floating-point or EMMS instructions have been fixed:
pxor mm1, mm1
movd mm0, ecx
punpcklbw mm0, mm1
movq mm1, mm0
movq mm2, mm0
punpckhwd mm1, mm2
movq mm2, mm1
punpckhwd mm2, mm1
pmullw mm0, mm2
psrlw mm0, 8
movq mm1, mm0
packuswb mm1, mm0
movd eax, mm1
emms
ret
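For reference, here's roughly what an MMX intrinsics version of the routine looks like (my reconstruction of a plausible source, not necessarily the exact code behind the listing above); note the pair of unpacks needed just to splat alpha, since plain MMX has no pshufw:

#include <mmintrin.h>

unsigned premultiply_alpha_mmx(unsigned px) {
    __m64 px8    = _mm_cvtsi32_si64(px);
    __m64 px16   = _mm_unpacklo_pi8(px8, _mm_setzero_si64());   // expand bytes to words
    __m64 hi     = _mm_unpackhi_pi16(px16, px16);               // [r, r, a, a]
    __m64 alpha  = _mm_unpackhi_pi16(hi, hi);                   // [a, a, a, a]
    __m64 scaled = _mm_srli_pi16(_mm_mullo_pi16(px16, alpha), 8);
    unsigned result = (unsigned)_mm_cvtsi64_si32(_mm_packs_pu16(scaled, scaled));
    _mm_empty();                                                // emms
    return result;
}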
Conclusions
The aligned stack frame is a bummer for codelet libraries, but it isn't as big a deal if you can isolate intrinsics code into big, long-duration functions that aren't under critical register pressure. The improvements to SSE2 code generation make intrinsics more attractive in Whidbey, but since AMD64 isn't widespread yet and SSE2 is only supported on the Pentium 4, Pentium M, and Athlon 64, they're still unusable for mainstream code on x86. They're also rather difficult to read compared to assembly code. I still don't think I'd end up using them even after Whidbey ships, because it would make my x86 and AMD64 code bases diverge further without much gain.
Another problem is that although all SSE2 instructions are available through intrinsics, and many non-vector intrinsics have been added in Whidbey, there are still a large number of tricks that can only be done directly in assembly language, many of which involve extended-precision arithmetic and the carry flag. The one that I use all the time is the split 32:32 fixed-point accumulator, where two 32-bit registers hold the integer and fractional parts of a value. This is very frequently required in scaling and interpolation routines. The advantage is that you can get to the integer portion very quickly. In x86:
add ebx, esi                    ; add the fractional parts
adc ecx, edi                    ; add the integer parts, plus the carry out of the fraction
mov eax, dword ptr [ecx*4]      ; integer part is immediately usable as an index
In AMD64 you can sometimes get away with half the registers if you only need a 32-bit result, by swapping the low and high halves and wrapping the carry around:
add rbx, rcx                    ; fraction in the high half, integer in the low half
adc rbx, 0                      ; carry out of the fraction wraps around into the integer part
mov [rdx], ebx                  ; low 32 bits hold the integer result
Compiler intrinsics don't let you do this.
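For comparison, the closest portable equivalent keeps the whole accumulator in one 64-bit integer, which costs an extra shift per sample just to reach the integer part (a sketch of a simple point-sampling scaler, assuming C99/C++ and stdint.h):

#include <stdint.h>

// Sketch: 32:32 fixed-point stepping in plain C. The integer part lives in
// the high 32 bits, so every iteration pays a shift that the add/adc version
// above gets for free.
void scale_row(uint32_t *dst, const uint32_t *src, int n,
               uint64_t accum, uint64_t step) {
    for (int i = 0; i < n; ++i) {
        dst[i] = src[accum >> 32];  // shift down to recover the integer index
        accum += step;
    }
}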
Another problem I run into often in routines that are MMX or SSE2 heavy is a critical shortage of general purpose registers, usually for scanline pointers, fixed-point accumulators, and counters. The way I get around this on x86 is to make use of the Structured Exception Handling (SEH) chain to temporarily hold the stack pointer, freeing it for use as an eighth general purpose register:
push 0                          ; dummy handler field for a fake SEH record
push fs:dword ptr [0]           ; link to the previous record in the chain
mov fs:dword ptr [0], esp       ; chain head now points into the stack
...                             ; esp is free for use as a general purpose register
mov esp, fs:dword ptr [0]       ; recover the stack pointer from the chain head
pop fs:dword ptr [0]            ; restore the previous chain head
pop eax                         ; discard the dummy handler field
...and then be really careful not to cause an exception while within the block.
This allows a routine to use all eight registers and still be reentrant. It's probably unreasonable to expect a compiler to generate code like this, though.