Compiler intrinsics... again

You know that episode of The Simpsons where Bart reaches for the electrified cookie jar and goes "ow," and then just keeps doing it again and again? Yeah, I'm like that with compiler intrinsics.

Let's take a simple routine:

__m128i fold1(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(mask, _mm_srli_epi16(x, 1)), _mm_and_si128(mask, x));
}

This is one step of a population count routine, which folds pairs of bits together into two-bit counts. (Yeah, I know this can be done better with subtraction, but popcount isn't the subject here.) Run this through VC10, and you get this:

movdqa      xmm1,xmmword ptr [__xmm@0]
movdqa      xmm2,xmm0
movdqa      xmm0,xmm1
movdqa      xmm3,xmm2
psrlw       xmm3,1
pand        xmm0,xmm3
pand        xmm1,xmm2
paddw       xmm0,xmm1
ret

Unnecessary moves blah blah blah... you've heard it here before. Then again, let's take a closer look. Why did the compiler emit the MOVDQA XMM3, XMM2 instruction? Hmm, it's because it did the shift next, but it still needed to keep "x" around for the second operation. And how about that PAND that follows? Well, it couldn't modify "mask," so it copied that too. Waaaiit a minute, it's just doing everything exactly the way I told it. That might be OK if x86 used three-argument form instructions, but since x86 is two-argument, that kinda sucks. What if we rewrote the routine this way:

__m128i fold2(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(x, 1), mask), _mm_and_si128(mask, x));
}

movdqa      xmm1,xmmword ptr [__xmm@0]
movdqa      xmm2,xmm0
psrlw       xmm0,1
pand        xmm0,xmm1
pand        xmm1,xmm2
paddw       xmm0,xmm1
ret

Well, that looks a bit better. It appears that Visual C++ is unable to take advantage of the fact that the binary operations used here are commutative, which means that the efficiency of the code generated can differ significantly based on the order of the arguments even though the result is the same. The upside is that you can swap around arguments to get better code; the downside is that you're doing what the code generator should be doing. Interestingly, based on some experiments it looks like the code generator can do this for scalar operations, so something didn't get hooked up or extended to the intrinsics portion.

Anyway, if you've got extra moves showing up in the disassembly when using intrinsics, try shaking the expression tree a bit and see if some of the moves fall out.

Comments

This blog was originally open for comments when this entry was first posted, but the comments were later closed due to spam and then removed during a migration away from the original blog software. Unfortunately, it would have been a lot of work to reformat the comments to republish them. The author thanks everyone who posted comments and added to the discussion.