¶Troubles with _mm_loadl_epi64()
Alright, who was the dork who designed this SSE2 compiler intrinsic:
__m128i _mm_loadl_epi64(__m128i const *p);
What does this intrinsic do? It loads a 64-bit integer value from memory and stores it into the low 64 bits of an XMM register, zeroing the upper 64 bits. It's the compiler intrinsic version of the MOVQ instruction. MOVQ is fairly important for image processing routines in SSE2 for a couple of reasons: it's very convenient to process 64 bits of data at a time, since eight 8-bit samples can be loaded and expanded to 16-bit words in a 128-bit register, and unlike 128-bit memory accesses, 64-bit memory accesses can be misaligned.
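To make the typical usage concrete, here's a minimal sketch -- not the IDCT code itself, and the helper name is just for illustration -- of the pattern: load eight samples with MOVQ, then widen them to words:
#include <emmintrin.h>
// Load eight 8-bit samples via MOVQ and zero-extend them to 16-bit words,
// giving eight lanes with enough headroom for multiplies and adds.
static __m128i load_8_samples_as_words(const unsigned char *src) {
    __m128i lo = _mm_loadl_epi64((const __m128i *)src); // MOVQ: 8 bytes into the low half, upper half zeroed
    return _mm_unpacklo_epi8(lo, _mm_setzero_si128());  // interleave with zero -> 8 x 16-bit words
}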
Anyway, I ran into this while porting my old AP-922 based IDCT routine to intrinsics in order to recompute the constants according to a tip I'd found in a whitepaper (folding the column rounding into the row pass, genetic algorithm to tune... don't ask). I figured, hey... maybe I'll try intrinsics again... couldn't hurt, right? Visual C++ tends not to do well with MMX intrinsics -- as in, it misgenerates code -- so I first emulated the MMX instructions with scalar code. When that worked, I tried rewriting the wrappers with SSE2 for speed.
Only to have the routine utterly and completely blow up.
Turns out that VS2005 SP1 has a bad bug with the above intrinsic that causes it to horribly screw up code generation -- the compiler actually generated all of the MOVQ instructions backwards, leading to code like this:
paddd xmm1,xmm2
movq xmm1,xmm3 ;should be xmm3, xmm1
movq xmm1,mmword ptr [eax+8]
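For reference, the source side is nothing exotic. A made-up fragment along these lines -- not the actual repro I submitted -- is enough to get both the load and the register-to-register forms of MOVQ involved:
#include <emmintrin.h>
// _mm_loadl_epi64() should become a MOVQ load from memory, and
// _mm_move_epi64() a register-to-register MOVQ that zeroes the upper half.
static __m128i repro_sketch(const __m128i *src, __m128i acc) {
    acc = _mm_add_epi32(acc, _mm_loadl_epi64(src)); // paddd + movq xmm, mmword ptr [...]
    return _mm_move_epi64(acc);                     // movq xmm, xmm
}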
I submitted a bug on this, and of course, the response I got back was of the form: "we're not planning on fixing this for VS2005, but hey, you can buy VS2008." Thanks, but no, I'm not going to fork out hundreds of dollars for a release that gives me little except bug fixes. Oh, and while the code it generated was at least correct, VS2008 still produced lots of useless fragments like this:
movdqa xmm1,xmm1
movdqa xmm7,xmm7
movdqa xmm6,xmm6
Now, after having spent hours diagnosing and futilely trying to work around this bug in VS2005, I finally got tired of the state of Microsoft's code generation and decided to try GCC, which reportedly has much better support for vector intrinsics. So, after a lot of searching around to find a Win32 build of GCC 4 that I could use without installing half of Cygwin, and a bit of swearing to get all of the environment variables and paths set up, I got the app compiled under GCC 4.
Then I built it under -O3, and watched it output garbage.
It turns out that the _mm_loadl_epi64() intrinsic is evil for another reason. The MOVQ xmmreg, m64 instruction loads a 64-bit word, aligned or unaligned. Note, however, that the intrinsic takes a __m128i const * pointer, which would normally imply an aligned 128-bit memory location. In order to use this intrinsic as intended, you have to cast the pointer, and it turns out that doing so runs afoul of C++'s type aliasing rules, which in turn causes GCC 4 to generate broken code. When I inspected the disassembly, I eventually figured out that an entire multiply chain had disappeared because of this. Working around it properly is a huge pain in the butt (a union), so I just used -fno-strict-aliasing to get the code working again. In the end, it didn't matter much either way: although Visual C++ loves orgies with register-to-register moves, GCC ended up wasting a bunch of cycles on memory spills and movq2dq instructions, apparently unable to realize that only the low 64 bits were ever used. The code didn't run any faster.
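For what it's worth, the cleaner way around the aliasing problem looks something like this -- a sketch, not what the routine actually ended up using -- since an 8-byte memcpy() is well-defined, and compilers that recognize small memcpy calls should still boil it down to a single MOVQ:
#include <emmintrin.h>
#include <string.h>
// Aliasing-safe replacement for the casted _mm_loadl_epi64(): copy the bytes
// into the low half of a zeroed __m128i instead of dereferencing a punned pointer.
static __m128i load_epi64_safe(const void *p) {
    __m128i v = _mm_setzero_si128();
    memcpy(&v, p, 8);  // low 64 bits come from memory, upper 64 bits stay zero
    return v;
}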
_mm_storel_epi64() has similar problems, by the way.
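The same trick covers the store side -- again just a sketch, assuming the same headers as the load above:
// Aliasing-safe counterpart to _mm_storel_epi64(): write only the low 64 bits.
static void store_epi64_safe(void *p, __m128i v) {
    memcpy(p, &v, 8);  // the first 8 bytes of an __m128i are the low 64 bits on x86
}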
Epic debugging sessions like this are the reason I get so annoyed when people tell me I should stop writing assembly and switch to intrinsics. The fact is that every single time I've tried to use intrinsics I've gotten burnt, without exception -- even when they eventually did what I wanted and didn't end up slower than the original scalar code.