Blog Archive

SIMD intrinsics, revisited

In the previous entry, a commenter asked if SIMD intrinsics are worthwhile in VS2008.

Truth be told, I didn't try them, because Microsoft only has a skeleton crew (person?) on the C++ compiler for VS2008, and they're not even taking most bug fixes, much less a feature addition or optimization like improving SIMD code generation. The rest of the compiler team is busy rewriting the compiler for Orcas+N. As such, I don't really expect any change in intrinsics compared to VS2005 SP1, which in turn is just VS2005 RTM + some new kernel mode intrinsics. I do have some experience working with intrinsics in other venues, though, so I can at least tell you my experiences with VS2005.

The first problem you'll run into with intrinsics is that they require alignment. If you construct all of your SIMD code to use unaligned loads and stores, your performance will be pathetic. Heap allocations on Win32 are only guaranteed 8-byte alignment, though, so you need to use _aligned_malloc(), with its associated space penalty, or switch to a custom allocator. The compiler does handle alignment of sub-objects for you, and in theory it does so for stack objects as well, but in my experience VC8 is buggy about returning properly aligned stack objects and frequently gets it wrong. Fortunately, x86 gives you a clear exception when this occurs; some platforms instead helpfully "align" the pointer for you by zeroing the low bits of the address, which leads to some nice heap corruption bugs. If you're interoperating with .NET, you're in for some annoyance, because the CLR knows jack about alignment. STL can also give you problems if its allocators aren't alignment-savvy; I think VC8's implementation might be problematic in this regard.

The second problem is MMX, or more specifically, the prohibition on mixing x87 and MMX. This isn't a performance issue -- you will actually get incorrect results if you mix the two without appropriate [F]EMMS instructions, because the FPU will start spitting out NaNs when it notices its register stack is full. VC7 had some severe bugs where the optimizer rearranged floating point calculations around _mm_empty() or __asm { emms } statements, which made it nearly impossible to safely use MMX intrinsics. I think these were fixed in VC8, but then you have the problem of when to do it. The last thing you want to do is call EMMS at the end of each and every function in a library, because performance will be dreadful, and trying to document which ones use MMX and forcing the client to figure out where to put the barriers is really bad too. And if you think MMX is dead, consider that unless you have SSE2, it's really hard to handle integers efficiently, even if you just want to convert them to and from floats (well, unless you only want to do one 32-bit integer at a time).

The third problem is the ABI. More specifically, the x86 ABI wasn't designed with SIMD in mind, so it has none of the features that would help. The stack isn't aligned, so the compiler has to generate code to create an aligned stack frame -- although I've heard that LTCG can help in this regard by eliminating this in nested calls. Perhaps more annoying is that there is no convention for preserving SSE registers or passing floats in SSE registers, so the compiler tends to bounce values out to memory and possibly through the x87 stack, even if /arch:SSE is used. This is especially distressing if you're writing a math library -- which you would think is a natural use for SSE intrinsics -- until you discover that the vector and float portions of the compiler don't talk to each other very well.

The fourth problem that I have with VC's intrinsics is that I sometimes find them harder to use -- x = _m_paddw(x, y) isn't much better than PADDW x, y, and I find the _mm_add_epi32() style particularly ugly. I've seen intrinsics code that looked like it was just translated line-by-line from assembly code, which basically just meant it was slower and uglier. They get more usable if you wrap them in operators, but then you end up with lots of function calls that impede debugging and make your debug builds suck. And isn't it supposed to be the compiler's job to wrap instructions in a higher level form??

I should note that the x64 versions of Windows avoid a number of these issues, as the platform is guaranteed to support SSE2 and the ABI was designed with that in mind. However, with x64 being very poorly supported and Microsoft trying its best to drive it into the ground with stupidity like the signed driver requirement in Vista x64, I've almost written it off entirely.

Truth be told, I'd love to ditch assembly and use intrinsics, but I find it hard to tolerate these flaws. SIMD makes the most difference in code that is performance critical, and that means it's also the code that can least tolerate flaws in the compiler's output. I also tend to run into non-SIMD issues whenever I consider the switch, because there are a lot of missing scalar intrinsics. For instance, in a lot of my scaling code I use 32:32 fixed point, where the 32-bit halves are joined by the carry flag and thus I can use the upper half directly without needing shift ops. C++ doesn't have support for the carry flag, and VC++'s __int64 code generation sucks (why would you change <<32 into *2^32?). Extended precision arithmetic is also very difficult to do with the provided intrinsics, to the point that I had to write a silly three-line assembly routine in an .asm file just to do MulDiv64() on x64. It seems like any new scalar intrinsics are being added just for the NT kernel team and not really for anyone else -- the new intrinsics that were added in VS2005 SP1, for instance, are essentially useless in user mode.

As a side note, when I tried Intel C++ 6.0, it did generate very nicely optimized MMX code, but it also bloated code by about 30%. In the end, I gave up supporting it because I was tired of tracking down compiler-induced bugs like thrown exception objects being destroyed twice and misgenerated STL code. I haven't tried GCC yet... its code generation would probably land somewhere between VC++ and Intel C++, and probably more stably than Intel C++. Sadly, it's also hands-down the most annoying compiler on the planet.


This blog was originally open for comments when this entry was first posted, but comments were later closed and eventually removed, due to spam and a migration away from the original blog software. Unfortunately, reformatting the comments for republication would have been a lot of work. The author thanks everyone who posted comments and added to the discussion.