Auto-vectorization in the Visual Studio 11 Express preview
Okay, it's actually the Microsoft Visual Studio 11 Express for Windows Developer Preview, but that's a ridiculously long name. I hope they call it something like vs11ew internally.
One thing I didn't expect to see in the VC11 compiler is auto-vectorization:
http://msdn.microsoft.com/en-us/library/dd547188%28v=VS.110%29.aspx#BKMK_VCPP
This attempts to produce vectorized code by analyzing your scalar loops. It isn't going to work miracles -- particularly given C/C++'s poor support for alignment -- and you'll still have to drop to intrinsics or assembly for the fastest code. The advantage of auto-vectorization, though, is that the compiler can still do it when you're lazy -- which is great when you're prototyping, and can help in code you can't afford to spend time on. As I've said before, I don't consider intrinsics to be very readable, and it's been a long time since I considered manual register allocation fun, so even though I wouldn't want to have to rely on auto-vectorization, I'm still in favor of it.
After doing some testing with the x86 compiler (17.00.40825.2), the first thing I can say is that, at least with this early implementation, you probably won't be relying on auto-vectorization for video or image processing code. I was not able to get the compiler to vectorize any code processing 8-bit or 16-bit integers. The only types I could get vectorized were 32-bit integers, 64-bit integers, floats, and doubles, which rules out a huge amount of decoding/encoding/filtering code. The target CPU needs to support SSE for floats and SSE2 for ints or doubles; however, the developer preview compiler is pretty broken and I was often able to get it to generate SSE or SSE4.1 instructions inappropriately. For now we'll overlook that and just look at the operations that it can vectorize. For ints, I was able to get these operations to vectorize (a sketch of the kind of loop that works follows the list):
- Addition (paddd/paddq), subtraction (psubd/psubq), 32-bit multiplication (pmulld)
- Bitwise and (pand), or (por), and xor (pxor)
- Left shift (pslld/psllq), unsigned right shift (psrld/psrlq), and signed right shift (psrad/psraq) by a constant
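For concreteness, here's a minimal sketch of the sort of loop the preview compiler will and won't touch -- the function names are just for illustration. The 32-bit integer loop maps straight onto paddd, while the otherwise identical 8-bit loop stays scalar:

// Vectorizes in the VC11 preview: 32-bit integer addition maps onto paddd.
void add_int32(int *dst, const int *a, const int *b, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

// Does not vectorize in the preview: 8-bit elements aren't supported,
// even though SSE2 has a byte add (paddb).
void add_uint8(unsigned char *dst, const unsigned char *a,
               const unsigned char *b, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = (unsigned char)(a[i] + b[i]);
}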
64-bit ints don't work very well -- x+y vectorizes while x+1 doesn't. Inversion (~) didn't work, and surprisingly, neither did negation (unary minus), so 0-x runs better than -x. The most disappointing omission is that neither conditionals nor relationals vectorize, so writing branchless, mask-based code isn't possible. I couldn't get min/max or masked writes out of it, either -- the kind of pattern I mean is sketched below.
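To illustrate what branchless, mask-based code means here, this is the kind of scalar idiom that would map naturally onto pcmpgtd/pand/pandn but that stayed scalar in my tests (the function name is made up):

// Branchless max(x, limit): the relational produces an all-ones/all-zeros
// mask, which then selects between the two values without a branch.
// Loops like this did not vectorize in the preview compiler.
void clamp_min(int *dst, const int *src, int limit, int n) {
    for (int i = 0; i < n; ++i) {
        int x = src[i];
        int mask = -(int)(x < limit);          // all ones if x < limit, else zero
        dst[i] = (limit & mask) | (x & ~mask); // limit if x < limit, else x
    }
}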
For floats, more operations are supported:
- Addition (addps), subtraction (subps), multiplication (mulps), division (divps)
- Square root (sqrtps)
- Reciprocal square root (1/sqrtf(x) -> rsqrtps + refinement) (!)
- Cast to and from int
Unary minus, fmodf(), fabsf(), transcendentals, min/max, and relational ops failed. I got float-to-unsigned casts to vectorize, but the generated code was wrong (it truncated all values above 2^31). The auto-vectorizer is thus more capable with floats, but there are still noticeable holes in operation support.
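The reciprocal square root case is the most interesting one, since it means the compiler recognizes the idiom and substitutes the approximate instruction plus a refinement step. A minimal sketch of a loop that triggers it -- the function name is made up, and the exact floating-point switches (/fp:fast vs. /fp:precise) may matter:

#include <math.h>

// The 1/sqrtf(x) idiom compiles to rsqrtps plus a refinement step in the
// preview, rather than a full sqrtps/divps sequence.
void reciprocal_sqrt(float *dst, const float *src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = 1.0f / sqrtf(src[i]);
}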
Another issue with the current auto-vectorization implementation is that it universally emits unaligned loads and stores (movups/movdqu). I tried copying to a local array with forced alignment, but even that wasn't enough to get movaps. That's an easy win for intrinsics/asm over the auto-vectorizer, unfortunately. It does, however, emit aliasing-tolerant code: it checks whether the destination and source arrays overlap and branches to either vectorized or unrolled code depending on the result. __restrict wasn't effective in removing the check.
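For reference, these are the kinds of things I mean by forced alignment and __restrict -- a sketch of the attempts, not the exact code I tested:

// Copying through a 16-byte-aligned local buffer (assume n <= 1024 here):
// the vectorizer still emits movups/movdqu for the buffer accesses
// instead of movaps.
void scale_via_aligned_temp(float *dst, const float *src, int n) {
    __declspec(align(16)) float tmp[1024];
    for (int i = 0; i < n; ++i)
        tmp[i] = src[i] * 2.0f;
    for (int i = 0; i < n; ++i)
        dst[i] = tmp[i];
}

// __restrict promises the arrays don't overlap, but the generated code
// still contains the runtime overlap check and the non-vectorized fallback.
void scale(float * __restrict dst, const float * __restrict src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}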
The third problem with the auto-vectorizer is that currently you can't turn it off by itself, only by reducing the global optimization level. This means a significant amount of code bloat with full optimization even if the vectorized code will never run (cases of guaranteed partial overlap). It also makes the developer preview a bit fragile, since you can't easily escape the code generation bugs in the vectorizer. Hopefully there will be ways to control the auto-vectorizer the way the inliner can be controlled (a command-line switch plus pragmas).
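In the meantime, the only escape hatch seems to be dropping the optimization level around the offending code, which is a blunt instrument. A minimal sketch, assuming the vectorizer is tied to the regular optimizer like the other /O2 passes:

// Workaround sketch: no dedicated switch exists in the preview, so the
// per-function option is to disable optimization entirely around code that
// hits a vectorizer bug -- losing every other optimization along with it.
#pragma optimize("", off)
void function_that_hits_a_vectorizer_bug(int *dst, const int *src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] + 1;
}
#pragma optimize("", on)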
Anyway, it'll be interesting to see how this evolves. Ever since Visual Studio .NET 2002, my general rule has been to assume that everything in a public Visual Studio beta will ship as-is unless it's already known to be changing, enough people complain about it, or it's clearly a showstopper. The level of codegen bugs in this compiler version is a lot higher than usual, though, so I have to assume this is earlier in the development cycle (or else the compiler team is in trouble!).