¶SSE4.1, first impressions
Trimbo's been bugging me about SSE4.1 lately and my experiences with it.
Well, I've just starting playing around with it, now that I have the tools and build set up, and my experience has been mixed.
The main problem is alignment. To make good use of a Core 2+ and of SSE4.1, you have to go from mmx (64-bit) to xmm (128-bit) registers. The annoyance that comes along with this is that while 64-bit memory accesses can be unaligned, all 128-bit accesses on x86 currently have to be aligned or you fault. That means it isn't as trivial as just taking a 64-bit loop and processing pairs of pixels at a time. It's true that misaligned loads hurt performance, but there are two mitigating factors in 64-bit vector code. One is that on modern CPUs, not all misaligned accesses cause a penalty -- only ones that cross L1 cache line boundaries and trigger a DCU cache split do. This means that for sequential accesses only a quarter or less of your loads fault, which may be acceptable if the cost of avoiding the misalignment is higher. The other factor is that when processing pixels, it's common to have to expand 8-bit unsigned channels to 16-bit signed, which means that the loads are frequently 32-bit and thus may always be aligned. Going to 128-bit vectors and 64-bit loads spoils this. What this all means is that several algorithms that I looked at for rewriting in SSE4.1 looked great until I examined the load/store paths and realized that I was going to burn more time in dealing with misalignment than I would save in the faster ALU operations.
You might say that memory buffers should just be aligned, and yes, you can do that to some extent, particularly with temporary buffers. The gotcha is that you don't always control the buffers involved and at some point you simply can't control the alignment. For example, display buffers probably aren't guaranteed to be 16 byte aligned, nor would a GDI DIB. Similar problems occur at the ends of scanlines, related to the width of the image; it's lame and inflexible to just have your library require that all working images be multiples of 4/8/16 pixels. Working with .NET? Oh, sorry, the GC heap doesn't support alignment at all -- screwed. The compromise that I generally shoot for is to get a routine that will work with odd widths or non-optimal alignment, although it might be a little slower due to fixup.
There are also cases that simply don't scale to larger vectors. For example, in a scaling or rotation algorithm, I might need to pull 32-bit pixels at various locations and expand them to 64-bit for processing. What am I going to do with 128-bit vectors? I can't pull pairs of pixels from each location, because I only need one. I could process more pixels in parallel, except that I only have eight registers and having four source pointers is hard enough as it is. It's more doable in long mode with 16 GPRs, but I haven't even gotten to that yet.
In terms of instruction selection, the SSE4.1 instruction that seems most useful so far is PMOVZXBW, because it replaces the load/move+PUNPCKLBW that commonly use without eating an extra register for zero. PBLENDW is also looking useful for some alignment scenarios. Other than that, most of the other instructions I think I can abuse are actually from SSSE3, because I'm not that interested in the floating point part of SSE4.1 for VirtualDub. In SSSE3, PHADDD (packed horizontal add doubleword) is turning out quite useful, because I frequently do high precision integer dot products and that means PMADDWD followed by a horizontal add. PSHUFB is also promising, especially given the high speed on SSE4.1-capable CPUs, but that it requires the shuffle pattern to be in memory or in a register and that it works in-place are annoying. PALIGNR looks useful but often requires unrolling due to the immediate argument.
The 8-bit resamplers -- which are used in 1.8.0 when using the "resize" filter with YCbCr video -- got about 20-25% faster with SSE4.1 in my first attempt compared to the MMX version. Unfortunately, I don't know how much this is due to SSE4.1, or just due to the move to 128-bit vectors, since Core 2 is twice as fast at those and Enhanced Core 2 is even faster. I have some ideas for abusing PBLENDW and PSHUFB as well for optimizing conversion between RGB24 and RGB32, but the alignment issue is a bear. I've also been thinking about whether I can speed up the RGB<->YCbCr converters, but PMADDUBSW is the most promising there and the coefficient precision there would be marginal. I also got the idea to abuse MPSADBW for a fast box filter, although the fixed radius would be a bit restrictive and it'd only help horizontally, and I'm not sure what I would use it for.
Overall, I'm not seeing a revolution here compared to what I was doing with MMX and SSE2, but it is a bit nicer overall -- I'm not spending as much time doing register-to-register moves and unpacks as I did. My guess is that if you already have a CPU that's at least SSSE3 capable (Core 2), then you're already going to get most of the benefits instruction-wise, with the difference you're missing being in the microarchitecture and not in the lack of SSE4.1. I'm also beginning to see some strengths and weaknesses of SSSE3/SSE4.1 against AMD's SSE5, at least for image processing. The data movement capabilities of SSSE3/SSE4.1 look superior, but SSE5 has some really compelling ALU operations: PMADCSSWD (packed multiply, add and accumulate signed word to signed doubleword with saturation) looks perfect for what I do. The main question is how fast AMD can get it. I'd heard that the fused-multiply-add unit in Altivec-capable PPC chips was a problem in terms of gating clock speed and the solution was to pipeline it to the point that it became less compelling due to latency; we'll see what happens with SSE5.