Taking a look at D3D10.1's WARP driver
I recently went through the exercise of writing a basic Direct3D 10.1 display backend for VirtualDub. The primary motivation was to take advantage of Direct3D 10.1 command remoting... until I realized that the DirectX SDK I was using was a bit old and its documentation didn't mention that D3D10.1 command remoting had been removed in Windows 7 RTM. I did get it working in windowed mode, however, and since I had a working D3D10.1 path I figured I might as well check out the WARP driver.
WARP, or Windows Advanced Rasterization Platform, is a software driver that ships with the Direct3D 11 runtime. As far as I know, it's the first widely available and full-featured software renderer that Microsoft has shipped. The DirectX SDK has long shipped with the reference rasterizer (refrast), but that has several shortcomings: it's not redistributable, it can't be instantiated in a headless environment, and it's so abysmally slow that it barely works for debugging, much less running anything. Microsoft also created RGBRast for DirectX 9, which .NET 3.5 used as a software fallback, but AFAIK it doesn't support shaders and is pretty minimal. The OpenGL software rasterizer works but is pretty slow and lacking in features. WPF has its own software rasterizer that I've written about before and isn't too bad, but it only does pixel shaders on rectangular blits and is internal to WPF. Now we have WARP, which is fully featured, fast, and widely available.
Having used WARP a little bit, I can tell you that you won't be ditching your 3D graphics card anytime soon. When I say WARP is fast, I mean it's fast by software rasterizer standards, which means it might beat an S3 ViRGE. It's still very slow compared to any modern graphics accelerator, even one with "Integrated" in its name, and I get dropped frames drawing one 1440x900 full screen quad on an i5-2500K. That's even before you take into account that even to get that level of performance you have to give up a lot of CPU power that could be used for something else. The main benefit of WARP is that programs can now use 3D rendering without worrying about being 100% screwed in the unusual case where no 3D hardware acceleration whatsoever is available. Considering the difficulty of writing a general 3D software driver, that's a big benefit.
With that out of the way, let's look at the details: what code does WARP actually generate?
I haven't done much with WARP, and VirtualDub's display code is extremely undemanding in the 3D features that it uses. However, it's a good start for seeing how WARP works for non-game applications that are basically looking for a good blitter. We'll use this HLSL effect, which is just meant to draw a quad on screen:
extern Texture2D<float4> srct : register(t0);
extern SamplerState srcs : register(s0);
void VS(
float2 pos : POSITION,
out float4 oPos : SV_Position,
out float2 oT0 : TEXCOORD0)
{
oPos = float4(pos * float2(2, -2) + float2(-1, 1), 0, 1);
oT0 = pos;
}
float4 PS(float4 pos : SV_Position, float2 t0 : TEXCOORD0) : SV_Target {
return srct.Sample(srcs, t0).bgra;
}
We're drawing a whopping four vertices on the screen, so the vertex shader is basically irrelevant for performance, but the pixel shader will be used much more heavily and is what we are interested in. Any modern software rasterizer is going to make use of a just-in-time (JIT) compiler for generating the inner rasterization loop, and WARP is no exception. Since WARP is going to be chewing up gobs of CPU time and runs in-process, finding the rasterization loop is not a problem: just break execution in the debugger.
Examining the loop
Due to vectorization, the rasterization loop is very large, so I've put the disassembly at the end of this post instead of here, but let's start by walking through the general layout. This was generated on a CPU with AVX; WARP takes advantage of SSE4.1 but not AVX.
WARP makes use of the SSE2 instruction set and register file, and as such has 4x parallelism available for most operations. One way to take advantage of this is to process all four channels of a pixel in parallel (RGBA), but that's slow for a number of operations that require cross-channel interactions, such as swizzles and dot products. The alternate strategy is to process four pixels at a time with each 4-vector holding four scalar values for each pixel. This reduces shuffling bottlenecks and makes code optimization much easier as the intermediate code can just be treated as scalar, with the downside being increased register pressure from having four times as many pixels in flight. This is the strategy that WPF's rasterizer uses, and it's also the one I picked for my vdshader JITter. WARP's situation is a little bit different because it needs to handle mipmapping and gradient determination, so while it also does four pixels at a time, it does 2x2 quads instead of four pixels horizontally. Most stages just treat them as individual pixels, so this doesn't matter except for a few specific stages like interpolation and output.
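To make the four-pixels-at-a-time layout concrete, here is a minimal scalar C++ sketch of the idea. It is not WARP's code: the type and function names are made up for illustration, the lane order within the quad is an assumption, and each `std::array<float, 4>` simply stands in for one SSE register.

```cpp
#include <array>

// One 4-wide "register": lane i belongs to pixel i of a 2x2 quad.
// Assumed lane order for this sketch: 0 = top-left, 1 = top-right,
// 2 = bottom-left, 3 = bottom-right.
using Lanes = std::array<float, 4>;

// Structure-of-arrays color: one register per channel, four pixels each.
struct QuadColor {
    Lanes r, g, b, a;
};

// Per-pixel dot product of RGB with constant weights. Every step is an
// independent lane-wise multiply or add, so no cross-lane shuffles are needed.
Lanes dot3(const QuadColor& c, float wr, float wg, float wb) {
    Lanes out{};
    for (int i = 0; i < 4; ++i)
        out[i] = c.r[i] * wr + c.g[i] * wg + c.b[i] * wb;
    return out;
}

// In this layout a .bgra swizzle is just a renaming of registers.
QuadColor swizzleBgra(const QuadColor& c) {
    return QuadColor{c.b, c.g, c.r, c.a};
}

// Horizontal gradient within the quad: right pixel minus left pixel in each
// row. Having both rows and both columns in one vector is what makes a 2x2
// arrangement convenient for gradient-based mip selection.
Lanes ddx(const Lanes& v) {
    const float top = v[1] - v[0];
    const float bottom = v[3] - v[2];
    return Lanes{top, top, bottom, bottom};
}
```

With the channels-in-one-register layout, the dot product would need horizontal adds and the swizzle would need a shuffle; here both fall out for free, at the cost of keeping four registers live per color value.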
Unlike WPF's rasterizer, WARP has to handle arbitrary triangles and thus it supports perspective correction. For the pixel shader above, this involves stepping three interpolated values -- u/w, v/w, and 1/w -- and then computing the reciprocal w = 1/(1/w) so that u and v can be recovered. I was disappointed to find that WARP does not optimize either the divide itself or for screen-aligned polygons:
movaps xmm3, one ;1.0
divps xmm3, xmm1 ;w = 1 / (1/w); xmm1 holds 1/w here
mulps xmm0, xmm3 ;u = (u/w) * w
mulps xmm1, xmm3 ;v = (v/w) * w; xmm1 was reloaded with v/w in between
The perspective divide is done with a straight divide instruction instead of reciprocal estimation and refinement. I understand WARP was designed to be general and accurate, so approximations may not have been accurate enough, and in any case with a non-trivial shader this is not going to matter much. In this case, though, it's unnecessary as w is a constant for 2D operations. This means a couple dozen instructions of overhead that 2D applications don't need.
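For reference, this is roughly what "reciprocal estimation and refinement" means. The sketch below is portable scalar C++, not anything WARP does: `roughReciprocal` is a hypothetical stand-in for a hardware estimate instruction such as RCPPS (it fakes the low precision by throwing away mantissa bits), and the other names are likewise made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware reciprocal estimate: compute 1/x, then
// clear the low 12 mantissa bits so only about 12 significant bits remain.
float roughReciprocal(float x) {
    float r = 1.0f / x;
    std::uint32_t bits;
    std::memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFFF000u;
    std::memcpy(&r, &bits, sizeof r);
    return r;
}

// One Newton-Raphson step for r = 1/d: r' = r * (2 - d * r).
// Each step roughly doubles the number of correct bits.
float refineReciprocal(float d, float estimate) {
    return estimate * (2.0f - d * estimate);
}

struct UV {
    float u, v;
};

// Recover u and v from the three interpolated values u/w, v/w and 1/w.
UV perspectiveCorrect(float uOverW, float vOverW, float oneOverW) {
    const float w = refineReciprocal(oneOverW, roughReciprocal(oneOverW));
    return UV{uOverW * w, vOverW * w};
}
```

The refinement costs two multiplies and a subtract per step, which is usually cheaper than a full-precision divide but does not give a correctly rounded result. For a screen-aligned quad, where 1/w is the same at every pixel, the whole computation could in principle be hoisted out of the loop.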
The next section is the address setup for the texture fetch. In this case, mipmapping is disabled and the U/V addressing modes are set to CLAMP. What WARP does here is tidy up the texture coordinates in floating-point (SSE), then switch to integer (SSE2) to continue computing the addresses in parallel. Since bilinear filtering is enabled, a 2x2 block of pixels has to be fetched. The tricky part about this is that the 2x2 block can extend outside of the texture by one pixel even if the texture coordinate is already clamped or wrapped to 0-1. In vdshader, I handled this by adding borders to the texture storage and copying pixels into the borders beforehand so that a 2x2 block of pixels can be fetched with address offsets; WARP eschews this and instead computes 16 addresses, four per pixel. It then extracts them one at a time with PEXTRD and merges pixels back into vectors with PINSRD.
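Here is a scalar C++ sketch of the per-pixel address setup for a clamped bilinear fetch, as I read it from the description above. The struct, the function name, the 8-bit fraction width, and the half-texel offset are assumptions for illustration rather than a transcription of WARP's code.

```cpp
#include <algorithm>
#include <cmath>

struct BilinearTaps {
    int index[4];  // texel indices: top-left, top-right, bottom-left, bottom-right
    int fracX;     // horizontal blend weight, 0..255
    int fracY;     // vertical blend weight, 0..255
};

// Compute the four texel indices and blend weights for one sample position.
// pitchInTexels is the distance between rows, measured in texels.
BilinearTaps computeTaps(float u, float v, int width, int height, int pitchInTexels) {
    // CLAMP addressing: pin the coordinate to 0..1 first.
    u = std::min(std::max(u, 0.0f), 1.0f);
    v = std::min(std::max(v, 0.0f), 1.0f);

    // Move to texel space; texel centers sit at half-integer positions.
    const float px = u * static_cast<float>(width) - 0.5f;
    const float py = v * static_cast<float>(height) - 0.5f;
    const float floorX = std::floor(px);
    const float floorY = std::floor(py);

    BilinearTaps taps{};
    taps.fracX = static_cast<int>((px - floorX) * 256.0f);
    taps.fracY = static_cast<int>((py - floorY) * 256.0f);

    // The 2x2 footprint can hang one texel over the edge even though the
    // coordinate was clamped, so each of the four coordinates is clamped
    // separately instead of relying on a border around the texture.
    const int x0 = std::min(std::max(static_cast<int>(floorX), 0), width - 1);
    const int x1 = std::min(std::max(static_cast<int>(floorX) + 1, 0), width - 1);
    const int y0 = std::min(std::max(static_cast<int>(floorY), 0), height - 1);
    const int y1 = std::min(std::max(static_cast<int>(floorY) + 1, 0), height - 1);

    taps.index[0] = y0 * pitchInTexels + x0;
    taps.index[1] = y0 * pitchInTexels + x1;
    taps.index[2] = y1 * pitchInTexels + x0;
    taps.index[3] = y1 * pitchInTexels + x1;
    return taps;
}
```

Doing this for four pixels at once gives the 16 addresses mentioned above. The bordered-texture alternative needs only one address per pixel plus fixed offsets, at the cost of maintaining the border.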
Afterward is the bilinear filtering, followed by the .bgra swizzle in the pixel shader. (D3D10 wants RGBA textures, so this swizzle is to read a BGRA image that has been aliased into that format.) WARP does the bilinear filtering in integer math by splitting into red/blue and green/alpha pairs. What's surprising is that WARP doesn't then convert the 32-bit pixels into floats -- it instead keeps them packed and does the swizzle in integer math. Chances are this only works for really simple shaders, but this avoids expensive unpacking and conversion to floats followed by conversion and packing back to bytes. It's too bad that WARP didn't take advantage of PSHUFB to do this; it seems that an oddity in the bilinear filtering causes one of the channels to be offset by 1 bit and thus the generated code uses a shift orgy instead to get everything in order.
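The red/blue and green/alpha split is a common trick for filtering packed 8-bit pixels without unpacking them to floats. Below is a minimal scalar C++ sketch of the idea on a single 32-bit pixel. It uses the textbook weighted-sum formulation rather than WARP's exact arithmetic, and the function names are made up for illustration.

```cpp
#include <cstdint>

// Blend two packed 8:8:8:8 pixels. frac is 0..256, where 0 returns a and 256
// returns b. Bytes 0 and 2 are processed together in one 32-bit value, bytes 1
// and 3 in another, so each channel has 16 bits of headroom for the multiply.
std::uint32_t lerpPacked(std::uint32_t a, std::uint32_t b, std::uint32_t frac) {
    const std::uint32_t kPairMask = 0x00FF00FFu;
    const std::uint32_t inv = 256u - frac;
    // Each 16-bit lane holds at most 255 * 256, so lanes cannot overflow into
    // each other.
    const std::uint32_t even =
        (((a & kPairMask) * inv + (b & kPairMask) * frac) >> 8) & kPairMask;
    const std::uint32_t odd =
        (((a >> 8) & kPairMask) * inv + ((b >> 8) & kPairMask) * frac) & ~kPairMask;
    return even | odd;
}

// Bilinear filter of a 2x2 block: two horizontal blends, then one vertical.
std::uint32_t bilinearPacked(std::uint32_t p00, std::uint32_t p01,
                             std::uint32_t p10, std::uint32_t p11,
                             std::uint32_t fracX, std::uint32_t fracY) {
    const std::uint32_t top = lerpPacked(p00, p01, fracX);
    const std::uint32_t bottom = lerpPacked(p10, p11, fracX);
    return lerpPacked(top, bottom, fracY);
}

// The .bgra swizzle done in integer math: swap bytes 0 and 2, keep 1 and 3.
std::uint32_t swapRedBlue(std::uint32_t p) {
    return (p & 0xFF00FF00u) | ((p & 0x00FF0000u) >> 16) | ((p & 0x000000FFu) << 16);
}
```

Since the filtered result is already split into byte pairs, the swizzle amounts to exchanging the two halves of one pair, which is the kind of operation a single byte shuffle could do on a whole vector.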
The final section in the rasterization loop is the output stage:
movq xmm2,mmword ptr [ebx]
movq xmm1,mmword ptr [ebx+eax]
punpcklqdq xmm2,xmm1
movdqa xmm0,xmmword ptr [ebp-1A0h]
pblendvb xmm2,xmmword ptr [ebp-1D0h],xmm0
movq [ebx],xmm2
punpckhqdq xmm2,xmm2
movq [ebx+eax],xmm2
As I said earlier, WARP rasterizes in 2x2 quads instead of 4x1 strips, and therefore it has to fetch and store 8 bytes from two adjacent scan lines (MOVQ instructions). It packs them together into a single vector for blending and then unpacks it afterward, which in this case is probably slower than it would have been to split the pixel shader output and do two blends. The PBLENDVB instruction selects between parallel bytes in two different vectors and appears to be for color write mask support, where individual RGBA channels in the frame buffer can be enabled or disabled for write. In this case all color channels were enabled for write and there was no need to read or merge from the source, so this is all unnecessary. As with the perspective divide, though, it's pretty small fish compared to the rest of the loop.
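The masked write can be sketched in scalar C++ as a byte-wise select between the existing frame buffer contents and the shader output. This only illustrates the technique: the function names are hypothetical, the mapping of enable flags to bytes is an assumption, and the mask here is a plain bit mask rather than PBLENDVB's "high bit of each byte" convention.

```cpp
#include <cstdint>

// Build a byte mask for one 8:8:8:8 pixel from four per-byte write enables.
// Byte 0 is the lowest-addressed channel.
std::uint32_t channelMask(bool byte0, bool byte1, bool byte2, bool byte3) {
    return (byte0 ? 0x000000FFu : 0u) | (byte1 ? 0x0000FF00u : 0u) |
           (byte2 ? 0x00FF0000u : 0u) | (byte3 ? 0xFF000000u : 0u);
}

// Take bytes from src where the mask is set and from dst elsewhere.
std::uint32_t selectBytes(std::uint32_t dst, std::uint32_t src, std::uint32_t mask) {
    return (dst & ~mask) | (src & mask);
}

// Write one pixel. When every byte is enabled the old contents don't matter,
// so the read and merge can be skipped entirely.
void writePixel(std::uint32_t* dst, std::uint32_t src, std::uint32_t mask) {
    if (mask == 0xFFFFFFFFu) {
        *dst = src;
        return;
    }
    *dst = selectBytes(*dst, src, mask);
}
```

A JIT that knows the mask is all ones at code generation time could emit just the store, which is the shortcut the generated loop above doesn't take.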
It's worth noting that WARP supports x64 as well as x86. I didn't spend much time looking at the generated 64-bit code, but it looks mostly the same except for fewer register spills due to the larger register file.
What this means
As far as I can tell, WARP is a well-written and performant software renderer. It's also trivial to enable in an existing D3D10/11 based program. You could do a lot worse than using WARP, and if you need a general fallback for a bit of 3D rendering I'd seriously consider it over another one or writing a custom one. That's assuming of course that you can use it -- it's a bummer that it's only available for Vista and up and requires a DX10 or DX11 based application. As with any other software renderer, WARP still doesn't perform miracles and if you're doing any non-trivial amount of 3D graphics it will not make up for a missing 3D accelerator.
What it isn't necessarily good for is 2D image operations. It will do image blits and compositing just fine, and undoubtedly it'd be better than a random routine written on top of GetPixel() and PutPixel(). You could use it for rendering UI and get acceptable performance. However, WARP will get stomped hard by specialized 2D rendering code, as just its perspective correction code alone is bigger than the inner loop of a bilinear stretch routine, and it is severely hampered by the complexity of emulating texture fetches on the CPU. I haven't done any benchmarks but it's possible that GDI+ is faster. Therefore, even though it takes advantage of SSE2 and JIT compilation, there are still much better options for a 2D image processing core.
Appendix: WARP generated code
This is the inner loop generated by WARP for the pixel shader.
paddd xmm0,xmmword ptr [ebp-170h]
movaps xmm1,xmmword ptr [ebp-100h]
mulps xmm1,xmmword ptr [ebp-160h]
movdqa xmm2,xmmword ptr [ebp-150h]
pcmpgtd xmm2,xmm0
addps xmm1,xmmword ptr [ebp-140h]
movaps xmm0,xmmword ptr [ebp-50h]
mulps xmm0,xmmword ptr [ebp-160h]
movaps xmm3,xmmword ptr ds:[7EF917A0h]
divps xmm3,xmm1
addps xmm0,xmmword ptr [ebp-180h]
movaps xmm1,xmmword ptr [ebp-80h]
mulps xmm1,xmmword ptr [ebp-160h]
mulps xmm0,xmm3
addps xmm1,xmmword ptr [ebp-120h]
movaps xmm4,xmmword ptr ds:[7EF91790h]
minps xmm4,xmm0
mulps xmm1,xmm3
movaps xmm0,xmmword ptr ds:[7EF91780h]
maxps xmm0,xmm4
movaps xmm3,xmmword ptr ds:[7EF91790h]
minps xmm3,xmm1
mulps xmm0,xmmword ptr [ebp-40h]
movaps xmm1,xmmword ptr ds:[7EF91780h]
maxps xmm1,xmm3
addps xmm0,xmmword ptr ds:[7EF91760h]
mulps xmm1,xmmword ptr [ebp-0D0h]
movaps xmm3,xmm0
cmpeqps xmm0,xmm0
addps xmm1,xmmword ptr ds:[7EF91760h]
pand xmm0,xmm3
movaps xmm3,xmm1
cmpeqps xmm1,xmm1
mulps xmm0,xmmword ptr ds:[7EF91750h]
pand xmm1,xmm3
cvtps2dq xmm0,xmm0
mulps xmm1,xmmword ptr ds:[7EF91750h]
movdqa xmm3,xmm0
psrad xmm0,8
pand xmm3,xmmword ptr ds:[7EF91740h]
movdqa xmm4,xmmword ptr ds:[7EF91730h]
paddd xmm4,xmm0
movdqa xmm5,xmm3
pslld xmm3,10h
pmaxsd xmm0,xmmword ptr ds:[7EF917E0h]
por xmm3,xmm5
pminsd xmm0,xmmword ptr [ebp-0C0h]
pmaxsd xmm4,xmmword ptr ds:[7EF917E0h]
cvtps2dq xmm1,xmm1
pminsd xmm4,xmmword ptr [ebp-0C0h]
movdqa xmm5,xmm1
psrad xmm1,8
pand xmm5,xmmword ptr ds:[7EF91740h]
movdqa xmm6,xmmword ptr ds:[7EF91730h]
paddd xmm6,xmm1
movdqa xmm7,xmm5
pslld xmm5,10h
pmaxsd xmm1,xmmword ptr ds:[7EF917E0h]
por xmm5,xmm7
pminsd xmm1,xmmword ptr [ebp-0E0h]
pmaxsd xmm6,xmmword ptr ds:[7EF917E0h]
pmulld xmm1,xmmword ptr [ebp-0F0h]
pminsd xmm6,xmmword ptr [ebp-0E0h]
movdqa xmmword ptr [ebp-190h],xmm3
movdqa xmm3,xmm0
paddd xmm0,xmm1
paddd xmm1,xmm4
movd eax,xmm0
pextrd ecx,xmm0,1
movd xmm7,dword ptr [edx+eax*4]
pextrd eax,xmm0,2
pinsrd xmm7,dword ptr [edx+ecx*4],1
pextrd ecx,xmm0,3
pinsrd xmm7,dword ptr [edx+eax*4],2
movd eax,xmm1
pinsrd xmm7,dword ptr [edx+ecx*4],3
pextrd ecx,xmm1,1
movdqa xmm0,xmmword ptr ds:[7EF91720h]
pand xmm0,xmm7
psrlw xmm7,8
movdqa xmmword ptr [ebp-1A0h],xmm2
movd xmm2,dword ptr [edx+eax*4]
pextrd eax,xmm1,2
pinsrd xmm2,dword ptr [edx+ecx*4],1
pextrd ecx,xmm1,3
pinsrd xmm2,dword ptr [edx+eax*4],2
pmulld xmm6,xmmword ptr [ebp-0F0h]
pinsrd xmm2,dword ptr [edx+ecx*4],3
paddd xmm3,xmm6
movdqa xmm1,xmmword ptr ds:[7EF91720h]
pand xmm1,xmm2
psrlw xmm2,8
movd eax,xmm3
pextrd ecx,xmm3,1
movdqa xmmword ptr [ebp-1B0h],xmm2
movd xmm2,dword ptr [edx+eax*4]
pextrd eax,xmm3,2
pinsrd xmm2,dword ptr [edx+ecx*4],1
pextrd ecx,xmm3,3
pinsrd xmm2,dword ptr [edx+eax*4],2
paddd xmm4,xmm6
pinsrd xmm2,dword ptr [edx+ecx*4],3
movd eax,xmm4
movdqa xmm3,xmmword ptr ds:[7EF91720h]
pand xmm3,xmm2
psrlw xmm2,8
movd xmm6,dword ptr [edx+eax*4]
pextrd eax,xmm4,1
pextrd ecx,xmm4,2
pinsrd xmm6,dword ptr [edx+eax*4],1
pextrd eax,xmm4,3
pinsrd xmm6,dword ptr [edx+ecx*4],2
movdqa xmm4,xmm0
psllw xmm0,8
pinsrd xmm6,dword ptr [edx+eax*4],3
psubw xmm3,xmm4
movdqa xmm4,xmmword ptr ds:[7EF91720h]
pand xmm4,xmm6
psrlw xmm6,8
pmullw xmm3,xmm5
movdqa xmmword ptr [ebp-1C0h],xmm6
movdqa xmm6,xmm7
psllw xmm7,8
paddw xmm3,xmm0
psubw xmm2,xmm6
psrlw xmm3,1
pmullw xmm2,xmm5
movdqa xmm0,xmm1
psllw xmm1,8
paddw xmm2,xmm7
psubw xmm4,xmm0
psrlw xmm2,1
pmullw xmm4,xmm5
movdqa xmm0,xmmword ptr [ebp-1B0h]
psllw xmm0,8
paddw xmm4,xmm1
movdqa xmm1,xmmword ptr [ebp-1C0h]
psubw xmm1,xmmword ptr [ebp-1B0h]
psrlw xmm4,1
pmullw xmm1,xmm5
psubw xmm4,xmm3
paddw xmm1,xmm0
movdqa xmm0,xmmword ptr [ebp-190h]
psllw xmm0,8
psrlw xmm1,1
movdqa xmm5,xmm0
pmulhuw xmm0,xmm4
psraw xmm4,0Fh
psubw xmm1,xmm2
pand xmm4,xmm5
movdqa xmm6,xmm5
pmulhuw xmm5,xmm1
psubw xmm0,xmm4
psraw xmm1,0Fh
paddw xmm0,xmm3
pand xmm1,xmm6
movdqa xmm3,xmmword ptr ds:[7EF91710h]
pand xmm3,xmm0
psubw xmm5,xmm1
psrld xmm0,10h
paddw xmm5,xmm2
psrld xmm3,7
movdqa xmm1,xmmword ptr ds:[7EF91710h]
pand xmm1,xmm5
psrld xmm5,10h
pslld xmm3,10h
psrld xmm5,7
psrld xmm1,7
pslld xmm5,18h
pslld xmm1,8
psrld xmm0,7
movq xmm2,mmword ptr [ebx]
por xmm0,xmm1
mov eax,dword ptr [ebp-88h]
movq xmm1,mmword ptr [ebx+eax]
por xmm0,xmm3
punpcklqdq xmm2,xmm1
por xmm0,xmm5
movaps xmm1,xmmword ptr [ebp-160h]
addps xmm1,xmmword ptr ds:[7EF91700h]
movdqa xmmword ptr [ebp-1D0h],xmm0
movdqa xmm0,xmmword ptr [ebp-1A0h]
pblendvb xmm2,xmmword ptr [ebp-1D0h],xmm0
movq mmword ptr [ebx],xmm2
punpckhqdq xmm2,xmm2
movq mmword ptr [ebx+eax],xmm2
movdqa xmm0,xmmword ptr [ebp-130h]
paddd xmm0,xmmword ptr ds:[7EF916F0h]
lea ebx,[ebx+8]
sub esi,1
movdqa xmmword ptr [ebp-130h],xmm0
movaps xmmword ptr [ebp-160h],xmm1
jne 7ef91260