¶Floating point specials and shader emulation
I decided to work a bit more on my shader compiler and emulator last night, and ran into some unexpectedly ugly problems with floating point specials.
The first problem had to do with the following innocent looking expression:
This returns the magnitude of a vector. The conventional way to compute this is to compute the squared magnitude by taking the dot product of the vector with itself -- the vector analog of squaring a scalar -- and then taking the square root of the result. It turns out that shader hardware doesn't actually support a direct square root. The reciprocal square root, 1/sqrt(x) is easier to compute, and it's also useful in some other cases, most notably normalizing a vector. In this particular case, though, we need the normal square root, and therefore we need to invert it in the assembly:
dp3 r0.x, r_vec, r_vec ;compute dot(vec, vec)
rsq r0.x, r0.x ;compute reciprocal square root of squared magnitude
rcp r0.x, r0.x ;invert reciprocal square root
This little code fragment has a major surprise lurking in it, which may not be apparent until you try optimizing it. Both rsq and rcp are scalar instructions, so in cases where multiple magnitudes are being computed, the temptation is high to replace 1/rsqrt(x) with x*rsqrt(x) instead to take advantage of the much more available mul instruction:
dp3 r0.x, r_vec, r_vec
rsq r0.y, r0.x
mul r0.x, r0.x, r0.y
This works just fine... until you have the misfortune of trying to compute the length of a zero vector. In that case, the reciprocal square root operation returns +Infinity, and then the next thing that happens is the computation 0*+Infinity, which then returns a NaN (invalid result). Suck. Therefore, for the general case, that rcp has to stay there.
The real gotcha comes when you try implementing the rsq and rcp instructions in software. Reciprocal is a very slow instruction on most FPUs, being done with the divide unit and usually taking dozens of nonpipelineable clocks. Reciprocal square root may not even exist in full precision form, and 1/sqrt(x) is a horribly painful way to implement it. If you want to implement this fragment quickly in SSE, you need to take advantage of the rcpps and rsqrtps instructions, which are very fast and work in parallel on four values. They only provide limited precision up to about 2^-12, though. The WPF shader engine just goes ahead and uses the approximation result directly, which works and is accurate enough for half precision (10 bit mantissa), but technically it's not Direct3D compliant as 22 bits of precision are needed.
The usual way to improve the accuracy of the reciprocal and reciprocal square root operations is by iteration through Newton's Method. For the reciprocal, it looks like this:
x = reciprocal_approx(c);
x' = x * (2 - x * c);
...and for the reciprocal square root, it looks like this:
x = rsqrt_approx(c);
x' = 0.5 * x * (3 - x*x*c);
Assuming that you have a good estimate, these will tend to double the number of significant digits per iteration, which means that just one iteration will give us pretty good precision, and quickly, too. And unfortunately wrong, as I discovered when I implemented it. The problem is once again zero. In order for the zero case to work, we need rsq to transform 0 -> Inf and rcp to transform Inf -> 0, but thanks to the x*c expression in both of these iterations, you once again get 0 * Inf = NaN. The way I fixed it was to insert a couple of carefully placed min/max operations in the iteration, although I'm not quite sure they're 100% correct.
The specials struck again when I was trying to optimize the code generated for a gamma correction shader. Gamma correction is primarily a power operation, which expands as follows:
pow(x, y) = exp(y * log(x))
If you actually try gamma correcting an image in this manner, you'll be waiting a long time for the result. For limited precision (8-bit), you can get away with a lookup table, but that doesn't scale well to higher precision or vector computations and definitely not to a shader where floating point is involved. Therefore, in order to get a faster version working, I had to implement the log and exp instructions, which compute the base 2 logarithm and exponential functions. SSE doesn't provide you any help to do this, so you're stuck implementing this from the ground floor. It's a bit like an old BASIC interpreter, except at least you start with add and multiply. This triggered the following conversation with a friend:
"What are you doing?"
"I'm implementing the log() and exp() functions."
"Doesn't the runtime provide those?"
"This is the runtime."
Anyway, I ended up implementing exp2(x) = c*floor(x) + f(x - floor(x)) and log2(x) = c*exponent(x) + g(mantissa(x)). For the most part they're not too hard, as long as you find good polynomial expansions and make sure it's exact at the right values, i.e. you don't want exp(0) = 1.004. In the end, I used a fifth-order polynomial for exp2() and a first-order Padé approximation with a change of variables for log2()... but I digress.
As you have probably already guessed, zero rears its ugly head again here, because 0^y becomes exp2(y * log2(0)), and for this to work you need log2(0) = -Infinity and exp2(-Infinity) = 0. A couple of well-placed min/max operations in the expansions once again fixed the SSE version, but I unexpectedly ran into problems with the x87 version. I hadn't bothered to optimize the x87 version, so it simply called into expf() and scaled the result. Since I compile with /fp:fast /Oi, expf() ended up getting expanded in intrinsic form like this:
fsub st, st(1)
If you look closely, you'll see that this computes (x - round(x)), which in this case is -Inf - (-Inf) = NaN. There are two basic ways to fix this. One is to force the runtime library version of the function, which is faster for SSE2 but unfortunately quite slow for x87. The other way, which is what I did, is to just check for infinity and special-case the result.
Ordinarily I don't think much about floating point specials, and the last place I would have expected to find them is in a pixel shader. I have to say it's been a bit of a humbling (and frustrating) experience.