Implementing ELA in a pixel shader
I've been working on rewriting video algorithms in pixel shaders for 3D acceleration lately, and one of the sticking points I hit was Edge-Based Line Averaging (ELA) interpolation.
ELA is a common spatial interpolation algorithm for deinterlacing; it tries several angles around the desired point and averages the pair of samples with the lowest absolute difference. The angles are chosen as regular steps in sample location, i.e. (x+n, y-1) and (x-n, y+1) for small integer n. This produces reasonable output for cases where a temporal or motion-based estimate is not available. The specific variant I'm dealing with is part of the Yadif deinterlacing algorithm, which checks three horizontally adjacent pixels for each angle and only picks the farther offsets if the intermediate angles are better matches as well. In other words:
for dx = -2 to +2:
    error[dx] = difference(top[dx - 1], bottom[-dx - 1])
              + difference(top[dx    ], bottom[-dx    ])
              + difference(top[dx + 1], bottom[-dx + 1])

best_offset = 0

for dx = -1 down to -2:
    if error[dx] < error[best_offset]:
        best_offset = dx
    else:
        break

for dx = +1 to +2:
    if error[dx] < error[best_offset]:
        best_offset = dx
    else:
        break

result = average(top[best_offset], bottom[-best_offset])
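The selection logic above can be made concrete with a short runnable sketch. This is illustrative Python rather than anything from VirtualDub's source; the function name, the test data, and the single-pixel framing are my own:

```python
# Hypothetical reference implementation of the Yadif-style ELA selection
# logic above, for a single output pixel at column x. 'top' and 'bottom'
# are luma values from the scanlines above and below the missing line.

def ela_pixel(top, bottom, x):
    def error(dx):
        # Sum of absolute differences over three horizontally
        # adjacent samples, as in the pseudocode.
        return sum(abs(top[x + dx + i] - bottom[x - dx + i])
                   for i in (-1, 0, 1))

    best_offset = 0
    best_error = error(0)

    # Walk outward to the left, stopping at the first non-improvement.
    for dx in (-1, -2):
        e = error(dx)
        if e < best_error:
            best_error, best_offset = e, dx
        else:
            break

    # Walk outward to the right the same way.
    for dx in (1, 2):
        e = error(dx)
        if e < best_error:
            best_error, best_offset = e, dx
        else:
            break

    # Average along the chosen direction.
    return (top[x + best_offset] + bottom[x - best_offset]) / 2

# A diagonal edge: the top line's edge sits two pixels right of the
# bottom line's, so the best match is off-axis.
top = [0, 0, 0, 100, 100, 100, 100]
bot = [0, 100, 100, 100, 100, 100, 100]
print(ela_pixel(top, bot, 3))  # -> 100.0 (follows the edge; dx=0 would blend)
```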
This should be a relatively simple translation to a pixel shader -- convert each source pixel access to a texture sample. Not. It turns out that under the Direct3D ps_2_0 profile, which is what I need to target, there aren't enough temporary registers to run this algorithm. In order to run the algorithm, at least 14 source pixel fetches need to be done, and there are only 12 temp registers in ps2.0. The HLSL compiler valiantly tries to squeeze everything in and fails. Nuts.
There is an important caveat to this implementation tack, which is that I had source pixels mapped in AoS (array of structures) format, i.e. a single pixel held YCbCr components and an unused alpha channel. The CPU implementation of this algorithm, at least the way I wrote it in VirtualDub 1.9.1+, uses SoA (structure of arrays) orientation for speed. SoA arranges the data as planes of identical components, so instead of mixing components together you fetch a bunch of Y values across multiple pixels, then a bunch of Cb values, then a bunch of Cr values, and so on. I decided to try this in the pixel shader, since texture fetches were my main bottleneck. It looked something like this:
float4 top0 = tex2D(src, top_uv0); // top Y 0-3
float4 top1 = tex2D(src, top_uv1); // top Y 4-7
float4 top2 = tex2D(src, top_uv2); // top Y 8-11
float4 bot0 = tex2D(src, bot_uv0); // bottom Y 0-3
float4 bot1 = tex2D(src, bot_uv1); // bottom Y 4-7
float4 bot2 = tex2D(src, bot_uv2); // bottom Y 8-11

float4 error_left2 = abs(top0 - bot1)
                   + abs(float4(top0.yzw, top1.x) - float4(bot1.yzw, bot2.x))
                   + abs(float4(top0.zw, top1.xy) - float4(bot1.zw, bot2.xy));
Switching to SoA in a pixel shader nullifies some of the advantages of the GPU, since GPU vectors aren't as wide as those of fixed-point SIMD hardware (4 lanes vs. SSE2's 16 lanes for 8-bit data), and because some GPU hardware doesn't directly support the swizzles you need to emulate a shift. It also largely nullifies the advantage of having texture samplers, since you can no longer address the source by individual samples. Well, it turns out in this case that the extra swizzling made the situation even worse than in the AoS case: the compiler didn't even get halfway down the shader before it gave up.
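The shift emulation mentioned above is worth spelling out: shifting a 4-wide SoA register by one sample means splicing three components of one fetch with one component of the next, which is what the float4(top0.yzw, top1.x) swizzles in the shader do. A rough Python analogue, modeling float4 registers as tuples (purely illustrative):

```python
# Illustration (not HLSL) of emulating a one-sample shift across
# 4-wide registers: reg0 holds samples 0-3, reg1 holds samples 4-7.
def shift1(reg0, reg1):
    # Equivalent of float4(reg0.yzw, reg1.x) in the shader snippet.
    return (reg0[1], reg0[2], reg0[3], reg1[0])

top0 = (10, 11, 12, 13)    # luma samples 0-3
top1 = (14, 15, 16, 17)    # luma samples 4-7
print(shift1(top0, top1))  # -> (11, 12, 13, 14)
```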
The main lesson here is that sampling textures can quickly become a bottleneck in the ps_2_0 profile. Just because you have 32 texture sampling instructions available doesn't mean you can use them. I've thought about switching to a higher pixel shader profile, like ps_2_a/b, but there are reasons I want to stick with ps_2_0, the main ones being its wide platform availability, its hard resource limits, and its omission of flow control and gradient constructs.
In the end, I had to split the ELA shader into two passes, one which just wrote out offsets to a temporary buffer and another pass that did the interpolation. It works, but the GPU version is only able to attain about 40 fps, whereas the CPU version can hit 60 fps with less than 100% of one core. I guess that mainly speaks to my lopsided computer spec more than anything else. That having been said, it kind of calls into question the "GPUs are much faster" idea. I have no doubt this would run tremendously faster on a GeForce 8800GTX, but it seems that there are plenty of GPUs out there where using the GPU isn't a guaranteed win over the CPU, even for algorithms that are fairly parallelizable.
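What the two passes compute can be sketched on the CPU side. This is a rough Python illustration of the split, not the actual shaders; the per-pixel offset buffer layout and all names are assumptions of mine:

```python
# Rough CPU-side sketch (Python, not HLSL) of the two-pass split
# described above: pass 1 writes each pixel's best edge offset to an
# intermediate buffer, pass 2 reads it back and interpolates.

def error(top, bottom, x, dx):
    # Sum of absolute differences over three adjacent samples.
    return sum(abs(top[x + dx + i] - bottom[x - dx + i]) for i in (-1, 0, 1))

def pass1_offsets(top, bottom, width):
    offsets = [0] * width
    for x in range(3, width - 3):             # skip borders for simplicity
        best, best_err = 0, error(top, bottom, x, 0)
        for walk in ((-1, -2), (1, 2)):       # walk out left, then right
            for dx in walk:
                e = error(top, bottom, x, dx)
                if e < best_err:
                    best, best_err = dx, e
                else:
                    break                     # stop at first non-improvement
        offsets[x] = best
    return offsets

def pass2_interpolate(top, bottom, offsets, width):
    out = [0.0] * width
    for x in range(3, width - 3):
        dx = offsets[x]
        out[x] = (top[x + dx] + bottom[x - dx]) / 2
    return out

# Flat field: every offset ties, so the walk stays at dx = 0 and the
# output is the plain vertical average.
top, bot = [10] * 8, [30] * 8
offs = pass1_offsets(top, bot, 8)
print(pass2_interpolate(top, bot, offs, 8)[3])  # -> 20.0
```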