Current version

v1.10.4 (stable)


Main page
Archived news
Plugin SDK
Knowledge base
Contact info
Other projects


Blog Archive

A shader compiler for VirtualDub

For a while now I've been wishing that I had some nicer way to write video filters. I've been working on VirtualDub long enough that the actions necessary to create a new video filter, write the callbacks, write the pixel processing routines, etc. are mostly second nature. However, it's still a lot of work to toss out a quick video filter. An example was a person who emailed me a while back asking for a YCbCr-to-RGB video filter, because he had the interesting situation of trying to record YPbPr component outputs connected to the RGB component inputs of a video capture card. In the end, it took me a couple of hours to bang out a test filter that ideally should have only taken a few minutes. It would have been nice to come up with a language to write quick filters in, even though it might be slower than straight C++ code:

Well, a couple of weeks ago I sat down and starting banging out such a system. Presenting VDShader:

You need VirtualDub to process video with it (I tested it with 1.6.19 and 1.7.2), but if you just want to play with the editor and compiler, you can run the shader editor as follows:

rundll32 vdshader.vdf,OpenEditor

The VDShader video filter is probably the most complex video filter for VirtualDub that I've ever written, because it consists of:

There are still a lot of things missing from the filter, particularly a lot of primitives in the language, but to give you some idea of what a shader looks like, here's the convert-to-grayscale example shader:

sampler src;
float4 PS(float2 uv: TEXCOORD0) : COLOR0 {
    float4 px;
    return dot(tex2D(src, uv).rgb, float3(0.30, 0.59, 0.11));
technique { pass { PixelShader = compile ps_3_0 PS(); } }

The shader language parsed by the filter is similar to a subset of the kinds of vector processing languages processed by NVIDIA's Cg compiler and Microsoft's High Level Shader Language (HLSL). I chose a similar syntax because I'm familiar with HLSL and this style of syntax is the most fluid syntax I've seen so far for working with colors and vectors. The idea is that if someone wants to try a really quick image processing operation, like scaling green by 80%, it can be expressed in this filter by shader code as simple as (pixel.g * 0.8).

(Details about the internals after the click.)

I had made a previous attempt at something like this with my earlier "Filter Factory" filter, but that had several shortcomings. The main one is that it worked in fixed point, which made it hard to use. Second, it had non-standard syntax. Third, it only had an interpreter, so it was slow. Fourth, I never really documented the language. The killer, though, was that the compiler had a ton of bugs in it, so it never really worked right. Part of the reason the compiler was so bad was that I'd written Filter Factory a long time ago, and had less experience in writing compilers, or even parsers/lexers. Another reason, though, is that Filter Factory built a tree from the code and then compiled that, which meant rather complex control flow during the compilation process, and meant that I spent a lot more time whacking tree nodes instead of actually generating code.

The parser and lexer weren't very hard to write, since I've written enough of them by now. If you're just starting out, you might be tempted to try lex/yacc (flex/bison for free software). They're good to bang something out quickly, but if you're trying to learn, I'd recommend ditching them and writing the lexer and parser yourself. Most programming languages with C-like syntax don't need a full regular expression based lexer or LALR(1) table-based parser -- they can be parsed fine with a simple scanner and a hand-written parser. I've found that it's a lot easier to deal with attribute passing and error handling with a hand-written recursive descent parser than to try to push everything in and out of a yacc-based state machine parser. I'm pretty sure that Visual C++ uses a parser of this type, and I think GCC also switched from a generated parser to a hand-written parser (which probably explains its lowered tendency to just report "parse error").

What I typically use now is a hybrid parser, with recursive descent for structures and operator-precedence for expressions. Operator-precedence avoids having to write zillions of routines for every precedence level, like ParseAdditiveExpression() and ParseMultiplicativeExpression(). (I think Avisynth's expression parser works this way.) I've also evolved a fashion of error handling which doesn't require exceptions, which is to have every parse routine return a bool and propagate up false return values, have the set error routine always return false, and just return the result of the error routine:

if (!ParseExpression(&value))
    return false;
return lexer.SetErrorF("Cannot convert from type '%s' to type '%s'", GetTypeName(vA.mType), GetTypeName(vB.mType));

This is nice for when you're in an environment that doesn't have C++ exception support.

The compiler, this time, is structured as a bottom-up compiler, meaning that it generates shader instructions on the fly as it parses the source. Variables get allocated registers on the spot, and operations on them immediately emit arithmetic instructions. This is a simpler type of compiler to write, but not building an abstract syntax tree (AST) makes some types of high-level transformations difficult or impossible. For instance, algebraic identities are difficult to apply to the assembly output, since complex operations like matrix multiplication are already exploded in the parser. On the other hand, the optimizer turned out to perform better than I had expected, especially given that it only does copy propagation and dead store elimination. Copy propagation detects when copies of values are used and replaces references to the copy with the original, so that given this set of instructions:

mov r0, c0
add r1, r1, r0

...the compiler detects that r0 (temp 0) is a copy of c0 (constant 0), and changes the ADD instruction to use c0 directly:

mov r0, c0
add r1, r1, c0

The dead store elimination (a form of dead code elimination) pass then detects that r0 is never used, and deletes the MOV instruction, producing a better result:

add r1, r1, c0

A factor that made these optimizations easier to implement is that the bottom-up compiler naturally generates code in single static assignment (SSA) form. The old Filter Factory compiler reused registers, which made it really difficult to apply such optimizations because full register life information would have been needed. With SSA, there's only one instruction that writes to a virtual register, meaning that you only need to check if that register is used at all and then kill the instruction that writes it if not. It also helps that I don't support looping or branching, although branching is the easier of the two (it can be handled with a merge instruction, which in the case of a pixel shader would be something like cmp or cnd).

The RGB to YUV shader thus produces the following assembly:

def c0, 0.5, 0.299, -0.1687, 0
def c1, -0.4187, 0.587, -0.3313, 0
def c2, -0.0813, 0.114, 0.5, 0
def c3, 0.501961, 0, 0.501961, 0
texld r1, r0, s0
mul, r1.x, c0
mad, r1.y, c1, r2
mad, r1.z, c2, r3
add, r4, c3
mov, r5
mov r6.w, r1.w
mov o0, r6

This isn't too bad, especially considering that the unoptimized assembly is usually horrible, with as much as ~80% being useless register-to-register moves, and gets really bad whenever the compiler inlines a function call. For comparison, this is what the Microsoft HLSL compiler generates from the same source (fxc /Tfx_2_0 rgb2yuv.fx):

def c0, 0.5, -0.41870001, -0.0812999979, 0
def c1, 0.298999995, 0.587000012, 0.114, 0
def c2, -0.168699995, -0.33129999, 0.5, 0
def c3, 0.501960814, 0, 0, 0
dcl_texcoord v0.xy
dcl_2d s0
texld r0, v0, s0
dp3 r1.x, r0, c0
dp3 r1.y, r0, c1
dp3 r1.z, r0, c2
mov oC0.w, r0.w
add, r1, c3.xyxw

The main problem I've noticed is that my compiler is unable to deal with splitting or merging instructions as necessary to eliminate redundant mov instructions. You might also have noticed that I haven't bothered yet to make the output of my compiler actually valid D3D assembly, since I don't actually ever try running the shader on hardware (there isn't a vertex shader anyway).

Following this, the code goes through a vector-to-scalar recompiler, which takes all of the vector instructions -- which process as many as four values in parallel -- and recompiles them to individual scalar instructions which only handle one component at a time. The result then goes through similar copy propagation and dead store elimination routines. This catches a lot of the redundancy that the vector compiler is unable to eliminate. In the case of the above, this is the result:

def c0, 0.5
def c1, 0.299
def c2, -0.1687
def c3, -0.4187
def c4, 0.587
def c5, -0.3313
def c6, -0.0813
def c7, 0.114
def c8, 0.5
def c9, 0.501961
def c10, 0
def c11, 0.501961
def c12, 0
def c13, 0
def c14, 0
def c15, 0
texld r0, c12, c13
unpackr r1, r0
unpackg r2, r0
unpackb r3, r0
unpacka r0, r0
mul r4, r1, c0
mul r5, r1, c1
mul r1, r1, c2
mul r6, r2, c3
add r4, r6, r4
mul r6, r2, c4
add r5, r6, r5
mul r2, r2, c5
add r1, r2, r1
mul r2, r3, c6
add r4, r2, r4
mul r2, r3, c7
add r5, r2, r5
mul r3, r3, c8
add r1, r3, r1
add r4, r4, c9
add r5, r5, c10
add r1, r1, c11
end c0, c0, c0

There do tend to be a lot of useless constants in the result, since they don't really hurt (I have no limit, unlike real hardware). One big optimization that is missing is constant folding, which would compile out useless operations on only constants, like ADD r0, c0, c1. In vector languages, it's common to do operations like (vec * float3(0.5, 1.0, 0.4)), because a vector operation is the same cost as a scalar operation and there's no cost for the useless multiply of green by 1.0. When recompiling to scalar code, however, the compiler could recognize that one of the factors is 1.0 and replace the multiply with a move, which would then get dropped out by copy propagation.

Following this, the x86 JIT compiler kicks in to compile the shader to machine code. I'll freely admit that my JIT compiler sucks, because it doesn't even try to do register caching -- if you look at the disassembly that appears, you'll see a ton of useless loads and stores. I won't even bother pasting it here because it's too long. However, it turns out that's not the biggest problem. JITting turned out to help a lot less than I'd expected because the biggest bottleneck right now isn't the shader code itself, but the texture sampling, because right now it's a brute force bilinear sampling routine. Because the shader uses x87 floating point, I can't use MMX in that routine and thus it's quite slow at the moment. I'm probably going to try writing an SSE2 version, but the main thing I need to do is finish the sampler support in the compiler so that you can switch the filtering mode to POINT, which would eliminate the need for the bilinear sampling entirely in many shaders.

The last bit of code that I'd decided to write from scratch is the text editor. Truth be told, I hadn't expected or wanted to write a custom text editor for this, but I ended up doing so because I had grown tired of repeatedly having to work around shortcomings and bugs in the Microsoft Rich Edit control, such as:

After I'd had enough of this I decided to rip RichEdit out and bang out a new text editor over the weekend. It's not the greatest editor in the world, particularly since it doesn't support Undo/Redo right now, but at least syntax highlighting was a lot easier to implement and didn't require stupid workarounds. It also only works on 8-bit text right now, so it won't pass Michael Kaplan's "buttload of diacritics" test, but given that this is a source code editor I didn't worry about localization. Besides, converting it to Unicode at least wouldn't be too bad.

One of the toughest features to work out turned out to be caret positioning in word wrap mode -- I'd originally written it using iterators stored as (paragraph, char offset), but discovered that it led to the undesirable behavior that you couldn't place the caret at the end of a wrapped line, because that was the same character position as the start of the next. I ended up rewriting the editor to use (paragraph, line, char offset) instead, which turned out to work better and was actually simpler in some ways. I'm also proud of the fact that the editor doesn't use any fixed size line buffers anywhere, so it doesn't have any line length limitations, like the 2048 character limitation in Visual C++ 6.0.

Anyway, if you run into bugs in the filter, feel free to drop me a line. This has been somewhat of a research project, so I don't know how much more time I'll be putting into it, but I figure it'd at least get it to a useful enough point that simple filters could be written without the need for a real compiler.


This blog was originally open for comments when this entry was first posted, but was later closed and then removed due to spam and after a migration away from the original blog software. Unfortunately, it would have been a lot of work to reformat the comments to republish them. The author thanks everyone who posted comments and added to the discussion.