§ A shader compiler for VirtualDub

For a while now I've been wishing that I had some nicer way to write video filters. I've been working on VirtualDub long enough that the actions necessary to create a new video filter, write the callbacks, write the pixel processing routines, etc. are mostly second nature. However, it's still a lot of work to toss out a quick video filter. An example was a person who emailed me a while back asking for a YCbCr-to-RGB video filter, because he had the interesting situation of trying to record YPbPr component outputs connected to the RGB component inputs of a video capture card. In the end, it took me a couple of hours to bang out a test filter that ideally should have taken only a few minutes. It would have been nice to have a language for writing quick filters in, even if it might be slower than straight C++ code.

Well, a couple of weeks ago I sat down and started banging out such a system. Presenting VDShader.

You need VirtualDub to process video with it (I tested it with 1.6.19 and 1.7.2), but if you just want to play with the editor and compiler, you can run the shader editor as follows:

rundll32 vdshader.vdf,OpenEditor

The VDShader video filter is probably the most complex video filter for VirtualDub that I've ever written, because it consists of a lexer and parser, a bottom-up shader compiler with its own optimizer, a vector-to-scalar recompiler, an x86 JIT back end, and a custom text editor with syntax highlighting.

There are still a lot of things missing from the filter, particularly a lot of primitives in the language, but to give you some idea of what a shader looks like, here's the convert-to-grayscale example shader:

sampler src;
float4 PS(float2 uv : TEXCOORD0) : COLOR0 {
    return dot(tex2D(src, uv).rgb, float3(0.30, 0.59, 0.11));
}
technique { pass { PixelShader = compile ps_3_0 PS(); } }

The shader language parsed by the filter is similar to a subset of the vector processing languages handled by NVIDIA's Cg compiler and Microsoft's High Level Shader Language (HLSL). I chose a similar syntax because I'm familiar with HLSL, and this style of syntax is the most fluid I've seen so far for working with colors and vectors. The idea is that if someone wants to try a really quick image processing operation, like scaling green by 80%, it can be expressed in this filter by shader code as simple as (pixel.g * 0.8).

(Details about the internals after the click.)

I had made a previous attempt at something like this with my earlier "Filter Factory" filter, but that had several shortcomings. The main one is that it worked in fixed point, which made it hard to use. Second, it had non-standard syntax. Third, it only had an interpreter, so it was slow. Fourth, I never really documented the language. The killer, though, was that the compiler had a ton of bugs in it, so it never really worked right. Part of the reason the compiler was so bad was that I'd written Filter Factory a long time ago, and had less experience in writing compilers, or even parsers/lexers. Another reason, though, is that Filter Factory built a tree from the code and then compiled that, which meant rather complex control flow during the compilation process, and meant that I spent a lot more time whacking tree nodes instead of actually generating code.

The parser and lexer weren't very hard to write, since I've written enough of them by now. If you're just starting out, you might be tempted to try lex/yacc (flex/bison for free software). They're good to bang something out quickly, but if you're trying to learn, I'd recommend ditching them and writing the lexer and parser yourself. Most programming languages with C-like syntax don't need a full regular expression based lexer or LALR(1) table-based parser -- they can be parsed fine with a simple scanner and a hand-written parser. I've found that it's a lot easier to deal with attribute passing and error handling with a hand-written recursive descent parser than to try to push everything in and out of a yacc-based state machine parser. I'm pretty sure that Visual C++ uses a parser of this type, and I think GCC also switched from a generated parser to a hand-written parser (which probably explains its lowered tendency to just report "parse error").

What I typically use now is a hybrid parser, with recursive descent for structures and operator-precedence for expressions. Operator-precedence avoids having to write zillions of routines for every precedence level, like ParseAdditiveExpression() and ParseMultiplicativeExpression(). (I think Avisynth's expression parser works this way.) I've also evolved a fashion of error handling which doesn't require exceptions, which is to have every parse routine return a bool and propagate up false return values, have the set error routine always return false, and just return the result of the error routine:

// Propagating a failure from a sub-parse:
if (!ParseExpression(&value))
    return false;

// Reporting an error -- SetErrorF() always returns false, so its
// result can be returned directly:
return lexer.SetErrorF("Cannot convert from type '%s' to type '%s'",
    GetTypeName(vA.mType), GetTypeName(vB.mType));

This is nice for when you're in an environment that doesn't have C++ exception support.
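A minimal sketch of this hybrid style -- a hand-rolled scanner plus a precedence-climbing expression parser using the bool-return error convention -- might look like the following. All names here are hypothetical illustrations, not VDShader's actual code:

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Toy precedence-climbing parser: one ParseExpr() routine handles every
// precedence level, instead of a ParseAdditiveExpression() /
// ParseMultiplicativeExpression() ladder. Errors propagate as false returns.
struct ExprParser {
    const char *p;
    std::string error;

    bool SetError(const char *msg) { error = msg; return false; }

    bool ParsePrimary(double &out) {
        while (std::isspace((unsigned char)*p)) ++p;
        if (*p == '(') {
            ++p;
            if (!ParseExpr(out, 1)) return false;
            while (std::isspace((unsigned char)*p)) ++p;
            if (*p != ')') return SetError("expected ')'");
            ++p;
            return true;
        }
        if (std::isdigit((unsigned char)*p) || *p == '.') {
            char *end;
            out = std::strtod(p, &end);
            p = end;
            return true;
        }
        return SetError("expected number or '('");
    }

    static int Prec(char op) {
        return (op == '+' || op == '-') ? 1
             : (op == '*' || op == '/') ? 2 : 0;
    }

    bool ParseExpr(double &lhs, int minPrec) {
        if (!ParsePrimary(lhs)) return false;
        for (;;) {
            while (std::isspace((unsigned char)*p)) ++p;
            char op = *p;
            int prec = Prec(op);
            if (prec == 0 || prec < minPrec) return true;
            ++p;
            double rhs;
            if (!ParseExpr(rhs, prec + 1)) return false;  // left-associative
            switch (op) {
                case '+': lhs += rhs; break;
                case '-': lhs -= rhs; break;
                case '*': lhs *= rhs; break;
                case '/': lhs /= rhs; break;
            }
        }
    }
};

bool Evaluate(const char *src, double &out) {
    ExprParser parser{src};
    return parser.ParseExpr(out, 1);
}
```

Because the error routine returns false, every error path collapses into a single return statement, and no exceptions are needed anywhere in the parser.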

The compiler, this time, is structured as a bottom-up compiler, meaning that it generates shader instructions on the fly as it parses the source. Variables get allocated registers on the spot, and operations on them immediately emit arithmetic instructions. This is a simpler type of compiler to write, but not building an abstract syntax tree (AST) makes some types of high-level transformations difficult or impossible. For instance, algebraic identities are difficult to apply to the assembly output, since complex operations like matrix multiplication are already exploded in the parser. On the other hand, the optimizer turned out to perform better than I had expected, especially given that it only does copy propagation and dead store elimination. Copy propagation detects when copies of values are used and replaces references to the copy with the original, so that given this set of instructions:

mov r0, c0
add r1, r1, r0

...the compiler detects that r0 (temp 0) is a copy of c0 (constant 0), and changes the ADD instruction to use c0 directly:

mov r0, c0
add r1, r1, c0

The dead store elimination (a form of dead code elimination) pass then detects that r0 is never used, and deletes the MOV instruction, producing a better result:

add r1, r1, c0

A factor that made these optimizations easier to implement is that the bottom-up compiler naturally generates code in static single assignment (SSA) form. The old Filter Factory compiler reused registers, which made it really difficult to apply such optimizations because full register lifetime information would have been needed. With SSA, there's only one instruction that writes to each virtual register, meaning that you only need to check whether that register is used at all, and kill the instruction that writes it if not. It also helps that I don't support looping or branching, although branching is the easier of the two (it can be handled with a merge instruction, which in the case of a pixel shader would be something like cmp or cnd).
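To make the two passes concrete, here is a toy version of copy propagation and dead store elimination over an invented SSA-style instruction list. It is a sketch of the idea only -- the instruction format, register naming, and the "registers starting with 'o' are outputs" rule are assumptions for illustration, not the filter's actual IR:

```cpp
#include <string>
#include <vector>

struct Inst {
    std::string op;                  // "mov", "add", ...
    std::string dst;                 // destination register
    std::vector<std::string> src;    // source registers
};

// Copy propagation: wherever the destination of a mov is later read,
// substitute the mov's source instead.
void PropagateCopies(std::vector<Inst> &code) {
    for (size_t i = 0; i < code.size(); ++i) {
        if (code[i].op != "mov")
            continue;
        for (size_t j = i + 1; j < code.size(); ++j)
            for (std::string &s : code[j].src)
                if (s == code[i].dst)
                    s = code[i].src[0];
    }
}

// Dead store elimination: because the code is in SSA form, an instruction's
// destination is dead iff no later instruction reads it.
void EliminateDeadStores(std::vector<Inst> &code) {
    for (size_t i = code.size(); i-- > 0; ) {
        const std::string &dst = code[i].dst;
        if (!dst.empty() && dst[0] == 'o')   // outputs are always live
            continue;
        bool used = false;
        for (size_t j = i + 1; j < code.size() && !used; ++j)
            for (const std::string &s : code[j].src)
                if (s == dst) { used = true; break; }
        if (!used)
            code.erase(code.begin() + i);
    }
}
```

Running PropagateCopies followed by EliminateDeadStores on a mov/add pair like the one above rewrites the add to read the constant directly and then deletes the now-unreferenced mov, exactly as in the hand-worked example.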

The RGB to YUV shader thus produces the following assembly:

def c0, 0.5, 0.299, -0.1687, 0
def c1, -0.4187, 0.587, -0.3313, 0
def c2, -0.0813, 0.114, 0.5, 0
def c3, 0.501961, 0, 0.501961, 0
texld r1, r0, s0
mul r2, r1.x, c0
mad r3, r1.y, c1, r2
mad r4, r1.z, c2, r3
add r5, r4, c3
mov r6, r5
mov r6.w, r1.w
mov o0, r6

This isn't too bad, especially considering that the unoptimized assembly is usually horrible, with as much as ~80% being useless register-to-register moves, and gets really bad whenever the compiler inlines a function call. For comparison, this is what the Microsoft HLSL compiler generates from the same source (fxc /Tfx_2_0 rgb2yuv.fx):

def c0, 0.5, -0.41870001, -0.0812999979, 0
def c1, 0.298999995, 0.587000012, 0.114, 0
def c2, -0.168699995, -0.33129999, 0.5, 0
def c3, 0.501960814, 0, 0, 0
dcl_texcoord v0.xy
dcl_2d s0
texld r0, v0, s0
dp3 r1.x, r0, c0
dp3 r1.y, r0, c1
dp3 r1.z, r0, c2
mov oC0.w, r0.w
add oC0.xyz, r1, c3.xyxw

The main problem I've noticed is that my compiler is unable to deal with splitting or merging instructions as necessary to eliminate redundant mov instructions. You might also have noticed that I haven't bothered yet to make the output of my compiler actually valid D3D assembly, since I don't actually ever try running the shader on hardware (there isn't a vertex shader anyway).

Following this, the code goes through a vector-to-scalar recompiler, which takes all of the vector instructions -- which process as many as four values in parallel -- and recompiles them to individual scalar instructions which only handle one component at a time. The result then goes through similar copy propagation and dead store elimination routines. This catches a lot of the redundancy that the vector compiler is unable to eliminate. In the case of the above, this is the result:

def c0, 0.5
def c1, 0.299
def c2, -0.1687
def c3, -0.4187
def c4, 0.587
def c5, -0.3313
def c6, -0.0813
def c7, 0.114
def c8, 0.5
def c9, 0.501961
def c10, 0
def c11, 0.501961
def c12, 0
def c13, 0
def c14, 0
def c15, 0
texld r0, c12, c13
unpackr r1, r0
unpackg r2, r0
unpackb r3, r0
unpacka r0, r0
mul r4, r1, c0
mul r5, r1, c1
mul r1, r1, c2
mul r6, r2, c3
add r4, r6, r4
mul r6, r2, c4
add r5, r6, r5
mul r2, r2, c5
add r1, r2, r1
mul r2, r3, c6
add r4, r2, r4
mul r2, r3, c7
add r5, r2, r5
mul r3, r3, c8
add r1, r3, r1
add r4, r4, c9
add r5, r5, c10
add r1, r1, c11
end c0, c0, c0

There do tend to be a lot of useless constants in the result, since they don't really hurt (I have no limit, unlike real hardware). One big optimization that is missing is constant folding, which would compile out useless operations on only constants, like ADD r0, c0, c1. In vector languages, it's common to do operations like (vec * float3(0.5, 1.0, 0.4)), because a vector operation is the same cost as a scalar operation and there's no cost for the useless multiply of green by 1.0. When recompiling to scalar code, however, the compiler could recognize that one of the factors is 1.0 and replace the multiply with a move, which would then get dropped out by copy propagation.
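A sketch of what such an identity-folding step could look like on an invented scalar IR follows; the instruction format and register names are assumptions for illustration only. A multiply by a constant 1.0, or an add of a constant 0.0, degrades to a plain mov, which the copy propagation pass can then eliminate entirely:

```cpp
#include <map>
#include <string>
#include <vector>

// Three-address scalar instruction (invented for this sketch).
struct SInst {
    std::string op, dst, a, b;
};

// Rewrite multiplies by 1.0 and adds of 0.0 into movs, using a table of
// known constant register values.
void FoldIdentities(std::vector<SInst> &code,
                    const std::map<std::string, double> &consts) {
    auto IsConst = [&](const std::string &reg, double v) {
        auto it = consts.find(reg);
        return it != consts.end() && it->second == v;
    };
    for (SInst &in : code) {
        if (in.op == "mul" && IsConst(in.b, 1.0))
            in = {"mov", in.dst, in.a, ""};
        else if (in.op == "mul" && IsConst(in.a, 1.0))
            in = {"mov", in.dst, in.b, ""};
        else if (in.op == "add" && IsConst(in.b, 0.0))
            in = {"mov", in.dst, in.a, ""};
    }
}
```

A fuller version would also evaluate operations whose operands are all constants, but even this small rule handles the common (vec * float3(0.5, 1.0, 0.4)) pattern once the code has been scalarized.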

Following this, the x86 JIT compiler kicks in to compile the shader to machine code. I'll freely admit that my JIT compiler sucks, because it doesn't even try to do register caching -- if you look at the disassembly that appears, you'll see a ton of useless loads and stores. I won't even bother pasting it here because it's too long. However, it turns out that's not the biggest problem. JITting turned out to help a lot less than I'd expected because the biggest bottleneck right now isn't the shader code itself, but the texture sampling, because right now it's a brute force bilinear sampling routine. Because the shader uses x87 floating point, I can't use MMX in that routine and thus it's quite slow at the moment. I'm probably going to try writing an SSE2 version, but the main thing I need to do is finish the sampler support in the compiler so that you can switch the filtering mode to POINT, which would eliminate the need for the bilinear sampling entirely in many shaders.
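To show why the sampling dominates, here is a rough sketch of what a brute-force scalar bilinear fetch looks like: four texel reads plus three lerps per channel for every tex2D hit. The pixel layout, clamp-to-edge addressing, and names are assumptions for illustration, not VDShader's actual routine:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Texture {
    const uint8_t *pixels;   // assumed 4 bytes per pixel, channel-interleaved
    int width, height, pitch;
};

static float Lerp(float a, float b, float t) { return a + (b - a) * t; }

// Sample one channel (0..3) at normalized coordinates (u, v), clamping
// addresses to the edge of the texture.
float SampleBilinear(const Texture &tex, float u, float v, int chan) {
    float x = u * tex.width  - 0.5f;
    float y = v * tex.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;

    auto Texel = [&](int px, int py) {
        px = std::min(std::max(px, 0), tex.width  - 1);
        py = std::min(std::max(py, 0), tex.height - 1);
        return (float)tex.pixels[py * tex.pitch + px * 4 + chan];
    };

    float top = Lerp(Texel(x0, y0),     Texel(x0 + 1, y0),     fx);
    float bot = Lerp(Texel(x0, y0 + 1), Texel(x0 + 1, y0 + 1), fx);
    return Lerp(top, bot, fy) / 255.0f;
}
```

With POINT filtering, all of this collapses to a single clamped fetch, which is why finishing the sampler state support is the higher-leverage fix.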

The last bit of code that I'd decided to write from scratch is the text editor. Truth be told, I hadn't expected or wanted to write a custom text editor for this, but I ended up doing so because I had grown tired of repeatedly having to work around shortcomings and bugs in the Microsoft Rich Edit control.

After I'd had enough of this I decided to rip RichEdit out and bang out a new text editor over the weekend. It's not the greatest editor in the world, particularly since it doesn't support Undo/Redo right now, but at least syntax highlighting was a lot easier to implement and didn't require stupid workarounds. It also only works on 8-bit text right now, so it won't pass Michael Kaplan's "buttload of diacritics" test, but given that this is a source code editor I didn't worry about localization. Besides, converting it to Unicode at least wouldn't be too bad.

One of the toughest features to work out turned out to be caret positioning in word wrap mode -- I'd originally written it using iterators stored as (paragraph, char offset), but discovered that it led to the undesirable behavior that you couldn't place the caret at the end of a wrapped line, because that was the same character position as the start of the next. I ended up rewriting the editor to use (paragraph, line, char offset) instead, which turned out to work better and was actually simpler in some ways. I'm also proud of the fact that the editor doesn't use any fixed size line buffers anywhere, so it doesn't have any line length limitations, like the 2048 character limitation in Visual C++ 6.0.
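The caret problem can be illustrated with a toy position type (hypothetical names, not the editor's actual code). With only (paragraph, char offset), the end of a wrapped line and the start of the next line collapse into one position; adding the line index keeps them distinct:

```cpp
// Caret position in word wrap mode: the line index disambiguates the two
// positions that share the same character offset at a wrap boundary.
struct CaretPos {
    int paragraph;
    int line;        // visual line within the paragraph, after word wrap
    int offset;      // character offset within the paragraph

    bool operator==(const CaretPos &o) const {
        return paragraph == o.paragraph && line == o.line && offset == o.offset;
    }
};
```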

Anyway, if you run into bugs in the filter, feel free to drop me a line. This has been somewhat of a research project, so I don't know how much more time I'll be putting into it, but I figured I'd at least get it to a useful enough point that simple filters could be written without the need for a real compiler.


Comments posted:

My knowledge in this area is nearly zero, but your article made me wonder why you couldn't have just used the GPU with an existing shader language like GLSL or HLSL? You wouldn't have had to write your own parser/interpreter/JIT and you would have benefited from the crazy raw speed available on the GPU.

The thing that I have no idea about is whether the existing shader languages fit the video filter type of problem. I refer you to the general-purpose GPU (GPGPU) site for interest's sake.

Michael Beauregard - 08 07 07 - 09:15

I admit that writing parsers/compilers is a rather tempting task for every developer, but why didn't you reuse the D3DX compiler to emit the PS3 asm? This way, you could have the full PS3 syntax for free and could have concentrated on the JITter only.

As for the text editor: why did you decide against Scintilla?

Björn Graf (link) - 08 07 07 - 11:25

Ah, classic questions. Allow me to answer:

I didn't want to use the GPU because I don't want to rely on the GPU -- I want a CPU fallback in all cases. You can't run anything on the GPU if you are on a Terminal Services / Remote Desktop session, and with Direct3D 9 you also have the hassle of lost devices, which can hose your output. It's also fairly difficult to do any complex image processing if you have a GPU that doesn't support Pixel Shader 2.0, and it can be annoying on straight PS2.0 -- bicubic sampling will easily take around 7 tex ops + 15 ALU ops out of a total of 32+64. Then there are issues like, what if you want to run on images that are bigger than the GPU can handle? Some GPUs can't handle above 1Kx1K, many can't do above 2Kx2K, and even the newest are limited to 4Kx4K.

Second question, HLSL/GLSL: GLSL is not usable for the first point as the compiler is built into the display driver itself, so you can't ever see the assembly. HLSL is usable, and I have done some explorations into interpreting Direct3D shader bytecode, but there's a severe problem in that this introduces a dependency on the Microsoft D3DX shader compiler. D3DX is a royal PITA to redistribute -- according to the license you can ONLY redistribute it with DXSETUP, which means that this tiny 200K filter would now require a multi-megabyte installer that can only run as administrator. And for bonus points, you need to redistribute the version that corresponds to the DirectX SDK that you compiled with, but all of them come up with "Installing DirectX 9.0c" as the title, which means that users tend to cancel them thinking that they already have DirectX 9.0c installed, not realizing that it's D3DX9_31.DLL that's needed.

Another factor is that I'm not sure I want to stay compatible with GLSL/HLSL, because pure hardware shaders aren't exactly the easiest to write filters in. Convolution filters in particular are an extremely common primitive, but they often require several lines of code per tap, and it'd be nice to have an easier way to do them. (I'm thinking of tex2DOffset() as a primitive, which takes an integer texel space offset. D3D9 lacks this; I forget if D3D10 and GLSL do. NV_fragment_program4 might.) Setting up temp render targets tends to be a pain too. I'm talking from experience here -- VirtualDub's D3D display code does take HLSL shaders, and it took a lot more code than it should have to do ELA deinterlacing with a bicubic stretch.

I did actually look into the Cg compiler at one point, but it turns out that NVIDIA only released the front end, and not the code generator.

As for deciding against Scintilla: Scintilla's a fine component, sure. The answers are that (a) I hate dependencies, and (b) I've always wanted to write a text editor. Didn't want to do it now, but I already had one half written for a game walkthrough reader I had been working on, so I just finished it. I actually also already had a low-level x86 JIT back end lying around that I had written in exploration of starting a Purify clone... as you may imagine, I have a lot of unfinished stuff on my hard drive.

Phaeron - 08 07 07 - 17:10

Oh, forgot to comment on the GPGPU stuff:

I did look a bit at AMD's Close to Metal (CTM) and NVIDIA's CUDA, the vendor GPGPU solutions, and neither looked that interesting for my purposes. The killer is that both have very high hardware requirements: X1x00 for CTM, GeForce 8 for CUDA. CTM was a lot lower level and chip-specific than I would have liked, but they've expanded it a lot since the last time I looked, adding a lot of missing support for texture filtering; you're essentially building HW command buffers at this point, which is really cool. The bad part of CTM is that you can't interop with the display. CUDA is more abstracted, but they do have a nice C compiler, and it *does* have support for interop with both Direct3D and OpenGL. It looked like OpenGL was the smoother of the two, as it just wrote into a buffer and you could load it into a texture from there; Direct3D was rougher, as you could only write into vertex buffers and would have to draw a million point primitives from there, which would slow down hardware accelerated display.

Phaeron - 08 07 07 - 17:17

Very interesting. Thanks for sharing more of your thinking on the subject. More importantly, it looks like you're having fun.


Michael Beauregard - 08 07 07 - 19:59

Oh, yeah, the fun of distributing DX; how could I ever forget that one... Side note: You always should start any DX9c setups in quiet mode to prevent users from doing not-so-educated actions :]

Björn Graf (link) - 08 07 07 - 22:07

hm, swShader did something like this, until they sold their soul.

Gabest - 09 07 07 - 18:20

So, what's new in 1.7.2 ? I only have 1.7.1.

mikesum32 - 10 07 07 - 02:40

I've been working on something similar to this for a dx9 software renderer vertex/pixel shader compiler for a bit now. We break everything down into scalar arithmetic like you do above, perform all the optimization work there, then expand that out with SSE to do four elements at a time. With loops and conditional execution this can get a bit tricky and painful (masking stinks on SSE), but for single basic blocks it can really speed things up. Especially on the Core 2 Duos which have full 128-bit SSE ops. AMD's Barcelona is supposed to have this also.

"Because the shader uses x87 floating point, I can't use MMX in that routine and thus it's quite slow at the moment."

My experiences with processors since the P3 usually have emms being really cheap. (A great reference for this is Agner's site.) My guess is you could MMX/SSE that sampling routine and really speed it up even if you're using x87 elsewhere. If you're using MSVC, you'll have to do SSE2 with intrinsics or real asm, since those turkeys don't support inline asm or MMX intrinsics. Grr. This DDJ article on the Pixomatic texture sampler might be helpful if you haven't seen it, too -- specifically, Listing Three. It was done a while ago, so we didn't use any SSE2 there, but maybe it'll give you some ideas on speeding yours up.

Here are a couple of decent papers/links which I found helpful while going through this compiler work also:

A Simple Graph-Based Intermediate Representation
Cliff Click, Michael Paleczny

Improvements to Linear Scan register allocation
Alkis Evlogimenos

If you ever get into loops and control statements, I really liked this one also:

Single-Pass Generation of Static Single Assignment Form for Structured Languages
M. M. Brandis and H. Mössenböck

Nice work Björn. I'm impressed with how much time you must have put into this.

Michael Sartain - 17 07 07 - 15:02

Oops - I should have clarified x64 on that comment about MSVC not supporting MMX intrinsics or inline asm. Just means I usually wind up writing two versions of everything that uses MMX (asm or sse2 for x64 and mmx/sse/sse2 for x86) which is a bit of a pain.

Michael Sartain - 17 07 07 - 15:10

What about implementing a superset of HLSL? People could re-use their code/experience in HLSL, and it would even be possible to write a DirectX10 pixel shader driver, or at least re-use the API for those who are using the pixel shader for gpgpu coding. If you made it take advantage of the multicore processors that are coming out, it might even outperform GPUs for gpgpu programming since you wouldn’t have the memory bandwidth limitations, but you would have all of the advantages of vector processing (sort of like a virtual Cray super computer).

David - 09 10 07 - 10:14

are there any plans to support conditional statements and add min() and max() and other standard functions?

DT - 18 05 08 - 19:03

Not at this point... vdshader got complicated really fast for what I wanted to do with it, both in the implementation and actually using it. Hooking up real hardware shaders to the VirtualDub pipeline is what I've been experimenting with lately, but there are problems in terms of what shader language to use, what API to use, handling borders, etc. 3D in general isn't really geared to image processing. :(

Phaeron - 18 05 08 - 23:03
