§ Filter system changes in VirtualDub 1.9.1

A major part of the work I did in VirtualDub 1.9.1 was to rework the filter system in order to support filters running in an N:1 frame configuration rather than lock-step 1:1 mode. This is particularly important for filters that need to reference a frame window, which is tricky to do in earlier versions. Making this change involved a lot more rework than I had anticipated and I learned a lot along the way.

To review, the filter system in 1.9.0 and earlier is based on a lock-step pipeline model where one frame goes in, each filter runs exactly once, and one frame comes out:

frame = input[i];
for each filter F:
    frame = F.run(frame);
output[i] = frame;
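The lock-step model can be sketched as runnable code (Python for brevity; VirtualDub itself is C++, and the filter names here are invented for illustration):

```python
def run_lock_step(chain, frames):
    """Run each filter exactly once per frame: strict 1:1 in/out."""
    out = []
    for frame in frames:
        for f in chain:
            frame = f(frame)
        out.append(frame)
    return out

# Two toy filters operating on lists of pixel values.
brighten = lambda px: [v + 1 for v in px]
double   = lambda px: [v * 2 for v in px]

print(run_lock_step([brighten, double], [[10, 20], [30, 40]]))
# -> [[22, 42], [62, 82]]
```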

This is relatively simple, and also memory efficient: only two main frame buffers are ever active at a time, and thus they can be overlapped to save space. In 1.4.x, this was extended to accommodate internal delays by tracking the delays of all the filters and compensating for them at the end. In 1.8.x, this was further extended to support frame rate changing, resulting in the slightly more complex flow:

delay = 0;
for each filter F in reverse:
    i = F.prefetch(i);
    delay += F.get_delay();
frame = input[i];
for each filter F:
    frame = F.run(frame);
output[i + delay] = frame;
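The reverse prefetch pass can be sketched concretely (Python; HalfRate and TemporalSmooth are invented stand-ins for real filters):

```python
def resolve_source_index(chain, i):
    """Walk the chain in reverse, letting each filter remap the
    requested frame index and report its processing delay."""
    delay = 0
    for f in reversed(chain):
        i = f.prefetch(i)
        delay += f.get_delay()
    return i, delay

class HalfRate:
    """Halves the frame rate: output frame i comes from input frame 2i."""
    def prefetch(self, i): return i * 2
    def get_delay(self): return 0

class TemporalSmooth:
    """Passes indices through but has one frame of internal lag."""
    def prefetch(self, i): return i
    def get_delay(self): return 1

src, delay = resolve_source_index([HalfRate(), TemporalSmooth()], 5)
print(src, delay)  # -> 10 1
```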

A lot could be done with this model, but there were several shortcomings -- filters that changed the frame rate tended to do ugly things to upstream filters by requesting duplicate frames, and lag was only compensated for during rendering, not on the timeline. In addition, the need for filters to buffer frames internally to support windows added complexity and slowed down rendering.

I should point out some of the similarities to Microsoft DirectShow at this point. DirectShow also uses a push model, but it has two additional weapons that improve its performance. One is that each filter can control whether it passes frames to the downstream and how many frames are passed, which accommodates variable frame rates. The second is that it uses refcounted buffers with flexible allocators, which means that filters can cache frames as they pass through simply by calling AddRef() on them. I had thought about switching to a model like this, but decided against it as it still had some of the same shortcomings with regard to random access and serialization between filters.

In 1.9.1, the filter system was rewritten from a push model to a pull model, with each filter being modeled as a function with a frame cache after it. The pipeline then runs a series of independent filters that all use logic like this:

for each new request:
    if the requested frame is in the cache:
        return the cached frame
    build a list of source frames needed
    issue a prefetch request to the upstream filter for each source frame
    queue the request
    wait until a request in the queue is ready
    if any source frames have failed on the request:
        fail this request
    process the frame
    cache the result

This model is similar to that used by Avisynth, in which each filter fetches frames from the upstream and then pushes a result frame downstream through a cache. The main difference, however, is that fetching source frames and processing those into an output frame are decoupled. The reason I did this is that it allows filters to run in parallel, which is particularly important for the head of the filter chain where frames are read from disk and you don't want to stall frame processing on I/O. The 1.9.1 pipeline queues up to 32 frame requests with default settings, which keeps the I/O thread busy without stalling the processing thread the way earlier versions did.
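A minimal sketch of one pull-model stage, with the cache and the decoupled prefetch/process steps (Python; class and method names are invented, not the actual VirtualDub API):

```python
class FilterNode:
    """One stage in the pull pipeline: a processing function with a
    frame cache after it. Prefetching sources and processing them are
    separate steps, so I/O and computation can overlap."""
    def __init__(self, source, needed, process):
        self.source = source        # upstream node (or raw frame store)
        self.needed = needed        # output index -> list of source indices
        self.process = process      # list of source frames -> output frame
        self.cache = {}
        self.pending = []

    def request(self, i):
        if i in self.cache:                       # cache hit: no work
            return
        srcs = self.needed(i)
        for s in srcs:                            # prefetch upstream
            self.source.request(s)
        self.pending.append((i, srcs))

    def run_one(self):
        i, srcs = self.pending.pop(0)
        frames = [self.source.fetch(s) for s in srcs]
        self.cache[i] = self.process(frames)

    def fetch(self, i):
        while i not in self.cache:
            self.run_one()
        return self.cache[i]

class RawSource:
    """Stand-in for the disk reader at the head of the chain."""
    def __init__(self, frames): self.frames = frames
    def request(self, i): pass
    def fetch(self, i): return self.frames[i]

# A 2-frame temporal average over a toy "clip" of scalar frames.
src = RawSource([0, 10, 20, 30])
avg = FilterNode(src, lambda i: [i, i + 1],
                 lambda f: (f[0] + f[1]) / 2)
avg.request(1)
print(avg.fetch(1))  # -> 15.0
```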

Those of you who have been tracking the VirtualDub source may have noticed a period of time when I had attempted to rewrite the filter system in this fashion before and had shipped it as unused modules in the source. The main problem with my earlier attempt was that I had tried to make the entire filter system multithreaded, which turned into a nightmare, both in terms of stability and memory usage. The stability problem comes with the need to make nearly everything thread-safe, which is difficult when you run into situations like a filter request completing on another thread while the RequestFrame() function is still running. The memory usage problem comes into play when upstream filters are able to produce frames faster than the downstream can consume them -- more on that later. A third problem is that although I got the new filter system running, it was never completely integrated into the rest of the engine, nor did it ever get to the point of fully supporting the existing filter API. A new and better system that doesn't cleanly interface with anyone who needs to use it is pretty useless. As a result, I eventually abandoned that branch and continued evolving the existing filter pipeline through 1.9.0.

There are, of course, a number of subtleties to getting a system like this working.

In-place filters. VirtualDub supports two main buffering modes for video filters, swap and in-place. In swap mode, filters are asked to process a source frame buffer to a separate destination buffer; in in-place mode, the filter receives only one frame buffer and modifies the pixels in-place. The choice of mode is up to the filter and whichever one is better depends on the filter algorithm. When it is convenient to implement, in-place mode is more efficient because it only uses one buffer, which means less memory for the CPU caches to deal with and fewer pointers to maintain in the inner loop. Filters that are purely a function on each pixel value or are doing simple rendering on top of the video can usually work this way. Caching frames throws a kink into this mode, however, because in-place processing destroys the source frame and that means it can't be cached. This can be a severe performance problem if two or more clients are pulling frames from a filter and one of the downstream clients is an in-place filter. 1.9.1 solves this by requiring that frame requests indicate whether a writable frame is needed and through a predictor that tracks frames even after they've been evicted from the cache. If within a certain window more than one request arrives for a frame and at least one of the requests is a writable request, the predictor marks the frame as shareable and in-place filters do a copy instead of stealing the source buffer.
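The predictor's logic can be sketched as follows (Python; the window size and data structures are illustrative assumptions, not the actual implementation):

```python
class SharePredictor:
    """Track recent requests per frame, even after cache eviction.
    If a frame draws more than one request within the window and at
    least one of them wants to write, mark the frame shareable so
    in-place filters copy instead of stealing the buffer."""
    def __init__(self, window=16):
        self.window = window
        self.history = {}          # frame index -> [(tick, writable), ...]
        self.shareable = set()
        self.tick = 0

    def on_request(self, frame, writable):
        self.tick += 1
        recent = [r for r in self.history.get(frame, [])
                  if self.tick - r[0] <= self.window]
        recent.append((self.tick, writable))
        self.history[frame] = recent
        if len(recent) > 1 and any(w for _, w in recent):
            self.shareable.add(frame)

p = SharePredictor()
p.on_request(7, writable=False)
p.on_request(7, writable=True)    # second client wants to modify in place
print(7 in p.shareable)  # -> True
```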

Caching and memory usage. Caching is wonderful for avoiding redundant work, but Raymond Chen reminds us that a bad caching policy is another name for a memory leak. Doing a little more caching than necessary is fine when you're dealing with 100 byte strings; it's a bit more problematic when you are caching 3MB video frames. In 1.9.1, the frame caches are primarily intended to avoid redundant frame fetching at the local level, and thus have an aggressive trimming policy: the allocators periodically track the high watermark for referenced frames and continuously trim down to the working set without allowing for speculative caching. This results in memory usage close to 1.8.8/1.9.0, and I have some allocator merging improvements in 1.9.2 to improve this further. I may allow for speculative caching in the future, but I'm not a fan of the "use 50% of physical memory" method of caching -- that generally leads to wasteful memory usage and also pretty bad swapping if three applications each decide to take 50%. Instead, I'd probably try to borrow some algorithms from virtual memory literature in order to predict cache hit rates based on past allocation patterns, since tracking frame requests is cheap compared to storing and processing the frames themselves.
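Watermark-based trimming might look like this in miniature (Python; an illustrative sketch of the policy described above, not the actual allocator):

```python
class FramePool:
    """Aggressively trimmed buffer pool: record the high watermark of
    referenced frames over an interval, then trim the free list so the
    total buffer count never exceeds the recently observed working set."""
    def __init__(self):
        self.free = []
        self.referenced = 0
        self.watermark = 0

    def acquire(self):
        self.referenced += 1
        self.watermark = max(self.watermark, self.referenced)
        return self.free.pop() if self.free else bytearray(4)  # toy "frame"

    def release(self, buf):
        self.referenced -= 1
        self.free.append(buf)

    def trim(self):
        # Keep only enough free buffers to cover the observed peak,
        # then reset the watermark for the next interval.
        keep = max(0, self.watermark - self.referenced)
        del self.free[keep:]
        self.watermark = self.referenced

pool = FramePool()
bufs = [pool.acquire() for _ in range(4)]   # peak working set: 4 frames
for b in bufs:
    pool.release(b)
pool.trim()        # first interval: peak was 4, keep all 4 free buffers
pool.trim()        # no activity since: working set is 0, trim everything
print(len(pool.free))  # -> 0
```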

Frame allocation timing. As I noted earlier, VirtualDub prefetches multiple frames in advance in order to keep the pipelines full. Initially I had rigged the filter system to allocate result frame buffers as soon as the requests came in, and that was a huge mistake as it caused the application to exceed 300MB when all 32 frame requests immediately allocated frame buffers all through the filter chain. The key to solving this turned out to be twofold: allocate result buffers on the fly as frames are processed, and always give downstream filters priority in execution order. The combination of having the upstream filters allocate frames as late as possible and the downstream filters process and release them as soon as possible results in memory usage that is no longer proportional to the number of requests in flight. Note that this simple strategy only works if only one filter can run at a time, as in the parallel case an upstream filter can continue to run and chew through frames -- that's a bridge I'll have to cross later.
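The effect of the timing change can be shown with a toy model (Python; the numbers are illustrative, not measurements):

```python
def peak_buffers(requests, late_alloc):
    """Toy model of result-buffer lifetime. Eager: every queued request
    allocates its buffer up front, so all of them are live at once.
    Late: a buffer is allocated just before processing, and the
    downstream (which runs with priority) frees it immediately after."""
    live = peak = 0
    if not late_alloc:
        live = requests            # all buffers allocated at request time
        peak = live
    else:
        for _ in range(requests):
            live += 1              # allocate at processing time
            peak = max(peak, live)
            live -= 1              # downstream consumes and frees at once
    return peak

print(peak_buffers(32, late_alloc=False), peak_buffers(32, late_alloc=True))
# -> 32 1
```

With late allocation and downstream priority, peak buffer count stops scaling with the number of requests in flight, which is exactly the property described above.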

Memory allocation strategy. The easiest way to implement the frame allocators is simply to use new/malloc to allocate the frame buffers. If you try that, though, you quickly find that fragmentation and expansion of the memory heap is a problem. Early 1.9.1 builds used that strategy, and the result was that VirtualDub's memory usage very quickly exceeded 50MB and stayed there even after it had shut down the filter chain. The final version uses VirtualAlloc() for allocating frames over a certain size, which largely sidesteps the problem since the OS is forced to commit and decommit pages; this is fairly easy since the buffers are fixed size, and the allocators recycle buffers to avoid excessive allocation traffic. Small buffers still go into the heap, which I may fix at some point with bundling.
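The threshold-plus-recycling scheme can be sketched portably (Python, with an anonymous mmap standing in for VirtualAlloc; the threshold constant is an assumption for illustration):

```python
import mmap

LARGE_THRESHOLD = 64 * 1024  # assumed cutoff, purely illustrative

class FrameAllocator:
    """Size-threshold allocator sketch: large frame buffers come from
    page-granular OS allocations (mmap here, standing in for
    VirtualAlloc on Windows), small ones from the ordinary heap.
    Fixed-size buffers are recycled to avoid allocation traffic."""
    def __init__(self):
        self.recycled = {}   # size -> list of free buffers

    def alloc(self, size):
        if self.recycled.get(size):
            return self.recycled[size].pop()
        if size >= LARGE_THRESHOLD:
            return mmap.mmap(-1, size)   # anonymous page-backed mapping
        return bytearray(size)           # small buffer: plain heap

    def free(self, buf):
        self.recycled.setdefault(len(buf), []).append(buf)

a = FrameAllocator()
big = a.alloc(640 * 480)          # 0x4B000 bytes: takes the page-backed path
a.free(big)
print(a.alloc(640 * 480) is big)  # -> True (recycled, no new mapping)
```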

I have some ideas on future work in the filter system, too. As usual, I can provide no guarantees as to if or when any of this might be implemented.

Bugs. Yeah, it's buggy in some areas, most notably filters that declare a lag (delay). The known issues will be fixed in 1.9.2 and definitely before the 1.9.x branch goes stable. (Side note: It turns out I forgot to cross-integrate the 1.8.8 fixes into 1.9.1. Oopsie.)

Multi-threading. The VirtualDub filter API doesn't allow an individual filter instance to be run in parallel, but it does allow separate instances within the same chain to execute concurrently, because filter instances never talk to each other. Current versions don't do this and serialize everything except disk I/O in the frame fetcher. I once tried to multithread the filter system by making the entire filter system thread-safe, which was a mess I'm not keen to repeat. The way I would try to do it now would be to keep the entire filter system single-threaded, including all frame management and sequencing, and only farm out individual calls to runProc() on filters.
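The "single-threaded sequencing, parallel processing" idea might be sketched like this (Python; the filter callables are toy stand-ins for runProc()):

```python
from concurrent.futures import ThreadPoolExecutor

def run_chain_step(instances, frames, pool):
    """The main thread alone decides which (instance, frame) pairs are
    runnable this step -- all frame management and sequencing stays
    single-threaded -- and only the per-filter processing calls are
    farmed out to workers. Instances never see each other's state."""
    futures = [pool.submit(inst, frame)
               for inst, frame in zip(instances, frames)]
    return [f.result() for f in futures]   # join back on the main thread

blur    = lambda px: [v // 2 for v in px]
sharpen = lambda px: [v * 3 for v in px]

with ThreadPoolExecutor(max_workers=2) as pool:
    out = run_chain_step([blur, sharpen], [[8, 4], [1, 2]], pool)
print(out)  # -> [[4, 2], [3, 6]]
```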

32-bit/64-bit interop. An annoyance with 64-bit applications in general on Windows is that they can't use 32-bit DLLs. That includes codecs and filters in this case. One thing that would be nice would be the ability to use 32-bit filters from the 64-bit version. Doing this requires a mechanism for interprocess communication (IPC), as well as getting the frames across the process barrier. Copying frames through IPC is an expensive proposition, so using a shared memory mapping between the processes is likely the way to go. That requires that the frames be allocated through CreateFileMapping(), which is a big change from heap allocation, but trivial if they're currently allocated via VirtualAlloc() (see "memory allocation strategy" above).

Frame caching to disk. It'd be nice to be able to cache frames to disk when multiple passes are required and the cost of reading and writing a frame to disk is a lot lower than computing it. Unfortunately, this is difficult to do currently because although a filter can tell what requests may be coming via the prefetch function, it isn't able to track which of those are in flight. I'd need to add some sort of tagging system to allow this.

Avisynth compatibility. This is something that I kept in mind, although I haven't actually tried to do it. Currently Avisynth is able to run some older VirtualDub filters through a wrapper on the Avisynth side, although it doesn't support the new prefetch features. I tried to make the new API compatible such that it would be possible to make a dual mode filter that works directly as an Avisynth filter through a layer that calls the prefetch half of the filter, fetches the frames, and then runs the processing half. This layer may be something I add to the Plugin SDK at a later date. Going the other way -- VirtualDub natively running Avisynth filters -- is more difficult since the merged fetch/process GetFrame() function is incompatible with the split prefetch/process model. I had experimented in the past with using fibers to suspend and resume Avisynth filters with some success, but doing this fully requires the ability to do a late prefetch from the runProc() function and push the frame back into the waiting queue, which the filter system currently doesn't support.

3D hardware acceleration. I've wanted to come up with a general API for this for a long time, but couldn't ever come up with something I liked. Multithreading support is a preferred dependency for getting this running, but the main problems are (a) which API to use and (b) shaders. I don't like either OpenGL or Direct3D straight for 2D work as there are too many sharp corners and opportunities for API usage errors, so I'd really like to wrap it. Shaders, however, throw a huge kink into the works because I haven't found a shader compilation path that would work well. I don't like the idea of having filters embed Direct3D shader bytecode, and for various reasons I do not want to have a dependency on D3DX or Cg in the main application. GLSL is promising, but I've found that OpenGL implementations on Windows tend to be lousy in general at reporting errors.


Comments posted:

> Memory allocation strategy. The easiest way to implement the frame allocators is simply to use new/malloc to allocate the frame buffers. If you try that, though, you quickly find that fragmentation and expansion of the memory heap is a problem. Early 1.9.1 builds used that strategy and the result that VirtualDub very quickly exceeded 50MB+ and stayed there even after it had shut down the filter chain. The final version uses VirtualAlloc() for allocating frames over a certain size

Any decent malloc will use non-heap space like VirtualAlloc (or the rough equivalent in *nix, mmapping /dev/zero) for larger allocations to avoid this problem. The CRT allocator seriously doesn't do this?

Glenn Maynard - 19 04 09 - 20:57

"An annoyance with 64-bit applications in general on Windows is that they can't use 32-bit DLLs."

Do you know of any other 64-bit OS that automatically links 32-bit shared libraries in with a running 64-bit executable? I can't think of one off the top of my head.

Trimbo (link) - 19 04 09 - 23:01

@Trimbo: I think the main problem here isn't exactly that Windows can't allow a 64-bit process to link to a 32-bit library, but more a matter of running the relevant code from the DLL in a way that can easily be piped into the 64-bit process, something a wrapper can handily do. rundll32.exe comes to mind: if memory serves me, you can't easily use rundll32.exe and a library to pipe data to a 64-bit process under Windows 64, while POSIX platforms allow a 32-bit process and a 64-bit process to communicate rather easily through pipes (not that it's fool-proof, mind you).

Mitch 74 (link) - 20 04 09 - 06:07

@Glenn Maynard:
It's actually the OS HeapAlloc() function that is responsible. I believe it does switch to dedicated heap segments for allocations at some point, but I don't know the threshold or whether it uses a different lifetime heuristic. It may also vary by OS; I was testing on XP. There's also the matter that a mallocator necessarily has to place a header at the start of the allocation, and frame sizes are frequently exactly multiples of 4K, e.g. 640x480 = 0x4B000, meaning that in Win32 you burn an extra committed 4K page and up to 64K of extra address space with a dedicated allocation.

@Mitch 74:
You can do pipes between 64-bit and 32-bit processes on Windows, as well as sockets, and shared memory (file mappings). The common problem is that all of those are mechanisms for exchanging data, not control flow. To do control flow you either have to put a marshaling layer on top or use a preexisting one. COM can do 32/64 interop, but I hate COM enough to avoid it.

Phaeron - 21 04 09 - 02:26

If you wanted to use D3DX and the issue would be requiring to bundle it, you could use WineD3D code on Windows except ddraw.dll from Wine to replace the regular D3D9 code that VirtualDub would require. If you did want to include wine ddraw.dll, you would need to include mesa3d built as opengl32.dll. Oddly enough, Mesa can be built to use the following Win32 driver outputs: DirectX, GDI, and ICD. I do not know if GDI is hardware accelerated, but the DirectX driver might help.

Additionally, if you wanted to use GLSL and OGL entirely, using Mesa directly and having the Mesa library built to use DirectX would sidestep the error problem completely.

King InuYasha (link) - 21 04 09 - 16:38

D3D and D3DX are not the same thing. D3DX is the Direct3D utility library which contains the shader compiler, which is a component that Wine doesn't have. It doesn't normally need it, because right now everyone redistributes it as needed. For various reasons, however, I'd like to avoid this -- the least of which being that the 32-bit version alone is twice the size of my app. There have been some efforts to reimplement parts of D3DX in Wine, but they are very rudimentary at this point, such as vector library calls. I'm much farther along with VDShader, although that's still pretty far from complete, given that it doesn't support half the intrinsics or even loops.

There's no point in using Mesa when I could just use the IHV OpenGL driver and GLSL. Problem is, I'm not very hot on OpenGL at this point given its extension hell and lack of good error reporting, and GLSL would provide absolutely no way to do software emulation, which would be something I'd be interested in possibly supporting. Running GLSL in software emulation requires an entire shader compiler. In contrast, running a ps2.0 shader on the CPU requires only a relatively simple shader bytecode parser and instruction emulator.

Phaeron - 22 04 09 - 00:43

What about OpenCL? Wait, let me guess... Same issues as using Mesa/GLSL or Cg.

The only ones I can think of are CUDA, OpenCL, GLSL, HLSL, Cg, and TGSI. Personally, I would like to see the cross-platform technologies supported, but I'm doubting it. Windows does purposely have rather poor support of OpenGL in order to coerce people to use DX based technologies, and the hardware vendors now design their hardware around DirectX instead of OpenGL, tacking on OGL support at the last minute really.

I wonder though, if VirtualDub and its shader compiler was built with LLVM, would the performance get better?

King InuYasha (link) - 22 04 09 - 21:46

I haven't looked much into OpenCL, especially since publicly available support for it is still forthcoming (particularly from NVIDIA).

LLVM wouldn't help much, since it's at the back-end level -- the part of a compiler which converts internal intermediate code to usable code. The front-end, which parses the actual language to intermediate code, is the harder part. This is particularly true of pixel shader 2.0, where control flow is restricted and writing an interpreter for the bytecode is relatively straightforward. I already have a back-end JITter, so executing ps2.0 bytecode is simply a matter of hoisting the bytecode to IL.

I take back what I said about Mesa... it might actually have a GLSL-to-ARBvp/fp compiler. I'm not sure that embedding a shader compiler in VirtualDub would be a good idea, though. Seems like, uh, not core functionality.

Phaeron - 23 04 09 - 16:06

"There's also the matter that a mallocator necessarily has to place a header at the start of the allocation"

Allocations greater than a page or so almost never have a header, even if your malloc usually uses a header for small requests. An implementation can also avoid the use of headers completely by storing small allocations of different sizes in different pages (see phkmalloc and jemalloc). They keep track of the allocation size and the free bitmap of each page in a separate structure.

Dan Nelson - 24 04 09 - 15:24

As of Wine 1.1.20, the d3dx9_*.dll implementation is totally complete. I don't know if that is what you meant by the D3DX utility libraries, but meh. The Wine implementation of d3dx9_*.dll is different from the official ones in that all of them are stub forwarders to d3dx9_36.dll, which made it easier for developers to work with it.

Mesa does have the GLSL compiler, as Mesa is OpenGL 2.1 compatible, meaning it absolutely has to support GLSL. The problem was that the DRI drivers did not support OGL 2.1, rather they supported 1.4 with some extensions. Supposedly, Gallium3D + GEM + DRI2 + LLVM is supposed to bring GL support up to either 2.1 or as some people hope, 3.0/3.1. I personally think that OGL 2.1 is more realistic than OGL 3.0/3.1. But, we'll see.

King InuYasha (link) - 24 04 09 - 18:41

@Dan Nelson:
Yeah, I suppose a mallocator could do that, although it's slower. The Windows allocator certainly doesn't.

@King InuYasha:
Wine has DLLs that contain all of the entry points. The actual implementations of those functions are far from complete.

Phaeron - 25 04 09 - 00:47

...My forum account is still in "Validation" state, so another feature request here:

I think it is now quite possible to implement PULLDOWN in a VD filter
(Namely, in "Interpolate". Though the "Convert Frame Rate" might have been a better name, since "interpolate" is more associated with resizing photos, to my mind.)

Could you please put it in your plans?

Pulldown with at least the following options -->

Input = Progressive [output TFF/output BFF] / Interlaced [in TFF/in BFF]
Convert FR:
PAL/NTSC Film -> NTSC Video (3:2 pulldown)
PAL Video -> NTSC Video (4:2 pulldown)

...and maybe "PAL/NTSC film --> PAL Video" (24:1 pulldown ?)(with some roundings probably)...


Jam_One (link) - 02 05 09 - 08:10

Emulated OpenGL 3.0 and GLSL 1.30 on nVidia cards:

ale5000 - 02 05 09 - 15:30

(In my post above I imagined pulldown pattern in form of {{original frames : generated frames}}
(not fields)
which you may find incorrect.)

Jam_One (link) - 03 05 09 - 08:45
Televisions create their image by drawing (scanning) lines of light on the CRT face, left to right, top to bottom...

...The “top” field is the odd numbered scan lines: 1, 3, 5…, while the “bottom” field is the even numbered scan lines: 2, 4, 6...

The IVTC internal filter declares "Even field first = TFF" and "Odd field first = BFF".
So, which is the top and which is the bottom ???

(It d.o.e.s. matter, really does.)

Jam_One (link) - 03 05 09 - 14:57

The TFF/BFF settings in VirtualDub refer to the order in which fields are arranged in patterns along the time axis, not their relative position on screen. Regardless of this setting, the even field always corresponds to even numbered scanlines, with the first scan line of the even field being slightly above the first scan line of the odd field.

In the case of the IVTC filter, the TFF/BFF setting is required as the telecine patterns are different depending on field dominance.

Phaeron - 03 05 09 - 15:30

So, VirtualDub's model assumes, due to some theory, that scanlines are to be counted/numbered from the bottom up to the top ?...
Opposing the way the electronic ray travels in TV-cameras and CRT TV-sets?

Jam_One (link) - 03 05 09 - 16:23

It took me a while to figure out what you were referring to. No, VirtualDub doesn't count scan lines from the bottom up, it uses top down... but it counts starting with zero. That's probably responsible for the confusion here.

I wonder if I should just drop the even/odd part and refer solely to top and bottom fields.

Phaeron - 05 05 09 - 01:48

> it uses top down... but it counts starting with zero.

That's the idea I would probably never think of !

Thank you very much for your reply, it is really useful!
Now the VD logics is clear.


I did not care much about "exactness" of the VD's terminology concerning fields before, when I used to deal with material captured from TV and destined to be "kinda-XviD deinterlaced output". But now a little problem occurred, when I came to a need to describe to other people which field order my "video" has...
And the word "video" is in inverted commas for a reason that it is not quite a real video (with "flags, headers", etc.), but it is a TARGA image sequence. And since no flags or headers live in image sequence, I needed to use "human words" to understand exactly what material am I getting and "outputting"...

I see, from another point of view it could have been "confusing" - to understand what the heck am I talking about...

Thank you once again, Phaeron !

Jam_One - 05 05 09 - 09:28
