GPU acceleration of video processing
I've gotten to a stable enough point that I feel comfortable revealing what I've been working on lately, which is GPU acceleration for video filters in VirtualDub. This is something I've been wanting to try for a while. I hacked a video filter to do it a while back, but it had two severe problems: (a) it only supported RGB32, and (b) it was forced to upload and download frames immediately around each filter instance. The work I've been doing in the past year to support YCbCr processing and to decouple the video filters from each other cleaned up the filter system enough that I could actually put in GPU acceleration without significantly increasing the entropy of the code base.
There are two problems with the current implementation.
The first problem is the API that it uses, which is Direct3D 9. I chose Direct3D 9 as the baseline API for several reasons:
- It's the API I'm most familiar with, by far.
- The debug runtime is much more thorough than what I've had available with other APIs.
- PIX and NVPerfHUD are free.
- It runs on just about any modern video card.
- Shaders have well-defined profiles, are portable between graphics card vendors, and use standardized byte code.
On top of this sit a 3D portability layer and then the filter acceleration layer (VDXA). The API for the low-level layer is designed so that it could be retargeted to Direct3D 9Ex, D3D10 and OpenGL; the VDXA layer is much more heavily restricted in feature set, but also adds easier-to-use 2D abstractions on top. The filter system in turn has been extended so that it inserts filters as necessary to upload or download frames from the accelerator and can initiate RGB<->YUV conversions on the graphics device. So far, so good...
...except for getting data back off the video card.
There are only two ways to download non-trivial quantities of data from the video card in Direct3D 9: (1) GetRenderTargetData() and (2) lock and copy. In terms of speed, the two methods are slow and pathetically slow, respectively. GetRenderTargetData() is by far the preferred method nowadays, as it is decently well optimized and can copy down 500MB/sec or more on any decent graphics card. The problem is that it is impossible to keep the CPU and GPU smoothly running in parallel if you use it, because it blocks the CPU until the GPU completes all outstanding commands. The result is that you spend far more time blocking on the GPU than actually doing the download, and your effective throughput drops. The popular suggestion is to double-buffer render target and readback surface pairs, but as far as I can tell this doesn't help, because you'll still stall on any new commands that are issued even if they go to a different render target. This means that the only way to keep the GPU busy is to sit on it with the CPU until it becomes idle, issue a single readback, and then immediately issue more commands. That sucks, and to circumvent it I'm going to have to implement another back end to see if another platform API is faster at readbacks.
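To put rough numbers on why the stall hurts more than the copy itself, here is a toy model in Python. It is only a sketch under stated assumptions: the function name, the per-frame costs, and the idea that a blocking readback fully serializes CPU work, GPU work, and the copy are all hypothetical simplifications, not measurements from VirtualDub or from any driver.

```python
def frames_per_second(gpu_ms, copy_ms, cpu_ms, blocking):
    """Toy estimate of frames/sec for one render plus one readback per frame.

    All parameters are hypothetical per-frame costs in milliseconds:
      gpu_ms  - time the GPU spends processing the frame
      copy_ms - time to copy the finished frame back to system memory
      cpu_ms  - CPU work per frame (decoding, issuing commands, encoding)
    """
    if blocking:
        # Assumption: the readback waits for the GPU to drain and the CPU
        # cannot queue new work while it waits, so the stages run back to back.
        frame_ms = cpu_ms + gpu_ms + copy_ms
    else:
        # Idealized pipelining: the stages overlap fully, so the slowest
        # stage sets the pace.
        frame_ms = max(cpu_ms, gpu_ms, copy_ms)
    return 1000.0 / frame_ms


# A 640x480 RGB32 frame copied at 500 MB/sec (the rate quoted above).
frame_bytes = 640 * 480 * 4
copy_ms = frame_bytes / 500e6 * 1000.0

# 5 ms of GPU work and 5 ms of CPU work per frame are made-up figures.
blocked = frames_per_second(5.0, copy_ms, 5.0, blocking=True)
overlapped = frames_per_second(5.0, copy_ms, 5.0, blocking=False)
print(f"blocking readback: {blocked:.0f} fps, ideal overlap: {overlapped:.0f} fps")
```

With these made-up costs the copy takes about 2.5 ms, yet the serialized version ends up at well under half the frame rate of the ideal overlapped one, because most of the lost time is spent waiting rather than transferring.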
The other problem is that even after loading up enough filters to ensure that readback and scheduling are not the bottlenecks, I still can't get the GPU to actually beat the CPU.
I currently have six filters accelerated: invert, deinterlace (yadif), resize, blur, blur more, and warp sharp. At full load, five out of the six are faster on the CPU by about 20-30%, and I cheated on warp sharp by implementing bilinear sampling on the GPU instead of bicubic. Part of the reason is that the CPU has less of a disadvantage on these algorithms: when dealing with 8-bit data using SSE2, it has 2-4x the bandwidth that it has with 32-bit float data, since the narrower data types have 2-4x more parallelism in 128-bit registers. The GPU's texture cache also isn't as advantageous when the algorithm simply walks regularly over the source buffers. Finally, the systems I have for testing are a bit lopsided in terms of GPU vs. CPU power. For instance, take the back-of-the-envelope calculations for the secondary system:
- GPU (GeForce 6800): 2600Mpix/sec * 4 components/vector = 10.4 billion operations/sec
- CPU (Pentium M): 1.86GHz * 8 components / vector / clock = 14.9 billion operations/sec
It's even worse for my primary system (which I've already frequently complained about):
- GPU (Quadro NVS 140M): 3200Mpix/sec * 4 components / vector = 12.8 billion operations/sec
- CPU (Core 2): 2.5GHz * 16 components / vector / clock = 40 billion operations/sec (single core)
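For anyone who wants to plug in their own hardware, the back-of-the-envelope arithmetic can be sketched in a few lines of Python. The helper names are hypothetical, and the inputs are just the peak figures quoted in the lists above; this shows the calculation, not a benchmark.

```python
REGISTER_BITS = 128  # width of an SSE2 register


def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one SSE2 register."""
    return REGISTER_BITS // element_bits


def gpu_gops(mpix_per_sec, components=4):
    """Peak GPU rate in billions of component operations per second."""
    return mpix_per_sec * 1e6 * components / 1e9


def cpu_gops(clock_ghz, components_per_clock):
    """Peak CPU rate in billions of component operations per second."""
    return clock_ghz * components_per_clock


# 8-bit and 16-bit lanes versus 32-bit float lanes: the source of "2-4x".
print("lanes per register:", simd_lanes(8), simd_lanes(16), simd_lanes(32))

# Peak figures taken from the two lists above.
systems = {
    "secondary (GeForce 6800 / Pentium M)": (gpu_gops(2600), cpu_gops(1.86, 8)),
    "primary (Quadro NVS 140M / Core 2)": (gpu_gops(3200), cpu_gops(2.5, 16)),
}
for name, (gpu, cpu) in systems.items():
    print(f"{name}: GPU {gpu:.1f} vs CPU {cpu:.1f} Gops/sec, CPU/GPU = {cpu / gpu:.2f}")
```

On these peak numbers the CPU comes out ahead by roughly 1.4x on the secondary system and a bit over 3x on the primary one, before any upload or download overhead is counted.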
There are, of course, a ton of caveats in these numbers, such as memory bandwidth and the relationship between theoretical peak ops and pixel throughput. The Quadro, for instance, is only about half as fast as the GeForce in real-world benchmarks. Still, it's plausible that the CPU isn't at a disadvantage here, particularly when you consider the extra overhead in uploading and downloading frames and that some fraction of the GPU power is already used for display. I need to try a faster video card, but I don't really need one for anything else, and more importantly, I no longer have a working desktop. But then again, I could also get a faster CPU... or more cores.
The lesson here appears to be that it isn't necessarily a given that the GPU will beat the CPU, even if you're doing something that seems GPU-friendly like image processing, and particularly if you're on a laptop where the GPUs tend to be a bit underpowered. That probably explains why we haven't seen a huge explosion of GPU-accelerated apps yet, although they do exist and are increasing in number.