¶Exploring CUDA
After experimenting with pixel shaders for video acceleration, I decided to give NVIDIA's CUDA a shot to see how it performed on the same hardware. As I've noted before, Direct3D 9 can be used for video acceleration, but it has performance problems in specific areas. CUDA uses a more general programming model and also provides more direct access to the video driver, making it potentially more interesting for video acceleration.
The first thing I have to say about CUDA is that it's fairly easy to use. Kernels are written in C using a special nvcc pre-compiler that lets you mix host and device code; the pre-compiler automatically splits the two apart, compiles the host code with VC++ or GCC and the device code separately, and then links the two together. This means you can compile a CUDA application from the command line as easily as a standard Hello World application. Invoking a kernel on the device is also easy:
Filter<<<M, N>>>(dy, py, dx, px, 1024, 1024);
This special syntax looks just like a function call, with the added fields in angle brackets describing the grid and block dimensions for threading purposes. It's an asynchronous call, so it's easy to do the right thing and queue up kernel calls on the device, waiting only when you need to transfer data. The transfers are also easy, because CUDA has device analogs for the malloc(), free(), memcpy(), and memset() operations. The API is also nicely partitioned: the runtime API is suitable for writing CUDA applications directly, while the explicit driver API is better for integration into frameworks. You can even use PTX assembly language if you need to check the compiler's output or generate code for CUDA directly.
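To make that concrete, here's a minimal host-side sketch using the runtime API. It isn't my actual test program -- the buffer names and sizes are placeholders -- and it assumes the Filter kernel shown later in this post:

#include <cuda_runtime.h>

void RunFilter(const int *hostSrc, int *hostDst, int width, int height) {
    size_t bytes = (size_t)width * height * sizeof(int);
    int *devSrc, *devDst;

    // Device analogs of malloc()/memcpy()/free().
    cudaMalloc((void **)&devSrc, bytes);
    cudaMalloc((void **)&devDst, bytes);
    cudaMemcpy(devSrc, hostSrc, bytes, cudaMemcpyHostToDevice);

    // 16x16 threads per block, enough blocks to cover the image.
    dim3 threads(16, 16);
    dim3 blocks((width + 15) / 16, (height + 15) / 16);
    Filter<<<blocks, threads>>>(devDst, width * sizeof(int), devSrc, width * sizeof(int), width, height);

    cudaMemcpy(hostDst, devDst, bytes, cudaMemcpyDeviceToHost);

    cudaFree(devSrc);
    cudaFree(devDst);
}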
My first attempt at CUDA didn't give stellar performance, but the result was reasonable:
upload = 1.62035 ms (1.61613 ms)
process = 22.40966 ms (0.03101 ms)
readback = 9.19024 ms (31.57217 ms)
This is for a 3-tap horizontal filter on a 1024x1024 monochrome image, with 32-bit integer samples. The first set of numbers is for the GPU and the ones in parentheses are for the CPU. 22ms for the blur kinda sucks as it's 47 Mpixels/sec, but the upload and readback numbers are more interesting. 1.6ms for the upload corresponds to 2.6GB/sec, which is good; 9.2ms for the readback is a decent and workable 450MB/sec. Clearly there is some overhead to getting data on and off the GPU, but you don't have to go to extreme measures to crunch data. On the CPU side, the launch time for the kernel is a pleasing 0.03ms, but the upload and readback times are disappointing. The reason is that these are synchronous copies. Switching to the asynchronous APIs gives a different result:
upload = 1.60832 ms (0.01117 ms)
process = 22.39344 ms (0.03213 ms)
readback = 9.17840 ms (0.00391 ms)
With asynchronous copies, the CPU is no longer stalled at all and all API calls are fast. This bypasses the biggest problem with Direct3D 9, which is the synchronous GetRenderTargetData() call. The CUDA API also gives a clue as to why GRTD() stalls: asynchronous copies are only allowed if the host buffer is allocated as page-locked memory, which D3D9 probably does not support. That's a shame, given how much the numbers above improve.
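For reference, this is roughly what the asynchronous path looks like with the runtime API -- a sketch only, reusing the device buffers and launch dimensions from the earlier snippet; cudaMallocHost() is what gets you the page-locked memory:

// Page-locked (pinned) host memory is required for truly asynchronous copies.
int *hostSrc, *hostDst;
cudaMallocHost((void **)&hostSrc, bytes);
cudaMallocHost((void **)&hostDst, bytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

// All three calls return immediately; the work is queued on the device in order.
cudaMemcpyAsync(devSrc, hostSrc, bytes, cudaMemcpyHostToDevice, stream);
Filter<<<blocks, threads, 0, stream>>>(devDst, 1024 * sizeof(int), devSrc, 1024 * sizeof(int), 1024, 1024);
cudaMemcpyAsync(hostDst, devDst, bytes, cudaMemcpyDeviceToHost, stream);

// The CPU only blocks here, when the result is actually needed.
cudaStreamSynchronize(stream);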
Now, 22ms for a simple filter operation is slow enough to be fairly useless. Initially I was willing to write this off as another casualty of my wonderfully fast video card, but fortunately that's not the case. The CUDA kernel for the filter operation looked like this:
__global__ void Filter(int *D, int Ds, int *A, int As, int width, int height) {
    // D/A are the destination/source pointers; Ds/As are the row pitches in bytes.
    int x = threadIdx.x + 16*blockIdx.x;        // one thread per pixel, 16x16 blocks
    int y = threadIdx.y + 16*blockIdx.y;
    int *Dr = (int *)((char *)D + Ds * y);      // destination row y
    int *Ar = (int *)((char *)A + As * y);      // source row y

    if (x < width && y < height)
        Dr[x] = Ar[x] + Ar[x-1] + Ar[x+1];      // 3-tap horizontal filter
}
The good news is that it's basically C; the bad news is that the memory performance is abysmal. It turns out that access to global memory is very tricky in CUDA due to alignment restrictions on most hardware. If all the threads access memory in just the right order and at just the right alignment, you get really fast access. Get anything wrong and every single thread makes its own uncached transaction on the bus, leading to an order of magnitude drop in performance. One of the restrictions is that the set of accesses from a group of threads (a half-warp) has to start at a 64-byte boundary, which doesn't work when you're trying to access a contiguous group of elements from the same thread. The straightforward formulation above is therefore actually a really bad way to write a CUDA kernel, as it is completely bottlenecked on memory access. You can work around this with textures, which are cached, but the addressing is more annoying since only 1D textures can be addressed by integer index. The alternative is to copy data to temporary buffers with just the right block size and stride:
__shared__ int tmp[16][48];

__global__ void Filter(int *D, int Ds, int *A, int As, int width, int height) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int x = tx + 16*blockIdx.x;
    int y = ty + 16*blockIdx.y;
    int *Dr = (int *)((char *)D + Ds * y);
    int *Ar = (int *)((char *)A + As * y);
    int *t = tmp[ty];                   // this thread row's 48-element staging buffer

    // Aligned, coalesced 16-element moves from global into shared memory.
    t[tx]    = Ar[x-16];
    t[tx+16] = Ar[x];
    t[tx+32] = Ar[x+16];

    __syncthreads();                    // wait until the whole block has finished staging

    if (x < width && y < height)
        Dr[x] = t[tx+15] + t[tx+16] + t[tx+17];
}
This version does an aligned move of 48 elements into shared memory before extracting 18 of them to do the filter operation, and it executes in 4.8ms (221 Mpixel/sec) instead of 22.4ms. Notice the need for explicit synchronization -- this code is actually run by a 16x16 block of threads, and a barrier is needed since data is being passed between threads through shared memory. If you think that tracking race conditions on a four-core CPU is bad, well, we have 256 threads here.
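For comparison, here's a rough sketch of the texture workaround mentioned earlier, using the texture reference API; the names are mine, and the addressing shows why it's more annoying -- the 2D image has to be flattened into a 1D element index by hand:

// Texture reference bound to the source buffer; tex1Dfetch() goes through the
// texture cache, so the misaligned neighbor reads don't fall off the fast path.
texture<int, 1, cudaReadModeElementType> texSrc;

__global__ void FilterTex(int *D, int Ds, int pitchInElements, int width, int height) {
    int x = threadIdx.x + 16*blockIdx.x;
    int y = threadIdx.y + 16*blockIdx.y;
    int *Dr = (int *)((char *)D + Ds * y);

    if (x < width && y < height) {
        int i = y * pitchInElements + x;    // flatten (x, y) into a 1D index
        Dr[x] = tex1Dfetch(texSrc, i) + tex1Dfetch(texSrc, i-1) + tex1Dfetch(texSrc, i+1);
    }
}

// Host side: cudaBindTexture(0, texSrc, devSrc, bytes); then launch as usual.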
Another problem with using CUDA for image processing is that it isn't well suited for byte data. There are no byte operations other than conversions and load/store ops, so all byte values need to be widened to 32-bit ints or floats. You also don't get access to the raster-op (ROP) units, so you have to write the bytes out directly, and that's where the conversion bites you: the underlying PTX instruction set has a perfect instruction for this, a float-to-uint8 conversion with saturation, but sadly the nvcc compiler doesn't seem to generate it. Memory access is also complicated: the obvious one-thread-per-byte setup leads to uncoalesced global accesses and lots of shared memory bank conflicts. The final insult is that while there are vector types, no vector operations are defined by default, and the underlying hardware is scalar anyway. The best way I've found so far is to manually unroll the kernel by four, which adds to the complexity.
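To give a flavor of what that unrolling looks like, here's a minimal sketch (my own, not taken from my actual code) of a kernel that processes four packed bytes per thread, with manual saturation since the compiler won't emit the saturating convert; the scale-by-gain operation is just a stand-in for a real filter:

// One thread handles four packed 8-bit samples so that global loads and stores
// stay 32-bit and coalesced.
__global__ void ScaleBytes(unsigned int *D, const unsigned int *A, float gain, int count) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < count) {
        unsigned int v = A[i];      // four bytes in one 32-bit load
        unsigned int r = 0;

        for(int k = 0; k < 4; ++k) {
            float f = (float)((v >> (8*k)) & 255) * gain;   // widen byte to float

            // Manual saturation to [0, 255], since nvcc won't emit the
            // saturating PTX convert for us.
            f = fminf(fmaxf(f, 0.0f), 255.0f);
            r |= (unsigned int)(f + 0.5f) << (8*k);
        }

        D[i] = r;                   // four bytes in one 32-bit store
    }
}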
In general, optimizing a CUDA kernel unfortunately doesn't appear to be easier than optimizing a CPU kernel. As I just noted, avoiding uncoalesced global memory accesses is an absolute must, and the solution often involves ugly copies to and from the on-chip shared memory, which is limited in size (16K). Second, you have to manually tune the block size and register count until everything is juuuuust right. When I was writing an ELA deinterlacer, my initial attempt took ~12ms for a 640x480, 8-bit image. I eventually got it down to 3ms, but by that point the kernel was ugly as sin. It also doesn't help that you can easily lock up the entire system if you accidentally scribble over adjacent video memory with your kernel, an event which ate the first version of this blog posting. There's no memory protection to save you here.
The first conclusion I can draw from this experience is that CUDA can offer significantly higher performance for image processing than pixel shaders given the same hardware. Extrapolating, it looks like deinterlacing a 4:2:2 SD video frame with Yadif would take about 6-8ms/frame on this video card, which is much faster than I could do with D3D9 and more than good enough for 60 fps. The second conclusion is that writing a kernel that achieves good performance takes a lot of work, definitely more than an equivalent pixel shader based solution. You can't just take an off-the-shelf routine, compile it for the GPU, and expect good performance, but if you spend a lot of time massaging the memory access patterns, you can indeed get good results.