UYVY vs. YUY2
Video formats on Windows are described primarily by a "four character code," or FOURCC for short. While most FOURCCs describe compressed formats such as Cinepak and the various MPEG-4 variants, they are also assigned to uncompressed YCbCr formats that aren't natively supported by the regular GDI graphics API. These FOURCCs allow video codecs to recognize their own formats for decoding purposes, as well as allowing two codecs to agree on a common interchange format.
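(As a concrete illustration, not tied to any particular SDK header: a FOURCC is just the four ASCII characters packed into a 32-bit value, first character in the lowest byte, which is what the Windows MAKEFOURCC macro does.)

    #include <stdint.h>

    /* Pack four ASCII characters into a FOURCC, first character in the
       lowest byte -- equivalent to the Windows MAKEFOURCC macro. */
    #define FOURCC(a, b, c, d) \
        ((uint32_t)(uint8_t)(a) | ((uint32_t)(uint8_t)(b) << 8) | \
         ((uint32_t)(uint8_t)(c) << 16) | ((uint32_t)(uint8_t)(d) << 24))

    enum {
        FOURCC_UYVY = FOURCC('U','Y','V','Y'),
        FOURCC_YUY2 = FOURCC('Y','U','Y','2')
    };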
Two common YCbCr FOURCCs are UYVY and YUY2. Both are interleaved formats, meaning that all YCbCr components are stored in a single stream. Chroma (color) information is stored at half the horizontal resolution of luma, with each chroma sample co-sited with the left luma sample of its pair; luma range is 16-235 and chroma range is 16-240. The formats are named after their byte order: for UYVY it is U (Cb), Y (luma 1), V (Cr), Y (luma 2), whereas YUY2 swaps the luma and chroma bytes -- Y/U/Y/V.
On Windows, YUY2 seems to be the more common of the two -- Avisynth and Huffyuv prefer it, the MPEG-1 decoder lists it first, etc. Most hardware is also capable of using both formats. Ordinarily I would consider supporting only YUY2, except that the Adaptec Gamebridge device I recently acquired only supports UYVY. Now, when working with these formats in regular CPU-based code, the distinction between them is minimal, as permuting the byte indices is sufficient to accommodate both. (In VirtualDub, conversion between UYVY and YUY2 is lossless.) When working with vector units, however, the difference between them can become problematic.
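To make the "permute the indices" point concrete, here is a minimal CPU-side sketch (illustrative only, not VirtualDub's actual code): converting between UYVY and YUY2 just swaps adjacent bytes within each 4-byte macropixel, and the same permutation works in both directions, which is why the round trip is lossless.

    #include <stddef.h>
    #include <stdint.h>

    /* UYVY macropixel: U0 Y0 V0 Y1   YUY2 macropixel: Y0 U0 Y1 V0
       Swapping each pair of adjacent bytes converts one to the other. */
    void swap_uyvy_yuy2(uint8_t *dst, const uint8_t *src, size_t pair_count) {
        for (size_t i = 0; i < pair_count; ++i) {
            dst[0] = src[1];
            dst[1] = src[0];
            dst[2] = src[3];
            dst[3] = src[2];
            dst += 4;
            src += 4;
        }
    }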
In my particular case, I'm looking at Direct3D-accelerated conversion of these formats to RGB, so the graphics card's vector unit is the pertinent one.
There are a few reasons I'm pursuing this path. One is that DirectDraw support on Windows Vista RTM seems to be pretty goofed up; video overlays seem to be badly broken on the current NVIDIA drivers for Vista, even with Aero Glass disabled. Second, I'm experimenting with real-time shader effects on live video, and want to eliminate the current RGB-to-YCbCr CPU-based conversion that occurs when Direct3D display is enabled in VirtualDub. Third, I've never done it before.
If you're familiar with Direct3D, you might wonder why I don't just use UYVY or YUY2 hardware support. Well, unfortunately, although YCbCr textures are supported by ATI, they're not supported on NVIDIA hardware. Both do support StretchRect() from a YCbCr surface to an RGB render target, but there are luma range problems when doing this. So it's down to pixel shaders.
Now, I have a bit of fondness for older hardware, and as such, I want this to work on the lowest pixel shader profile, pixel shader 1.1. The general idea is to upload the UYVY or YUY2 data to the video card as A8R8G8B8 data, and then convert it to RGB in the pixel shader. The equations for converting UYVY/YUY2 data to RGB are as follows:
R = 1.164(Y-16) + 1.596(Cr-128)
G = 1.164(Y-16) - 0.813(Cr-128) - 0.391(Cb-128)
B = 1.164(Y-16) + 2.018(Cb-128)
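Written out as scalar code, the conversion looks like this (an illustrative sketch under the Rec. 601 ranges given above; the function and helper names are mine):

    #include <stdint.h>

    static uint8_t clamp8(float v) {
        return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v + 0.5f);
    }

    /* Convert one YCbCr sample (Y 16-235, Cb/Cr 16-240) to 8-bit RGB
       using the equations above, clamping the results to 0-255. */
    void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                      uint8_t *r, uint8_t *g, uint8_t *b) {
        float luma = 1.164f * (y - 16);
        *r = clamp8(luma + 1.596f * (cr - 128));
        *g = clamp8(luma - 0.813f * (cr - 128) - 0.391f * (cb - 128));
        *b = clamp8(luma + 2.018f * (cb - 128));
    }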
As it turns out, this works out very well for UYVY. Cb and Cr naturally fall into the blue and red channels of the A8R8G8B8 texture; chroma green can be computed via a dot product and merged with a lerp. A little logic for selecting between the two luma samples based on even/odd horizontal position, and we're done. Heck, we can even use the bilinear filtering hardware to interpolate the chroma, too.
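For comparison with what the shader has to do, here's that even/odd luma selection written as scalar code (a sketch reusing the ycbcr_to_rgb helper above):

    /* One UYVY macropixel (U0 Y0 V0 Y1) expands to two RGB pixels that
       share the same chroma pair; only the luma sample differs between
       the even and odd output pixels. */
    void decode_uyvy_pair(const uint8_t src[4], uint8_t rgb[2][3]) {
        uint8_t cb = src[0], y0 = src[1], cr = src[2], y1 = src[3];
        ycbcr_to_rgb(y0, cb, cr, &rgb[0][0], &rgb[0][1], &rgb[0][2]);  /* even x */
        ycbcr_to_rgb(y1, cb, cr, &rgb[1][0], &rgb[1][1], &rgb[1][2]);  /* odd x  */
    }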
YUY2, however, is more annoying because Cb and Cr fall into the green and alpha channels, respectively. Pixel shader 1.1 is very restricted in the channel manipulation available: instructions can neither swizzle the RGB channels nor write to an arbitrary subset of them, and there is no dp4 instruction for including alpha in a dot product. Just moving the scaled Cb and Cr into position consumes two of the precious eight vector instructions:
    def c0, 0, 0.5045, 0, 0      ;c0.g = Cb_to_B_coeff / 4
    def c1, 1, 0, 0, 0.798       ;c1.rgb = red | c1.a = Cr_to_R_coeff / 2
    dp3 r0.rgb, t1_bx2, c0       ;decode Cb (green) -> chroma blue / 2
    + mul r0.a, t1_bias, c1.a    ;decode Cr (alpha) -> chroma red / 2
    lrp r0.rgb, c1, r0.a, r0     ;merge chroma red
The net result is that so far, my YUY2 shader requires one instruction pair more than the UYVY shader. I don't know if this is significant in practice, since the underlying register combiner setup of a GeForce 3 is very different and considerably more powerful than Direct3D ps1.1 -- it can do dot(A,B)+dot(C,D) or A*B+C*D in one cycle -- but I have no idea how effective the driver is at recompiling the shader for that architecture.
(If you're willing to step up to a RADEON 8500 and ps1.4, all of this becomes moot due to availability of channel splatting, arbitrary write masks, and four-component dot product operations... but where's the fun in that!?)
It seems that, at least for vector units without cheap swizzling, UYVY is a better match for BGRA image formats than YUY2 due to the way the channels line up. I've been trying to think of where YUY2 might be more appropriate, but the best I can come up with is ABGR, which is a rare format. The other possibility is that someone was doing a weird SIMD-in-scalar trick on a CPU that took advantage of the swapped channels -- with the right byte order, luma can be extracted with a mask instead of a shift, and an 8-bit shift on an 80286 or 68000 would have been expensive.