v1.10.4 (stable)

Main page
Archived news
Documentation
Capture
Compiling
Processing
Crashes
Features
Filters
Plugin SDK
Knowledge base
Contact info
Forum

Other projects
Altirra

### Calendar

 « January 2020 S M T W T F S 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

## § ¶Fast noise

A user on my forum asked how to generate video noise, and I was intrigued enough to do a bit of tinkering.

The first issue to deal with is what random number generator to use. The easiest thing to do is just to use the C runtime library rand() function... until you realize how slow it is. The first reason is that it's commonly a linear congruential generator of the form:

x' = (x * A + B) mod 2^N

The multiply already isn't great, especially since it's usually a 32-bit multiply, but what makes this worse is that the seed used by rand() is shared state, which means that on a multithreaded CRT you're also taking a hit for thread-local storage. You can bypass the TLS issue easily enough just by reimplementing rand() locally, but then you run into the problem of how to vectorize this generator. 32-bit multiplication and addition are needed for this generator, and that's a pain in MMX/SSE2/3/4. And a 16-bit generator has way too short of a period.

In these cases, I prefer to use a linear feedback shift register, because they only require very simple logical operations. Here's a 31-bit generator:

x' = (x << 1) + (((x >> 27) ^ (x >> 30)) & 1);

That is, you XOR two bits in the register and shift the result in at one end. With appropriately chosen taps, this gives you a period of 2^N-1 for a register of size N. Just make sure to avoid zero, since the generator will get stuck there. LFSRs are a bit deficient in some properties typically desired for a random number generator, but that's less of an issue for noise generation than for, say, statistical sampling.

There is a problem with LFSRs, though, which is that if you use the straight formulation that is normally given for more than single bit output, you'll end up with a pretty crappy RNG in practice, because you get strings of values that are obviously just doubles or halves of the previous value due to the single bit shift. This looks like a lot of gradients if you're dealing with chunky images and a terrible "magic eye" image if you're filling bitplanes. Fortunately, it's fairly easy to generate more bits at a time, because you just do the generation in parallel:

x' = (x << 16) + (((x >> 12) ^ (x >> 15)) & 0xffff);

The straightforward way to vectorize this with SSE2 would be to run four generators in parallel in a 128-bit vector, and then extract the low 16 bits of each generator. The problem with doing this is that you spend a decent amount of time doing the masking and packing. A better way is to instead split the high and low halves of each generator into different registers, thus making both the extraction and 16 bit shift trivial:

`  ;xmm0 = high 16 bits of 8 generators  ;xmm1 = low 16 bits of 8 generators  paddw   xmm0, xmm0    ;xmm0 = hi << 1  movaps  xmm2, xmm1    ;xmm2 = lo  psrlw   xmm1, 12      ;xmm1 = lo >> 12  movaps  xmm3, xmm0    ;xmm3 = hi << 1  pxor    xmm3, xmm1    ;xmm3 = (hi << 1) ^ (lo >> 12)  psllw   xmm0, 3       ;xmm0 = hi << 4  psrlw   xmm1, 3       ;xmm1 = lo >> 15  pxor    xmm1, xmm0    ;xmm1 = (hi << 4) ^ (lo >> 15)  movaps  xmm0, xmm2    ;xmm0 = lo  pxor    xmm1, xmm3    ;xmm1 = (hi << 1) ^ (lo >> 12) ^ (hi << 4) ^ (lo >> 15)`

This generates eight 16-bit random values in 10 instructions using four registers, two being temporaries. It's also possible to generate 8-bit values from 16 parallel generators in a similar fashion, although then you'd need four registers just for the shift registers. You might wonder why I used a 31-bit generator instead of a 32-bit generator, since that would eliminate one of the shifts. That's because a 32-bit LFSR annoyingly requires four taps to hit maximum period instead of just two.

The last and trickiest part is how to preroll the eight generators so that they're all reading from different parts of the sequence. If they're too close together, they'll form an obvious pattern in the noise image. I'm guessing this may be better known in hardware engineering circles, but I had to figure this one out. Basically, you can model an N-bit LFSR as a multiplication by a NxN modulo-2 matrix:

x' = x * M

Note that in this case, scalar "multiplication" is a logical AND, and "addition" is an XOR.

Since multiplying by M advances the state by one step, multiplying by M twice advances it by two, and since matrix multiplication is associative, you can instead multiply by M^2:

x' = (x * M) = x * (M * M) = x * M^2

From here, you can then compute powers of two of M, and multiply together combinations of those results in order to quickly compute M to any desired power. This basically gives a closed-form solution to directly compute the state of the LFSR at any position, without having to iteratively step it up to 2^31 times. Not only does this make it easy to preroll the eight generators to adjacent portions of the sequence, but it also means that the noise generator can be made fully deterministic and support arbitrary seeking and decoding order.

You might have noticed that I used post-multiplication as the convention above. The reason is that this convention is often used along with a row-major storage ordering in memory, which is advantageous from a computation perspective as it turns vector-matrix multiplication into a series of parallel multiplies and adds instead of dot products. This is true as well for the binary matrices here, and in fact my vector-matrix multiply routine looks like this:

`uint32 Mult(uint32 v, const BinMat32& m) {    uint32 r = 0;`
`    for(int i=0; i<32; ++i) {        if (v & (1 << i))            r ^= m.m[i];    }`
`    return r;}`

Matrix-matrix multiplication is then just this applied to each row of the left-hand matrix. This isn't so fast that you'd do it in the inner loop, of course, but it's a lot faster than nested 31x31 loops and if done once a frame it's barely going to show a blip on a profile. The base matrix and the power of two matrices are properties of the LFSR and can be precomputed.

I implemented this as a VirtualDub filter, and it does generate noise very quickly. It still doesn't look like natural noise, though, because the spectrum isn't right. I tried the trick of adding groups of samples to approximate a normal distribution with a binomial, and it still doesn't look right due to the spatial spectrum. At that point I lost interest, but I think the next thing to try would have been to generate several octaves of noise and blend them together, which in my experience generates much more interesting noise.

## § ¶Video operations without pixel shaders

May be this is not the right place to ask, but can I implement brightness/contrast/hue/etc. correction in Direct3D 9 without using pixel shaders? I mean - what's the way this kind of thing is done? Where should I start from?

Well... yes, you can do some of those, but it's a pain.

If you don't have pixel shaders, then you don't have a lot of options. You basically have two choices for getting images on the screen, StretchRect() or drawing polygons. StretchRect() doesn't give you any control other than filtering, so that means drawing polygons, and without pixel shaders, that means the dreaded fixed function pipeline of Direct3D 7. The fixed function pixel pipeline is a cascade of parallel color and alpha stages, with diffuse and specular going in the head, a texture being injected to each color/alpha stage, and the output going to the framebuffer.

Now, if you're assuming lack of pixel shader support, that means you're looking at cards like an TNT/TNT2, GeForce 1/2, or Radeon. The NVIDIA cards have two textures and two blend stages, while the Radeon has three. Furthermore, none of these cards have dependent texture read support, so texture-based table lookups are out and you're going to need very simple algorithms. Brightness/contrast adjustments are doable in two stages as a multiply and an add or subtract, or one if you use a multiply-add operation. You can squeeze in simple color balance too by using separate scale and bias values for each channel. Saturation can be done as a linear blend between luminance and the original color, i.e. lerp(dot(luma_basis, color), color, factor), but can be tricky to set up with the weird DX7 signed dot product. I think it might be doable in one pass if you're not doing anything else and use the framebuffer blend hardware.

Hue shifts are annoying to do if you want the standard HSV/HSL way of rotating colors around a hexagonal prism -- I think it'd be tough to do in fixed function without at least a lot of passes. Simple rotation in chroma would be easier, but you need a minimum of two dot products to do that and that practically means multipassing.

There are two other ways that some of these operations can be done. If YCbCr-to-RGB conversion is also being done through the rasterizer hardware, then some of these operations can be folded into that operation via the color matrix coefficients, which is desirable for both performance and accuracy reasons. In full-screen mode, it is also possible to use gamma ramp tables, although you run into problems if you have any on-screen UI that you don't want whacked by the post-process transform.

Finally, this isn't going to help if you must target Direct3D, but it's worth noting that the NVIDIA cards of that era are considerably more powerful if driven from OpenGL with NV extensions instead of Direct3D. On a GeForce 2, you get to play with two full register combiners along with a lerp-capable final combiner instead of two pathetic texture blend stages.

Overall, using fixed function rasterizing hardware to process video is difficult and restrictive. Creativity is a requirement as even simple operations like subtraction can require ingenuity. You can gain additional flexibility by multipassing in the framebuffer or off-screen surfaces using the post-blender, but you pay dearly in terms of fill rate and possibly accuracy. I did this in order to implement bicubic stretching in fixed function, and when you're doing somewhere between 5-9 passes with blending on and all texture stages active the GPU may not have enough fill rate to run at full video frame rate.