v1.10.4 (stable)

Main page
Archived news
Documentation
Capture
Compiling
Processing
Crashes
Features
Filters
Plugin SDK
Knowledge base
Contact info

Other projects
Altirra

Blog Archive

## ¶Fast noise

A user on my forum asked how to generate video noise, and I was intrigued enough to do a bit of tinkering.

The first issue to deal with is what random number generator to use. The easiest thing to do is just to use the C runtime library rand() function... until you realize how slow it is. The first reason is that it's commonly a linear congruential generator of the form:

x' = (x * A + B) mod 2^N

The multiply already isn't great, especially since it's usually a 32-bit multiply, but what makes this worse is that the seed used by rand() is shared state, which means that on a multithreaded CRT you're also taking a hit for thread-local storage. You can bypass the TLS issue easily enough just by reimplementing rand() locally, but then you run into the problem of how to vectorize this generator. 32-bit multiplication and addition are needed for this generator, and that's a pain in MMX/SSE2/3/4. And a 16-bit generator has way too short of a period.

In these cases, I prefer to use a linear feedback shift register, because they only require very simple logical operations. Here's a 31-bit generator:

x' = (x << 1) + (((x >> 27) ^ (x >> 30)) & 1);

That is, you XOR two bits in the register and shift the result in at one end. With appropriately chosen taps, this gives you a period of 2^N-1 for a register of size N. Just make sure to avoid zero, since the generator will get stuck there. LFSRs are a bit deficient in some properties typically desired for a random number generator, but that's less of an issue for noise generation than for, say, statistical sampling.

There is a problem with LFSRs, though, which is that if you use the straight formulation that is normally given for more than single bit output, you'll end up with a pretty crappy RNG in practice, because you get strings of values that are obviously just doubles or halves of the previous value due to the single bit shift. This looks like a lot of gradients if you're dealing with chunky images and a terrible "magic eye" image if you're filling bitplanes. Fortunately, it's fairly easy to generate more bits at a time, because you just do the generation in parallel:

x' = (x << 16) + (((x >> 12) ^ (x >> 15)) & 0xffff);

The straightforward way to vectorize this with SSE2 would be to run four generators in parallel in a 128-bit vector, and then extract the low 16 bits of each generator. The problem with doing this is that you spend a decent amount of time doing the masking and packing. A better way is to instead split the high and low halves of each generator into different registers, thus making both the extraction and 16 bit shift trivial:

`  ;xmm0 = high 16 bits of 8 generators  ;xmm1 = low 16 bits of 8 generators  paddw   xmm0, xmm0    ;xmm0 = hi << 1  movaps  xmm2, xmm1    ;xmm2 = lo  psrlw   xmm1, 12      ;xmm1 = lo >> 12  movaps  xmm3, xmm0    ;xmm3 = hi << 1  pxor    xmm3, xmm1    ;xmm3 = (hi << 1) ^ (lo >> 12)  psllw   xmm0, 3       ;xmm0 = hi << 4  psrlw   xmm1, 3       ;xmm1 = lo >> 15  pxor    xmm1, xmm0    ;xmm1 = (hi << 4) ^ (lo >> 15)  movaps  xmm0, xmm2    ;xmm0 = lo  pxor    xmm1, xmm3    ;xmm1 = (hi << 1) ^ (lo >> 12) ^ (hi << 4) ^ (lo >> 15)`

This generates eight 16-bit random values in 10 instructions using four registers, two being temporaries. It's also possible to generate 8-bit values from 16 parallel generators in a similar fashion, although then you'd need four registers just for the shift registers. You might wonder why I used a 31-bit generator instead of a 32-bit generator, since that would eliminate one of the shifts. That's because a 32-bit LFSR annoyingly requires four taps to hit maximum period instead of just two.

The last and trickiest part is how to preroll the eight generators so that they're all reading from different parts of the sequence. If they're too close together, they'll form an obvious pattern in the noise image. I'm guessing this may be better known in hardware engineering circles, but I had to figure this one out. Basically, you can model an N-bit LFSR as a multiplication by a NxN modulo-2 matrix:

x' = x * M

Note that in this case, scalar "multiplication" is a logical AND, and "addition" is an XOR.

Since multiplying by M advances the state by one step, multiplying by M twice advances it by two, and since matrix multiplication is associative, you can instead multiply by M^2:

x' = (x * M) = x * (M * M) = x * M^2

From here, you can then compute powers of two of M, and multiply together combinations of those results in order to quickly compute M to any desired power. This basically gives a closed-form solution to directly compute the state of the LFSR at any position, without having to iteratively step it up to 2^31 times. Not only does this make it easy to preroll the eight generators to adjacent portions of the sequence, but it also means that the noise generator can be made fully deterministic and support arbitrary seeking and decoding order.

You might have noticed that I used post-multiplication as the convention above. The reason is that this convention is often used along with a row-major storage ordering in memory, which is advantageous from a computation perspective as it turns vector-matrix multiplication into a series of parallel multiplies and adds instead of dot products. This is true as well for the binary matrices here, and in fact my vector-matrix multiply routine looks like this:

`uint32 Mult(uint32 v, const BinMat32& m) {    uint32 r = 0;`
`    for(int i=0; i<32; ++i) {        if (v & (1 << i))            r ^= m.m[i];    }`
`    return r;}`

Matrix-matrix multiplication is then just this applied to each row of the left-hand matrix. This isn't so fast that you'd do it in the inner loop, of course, but it's a lot faster than nested 31x31 loops and if done once a frame it's barely going to show a blip on a profile. The base matrix and the power of two matrices are properties of the LFSR and can be precomputed.

I implemented this as a VirtualDub filter, and it does generate noise very quickly. It still doesn't look like natural noise, though, because the spectrum isn't right. I tried the trick of adding groups of samples to approximate a normal distribution with a binomial, and it still doesn't look right due to the spatial spectrum. At that point I lost interest, but I think the next thing to try would have been to generate several octaves of noise and blend them together, which in my experience generates much more interesting noise.