MMX throughout the years
VirtualDub, as a video program, is a very heavy user of MMX integer vector math instructions in x86 CPUs; in fact, most of the inner processing loops are almost exclusively MMX. Part of the reason is ease of coding, since it's easier to operate on an (R,G,B) triplet with one MMX instruction than to use three separate ones and juggle three times as many values in only eight registers. The other is the significant performance gain that results.
A problem with using MMX, and its successor instruction set extensions SSE and SSE2, is that you have to pay careful attention to which CPUs support which extensions and how well each CPU runs them. Here is a braindump of my experiences over the years while working on VirtualDub.
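As a rough illustration of the "one instruction instead of three" point, here is a minimal sketch in C. It is not VirtualDub code: the function names are made up for this example, the pixel layout (8 bits per channel packed into a 32-bit word) is an assumption, and it uses the SSE2 intrinsic _mm_adds_epu8 (the 128-bit form of MMX's paddusb) only because that compiles on current x86-64 toolchains. On non-x86 targets the packed version simply falls back to the scalar one.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Scalar version: one clamped add per 8-bit channel (hypothetical helper). */
static uint32_t add_pixels_scalar(uint32_t a, uint32_t b) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF);
        if (sum > 0xFF)
            sum = 0xFF;               /* saturate instead of wrapping */
        out |= sum << shift;
    }
    return out;
}

/* Packed version: all channels are added and saturated by one instruction. */
static uint32_t add_pixels_packed(uint32_t a, uint32_t b) {
#ifdef HAVE_SSE2
    __m128i va = _mm_cvtsi32_si128((int)a);
    __m128i vb = _mm_cvtsi32_si128((int)b);
    return (uint32_t)_mm_cvtsi128_si32(_mm_adds_epu8(va, vb));
#else
    return add_pixels_scalar(a, b);   /* assumption: no SSE2 available */
#endif
}
```

Both functions should return the same value for any pair of pixels; the packed one just does it without a per-channel loop or any unpacking.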
The Pentium MMX:
It all started with the Pentium MMX, which introduced the first "official" vector instruction set extension to the architecture. The rules for MMX execution were as follows:
* The Pentium had a pair of execution pipes, U and V, so its peak throughput was two instructions per clock.
* Most instructions could decode and execute in either pipe in one clock.
* Only one multiply and one shift could execute per cycle.
* Multiplies had a latency of three clocks.
* Instructions that accessed memory or the integer register file could only execute in the U pipe, and could not pair with non-MMX instructions.
* One cycle had to pass between the last write to an MMX register and a store from it.
The rules weren't hard to follow, so it wasn't hard to achieve two MMX ops per clock. The toughest parts were covering the 3-clock latency on multiplies and the store latency, which usually meant twisting the end of a loop a bit to avoid a stall on the store while also avoiding an address generation interlock (AGI) at the top of the next iteration.
A lot of VirtualDub's older MMX code is tuned against the Pentium MMX, such as the reduce and resize filters. MMX was a huge jump in performance for many video tasks -- some of VirtualDub's routines run three times faster with MMX than without it. I have fond memories of the Pentium MMX because the additional MMX instructions were the key to making my MPEG-1 video decoder achieve full frame rate on my 200MHz machine at the time.
The Pentium II:
Intel's Pentium II CPU brought MMX to the Pentium Pro's out-of-order architecture. It used the same MMX unit as the Pentium MMX, so the execution behavior was the same; even the 4-1-1 decode template didn't make much difference, as all of the MMX execute instructions were 1 uop and load-execute or store instructions were 2, which gave you the same decoding behavior as the PMMX's memory-in-U restriction. The main change was that out-of-order (OOO) execution meant you no longer had to manually cover multiply latency -- putting a dependent op right up against a multiply was no longer a guaranteed stall.
I should note that optimizing for the Pentium II was frustrating compared to the Pentium, because the out-of-order architecture made it difficult to determine bottlenecks. However, this was a period of very rapid CPU power increase -- the Pentium Pro architecture started around 150MHz and made it all the way up to 1.13GHz with the Pentium III. With such a ridiculous rate of increase, real-time 320x240 soon became a no-brainer, and full-size 640x480 was a reality.
The Pentium III:
The Pentium III added SSE, which was a mix of a few integer and a bunch of floating-point vector instructions. The integer instructions were welcome, particularly shuffle, prefetch, and streaming store. A couple of new averaging instructions helped speed up MPEG decoders, but the big one was the packed sum of absolute differences instruction (psadbw), which boosted encoder performance.
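To make the psadbw point concrete, here is a hedged sketch of a 16-byte sum of absolute differences, the core of a block-matching comparison in an encoder. On the Pentium III the instruction worked on 64-bit MMX registers; this sketch uses the 128-bit SSE2 form through the _mm_sad_epu8 intrinsic because it builds on current x86-64 compilers. The function names are invented for this example, and the packed version falls back to the scalar loop on non-x86 targets.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Scalar reference: sum of |a[i] - b[i]| over 16 bytes. */
static unsigned sad16_scalar(const uint8_t *a, const uint8_t *b) {
    unsigned sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Packed version: one psadbw produces two partial sums, one per 64-bit half. */
static unsigned sad16_packed(const uint8_t *a, const uint8_t *b) {
#ifdef HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);
    /* Low half's sum sits in word 0, high half's sum in word 4. */
    return (unsigned)_mm_cvtsi128_si32(s) + (unsigned)_mm_extract_epi16(s, 4);
#else
    return sad16_scalar(a, b);        /* assumption: no SSE2 available */
#endif
}
```

The scalar loop needs a subtract, an absolute value, and an add per byte, which is why collapsing all of that into a single instruction matters so much for motion search.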
The floating-point instructions, on the other hand... were a mixed bag. Part of the problem was the awkward data movement; getting any integer values smaller than 32-bit into SSE registers was a pain, and the shuffle instruction was weird. What was really bad, though, was that all 128-bit SSE ops actually executed as two 64-bit ops to the same single execution port. This meant the CPU could only decode one 128-bit instruction per cycle and only execute them every two clocks! For algorithms that could use either MMX or SSE, this meant a hefty 4:1 advantage in peak throughput in MMX's favor. In fact, except for the additional registers, SSE operations act a lot like pairs of 3DNow! instructions.
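One way to arrive at the 4:1 figure is sketched below. It assumes an algorithm that can process either four 16-bit integers per MMX op or four floats per SSE op; that pairing, and the function names, are assumptions made for illustration rather than anything measured.

```c
#include <assert.h>

/* Peak elements per clock = ops issued per clock * elements per op. */
static double peak_elems_per_clock(double ops_per_clock, int elems_per_op) {
    assert(ops_per_clock > 0.0 && elems_per_op > 0);
    return ops_per_clock * elems_per_op;
}

/* Ratio of MMX to SSE peak throughput on the Pentium III,
   using the issue rates quoted in the text. */
static double p3_mmx_vs_sse_ratio(void) {
    /* MMX: two 64-bit ops per clock, four 16-bit elements each. */
    double mmx = peak_elems_per_clock(2.0, 4);
    /* SSE: one 128-bit op every two clocks, four floats each. */
    double sse = peak_elems_per_clock(0.5, 4);
    return mmx / sse;
}
```

Under those assumptions MMX peaks at eight elements per clock against two for SSE, which is where the factor of four comes from.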
For the above reasons, and also because floating-point SSE wasn't supported by AMD until the Athlon XP, I haven't used much FP SSE in VirtualDub. There is a little of it in the audio code for sample conversion, but that's it.
The Pentium 4:
Pentium 4's NetBurst architecture brings a revamped pipeline and the SSE2 instruction set to the table, the latter of which adds double-precision and integer operations to SSE. However, it also brings a set of new challenges. One is that the Pentium 4 has pretty bad latency over a wide range of instructions; another is that only one execution port can execute MMX ALU operations, instantly halving peak throughput compared to the Pentium II architecture. Even worse, register-to-register moves have a ridiculous six-clock latency, which makes MMX dependency chains on the Pentium 4 quite long. Overall, this makes the Pentium 4 rather bad at MMX compared to the Pentium II and III.
The saving grace, however, is that SSE and SSE2 128-bit ops only take a single micro-op on P4, compared to two on PII/III. This means that 128-bit instructions can issue twice as fast as they execute, and more importantly, means that you can alternate between execution subunits. In particular, multiply and add/shift operations can overlap. When perfectly balanced this leads to two 64-bit operations per cycle and thus parity with the PII/III's peak throughput. You can usually get decent gains in a FIR loop, where you have a mix of pmaddwd and paddd instructions.
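The kind of FIR inner loop meant here can be sketched with SSE2 intrinsics: _mm_madd_epi16 maps to pmaddwd and _mm_add_epi32 to paddd, so the multiply and add units alternate on each iteration. This is a minimal example for a single output sample over 16-bit data, not VirtualDub's actual filter code; the function names and the requirement that the tap count be a multiple of eight are assumptions made to keep it short. On non-x86 targets the packed version falls back to the scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Scalar reference: sum of x[i] * c[i]. */
static int32_t fir_tap_sum_scalar(const int16_t *x, const int16_t *c, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)x[i] * c[i];
    return acc;
}

/* Packed version: eight 16-bit multiplies per pmaddwd, accumulated with paddd. */
static int32_t fir_tap_sum_packed(const int16_t *x, const int16_t *c, int n) {
    assert(n % 8 == 0);               /* assumption: no tail handling */
#ifdef HAVE_SSE2
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        /* pmaddwd: multiply pairs and add neighbours into four 32-bit lanes. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(vx, vc));
    }
    /* Fold the four 32-bit lanes down to one. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
#else
    return fir_tap_sum_scalar(x, c, n);
#endif
}
```

A real loop would typically keep several accumulators in flight so that the adds do not form one long dependency chain, which matters more on the Pentium 4 given its latencies.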
I should note that the Pentium 4's problems have to be balanced against its hefty 50% clock rate lead. It's difficult to achieve, but a Pentium 4 running at peak rate crunches pixels at a very scary rate.
The definitive resource for information on Pentium, PPro/II/III, and P4 tuning is Agner Fog's "How to optimize for the Pentium microprocessors".
The Pentium-M:
Pentium-M uses a revamped Pentium III architecture with support for SSE2 instructions. I don't have a Pentium-M and thus don't have tuning experience with it, although from what I hear its per-clock performance is a lot more pleasing than the Pentium 4's. What would be interesting is to find out whether SSE2 actually helps or hurts the Pentium-M compared to MMX. Its decoder is improved compared to the Pentium III's, but if it still has problems decoding multi-uop instructions then it may actually be faster to use MMX than SSE2.
The Athlon 64:
AMD's Athlon 64 has a burly front-end decoder, so it has far fewer decode bottlenecks than Intel CPUs. From what I can tell so far in CodeAnalyst's pipeline analysis mode, its execution performance for MMX/SSE/SSE2 code is very comparable to the Pentium II/III: two 64-bit MMX ops/clock or one 128-bit SSE2 op/clock. This isn't surprising considering that benchmarks have shown the Pentium-M's enhanced PPro architecture to be competitive with Athlons in performance. The one execution advantage that I know of so far is that the Athlon 64 can do two 64-bit shifts per clock, which is a dubious advantage.
I've only recently begun profiling VirtualDub code on Athlon 64, but for the most part it's a lot simpler: to figure out the decode time for a loop in clocks, take the number of instructions in the loop and divide by three. After that, keep the pipes full by breaking dependencies and balancing the execution units.
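The rule of thumb above is simple enough to write down directly. The helper name below is made up for this sketch, and the result is only the decode-side estimate; dependency chains and execution-unit contention can still make the real loop slower.

```c
#include <assert.h>

/* Estimated decode clocks per loop iteration on Athlon 64:
   instruction count divided by three, per the rule of thumb in the text. */
static double athlon64_decode_clocks(int instructions_in_loop) {
    assert(instructions_in_loop >= 0);
    return instructions_in_loop / 3.0;
}
```

For example, a 12-instruction loop comes out at 4 decode clocks per iteration, so there is no point tuning its execution below that without first removing instructions.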