Optimizing for the Intel Atom CPU, part 2

¶Optimizing for the Intel Atom CPU, part 2

There was a lot more feedback than I expected on the Atom post (http://virtualdub.org/blog/pivot/entry.php?id=286), so time for part 2.

First, some more explanation. This routine is one of the scan line renderers in Altirra, so it does indeed execute frequently -- 240 * 60 = 14,400 times per second at peak. At ~2000 cycles per iteration, that's just under 30 million cycles, which is about 1.9% of 1.6GHz. If you think this isn't much, you're right -- except that we're talking about a CPU designed for low-power netbook use. If it were just a question of attaining target performance this wouldn't be an issue at all, because Altirra already runs at full speed on the machine in question. However, when it comes to optimizing for netbooks, or even just laptops, power consumption and heat production come into play. In other words, I hate it when my laptop becomes hot and loud, and I want programs to use less CPU even if there's more available. Reducing CPU usage further allows the CPU to drop to lower performance states more often, and 30 million cycles is more of a deal when the CPU is running at 400MHz instead of 1.6GHz. Also, as I noted, this function was #2 on the sampling profile, so if I needed to reduce CPU usage lower, this function would absolutely be on the optimization list.

The second thing I should point out is that I haven't actually committed to doing any Atom optimization for either Altirra or VirtualDub. Part of the reason I stumbled upon this was that I was just curious as to what was taking up the CPU time, and launching a sampling profile is really easy. The fact that the optimization is so annoying and most likely requires dropping to assembly is enough to give me pause, and I don't have a lot of motivation other than curiosity, since I can always use more powerful machines when I need to. This isn't to say, though, that it's not worth optimization applications in general for Atom. John pointed out in the previous post that Adobe should optimize Flash for Atom, and I would be very surprised if they weren't already heavily looking at this. Inner rendering loops in Mozilla Firefox would be another good target, as web browsing in general is one of the main uses for netbooks. (Don't ask me to do this, though -- I already have enough projects as it is!)

Now, back to our little guinea pig routine.

I mentioned some stats last time for cycle counts, but I didn't actually include the profiles. Here are the top ten functions within Altirra.exe, with Atom first and then the Core 2, according to CodeAnalyst:

[CodeAnalyst profile - Atom]

[CodeAnalyst profile - Core 2]

(I definitely need to run the ClearType Tuner on the netbook.)

I was expecting to find some gem in one of VirtualDub's conversion blitter routines, since both Altirra and VirtualDub share the same blitter code for display purposes and Altirra is a much easier to set up stress case. Unfortunately, I had forgotten that I had added 8-bit paletted hardware conversion support to the Direct3D9 display path a while back, so all that happens is that the image gets memcpy()'d up into a managed texture. Whoops. Well, in any case, the difference in the profiles is interesting. The reason that this caught my attention is that RenderLores(), which corresponds to this routine, jumped up a lot higher in the profile even though it's much more streamlined than the rest of the code, which tends to be erratic. The ANTIC core, for instance is structured as a Giant Switch Statement(tm). Therefore, it surprised me that a routine which had only a relatively simple loop fared relatively worse. Now that I think about it, though, it could also be more that the Atom fares better on branchy code.

As for actually optimizing the routine:

for(int i=0; i<w4; ++i) {
    dst[0] = dst[1] = colorTable[priTable[src[0]]];
    dst[2] = dst[3] = colorTable[priTable[src[1]]];
    dst[4] = dst[5] = colorTable[priTable[src[2]]];
    dst[6] = dst[7] = colorTable[priTable[src[3]]];
    src += 4;
    dst += 8;
}

I'm surprised that no one mentioned the possibility of pre-doubling the pixels in the color table, since scaled indexing is generally free on x86, i.e. [eax+ebx] is the same cost as [eax+ebx*2]. The downsides to this are the introduction of 16-bit opcodes in the instruction mix, which can sometimes be problematic due to instruction prefixes, and the increased cost of updating the color table. Doing so increases the best case time from 1644 clocks to 1536 clocks.

Stuart noted that the compiler is likely to be constrained due to aliasing concerns. That's a valid point, especially since all tables here are the same type -- uint8 *, to be exact -- and therefore changes to *dst could hit the lookup tables. The full routine actually already has __restrict on all of the pointers to try to avoid this problem, but to be honest, I've never seen a case where it made any difference in VC++'s code generation. Several people suggested hoisting into local variables to try to better control the fetch order:

uint8 s0 = src[0];
uint8 s1 = src[1];
uint8 p0 = priTable[s0];
uint8 p1 = priTable[s1];
uint8 c0 = colorTable[p0];
uint8 c1 = colorTable[p1];

However, it doesn't work. While VC++ isn't smart enough to take the hint that dst[] can't alias against the source tables, it does realize that reads without an intervening write can be reordered, and it does so -- thus deinterleaving the reads back apart again. Doh. Ordinarily this is a good idea as it allows the compiler much more freedom to rearrange calculations, but in this case it simply lets the compiler do exactly what we don't want it to do. Thus, we have the problem of a compiler that is not designed for the architecture we are targeting and is a little too smart for its own good. We're not really supposed to be mucking with the instruction scheduling because, well, that's the compiler's job.

In another valiant attempt to avoid assembly, Frank suggested doing some manual write combining:

uint32 c0 = colorTable[p0];
uint32 c1 = colorTable[p1];

*(uint32 *)&dst[0] = (c1 << 24) | (c1 << 16) | (c0 << 8) | c0;

A good try, but the disassembly once again shows sadness:

0000003C: 8B DE              mov         ebx,esi
0000003E: C1 E3 08           shl         ebx,8
00000041: 0B DE              or          ebx,esi
00000043: C1 E3 08           shl         ebx,8
00000046: 0B DA              or          ebx,edx
00000048: C1 E3 08           shl         ebx,8
0000004B: 0B DA              or          ebx,edx

Bitten again by another compiler optimization. Visual C++ recognizes strings of shifts and logical ORs and can rearrange them, which sometimes leads to powerful optimizations with bit flags. In this case, however, it ends up stringing everything into one long dependency chain on the EBX register at one instruction/clock. Don't try to use intermediate variables to rearrange the order -- the compiler will detect sequence of logical ORs and just linearize it again. Replacing the logical ORs with additions gives an even more interesting result:

*(uint32 *)&dst[0] = (c1 << 24) + (c1 << 16) + (c0 << 8) + c0;
 
00000039: C1 E3 10           shl         ebx,10h
0000003C: 03 DD              add         ebx,ebp
0000003E: 69 DB 01 01 00 00  imul        ebx,ebx,101h

Addition opens up a few more optimization opportunities for a few reasons. One, it's sometimes available on a few more execution ports -- I think one of the Pentium 4 versions could do logical OR on only one of the fast ALUs. Two, addition allows address generation logic to be leveraged, i.e. lea eax, [ebx+ecx]. And three, it allows multiplication tricks like this. Unfortunately, while this is a neat idea on a modern CPU, the Atom's 32-bit integer multiply has two-cycle throughput and five cycle latency, which makes this not so hot.

Kumar suggested folding together the priority and color tables into a single table. This is indeed possible, and it merges the two into a single 256 byte table, neatly removing an entire layer of dependent lookups. While a good optimization, it does have a major tradeoff, which is that it makes updating the tables more expensive because now up to 256 entries have to be updated on a color change. As it turns out, in this case while the priority table changes very rarely, the color table can change a lot more frequently, sometimes several times per scan line. This reduces the rendering span length well below the size of the lookup table, which could easily cause rebuilding the lookup table to dominate execution time. There are also frequently cases where it doesn't, however, in which case the level of optimization is raised to adaptive switching between two algorithms depending on the rate of dynamic updates. This is a bit more than a quick hack on a low-level routine, but it's an interesting and potentially lucrative idea nevertheless.

Several people mentioned the PSHUFB instruction. This instruction is a 128-bit vector instruction that does a byte permute, looking up each byte in another vector and effectively giving you 16 byte lookups into a 16 entry table. Nice trick, and potentially lucrative. The problem in this case is that both tables are too big for a single PSHUFB lookup. The color table can be done via two PSHUFB lookups and a merge, but the 256 byte priority table definitely won't fit. This is a problem because byte gathering in vector registers on x86 is painful in general and there's no good way to do it other than by essentially going through memory, which then leads us back to having to do a lot of itty bitty byte stores to construct each vector. Honestly, I can't see this winning unless the priority table can be efficiently replaced by a small number of logic functions. (The priority table derives from combinatorial logic in the GTIA, but it'd have to be something less than the several dozen gates that it takes on the real chip.) I haven't tried this myself, since, well, it's a bit more than a quick hack to test it.

John compared the optimization rules to RPG rules. I think it feels more like drawing a card from the deck in a board game: "You just read EAX after writing AL and AH. Go back three spaces and lose turn."

Finally, "A Coder" expressed some concern (horror?) over the implications of going to assembly to this. Well, the best I can say is: you identify a nice hotspot routine for assembly optimization, you test it against the original routine, you keep the old routine around for reference and porting purposes, and then you bury it under some nice, clean interface and forget it. :)

Miscellaneous other tidbits:

The AGI window isn't three instructions, but three cycles, which means up to seven instructions. That'd be easier to deal with in 64-bit mode, where there are twice as many registers, except that not all Atom models support 64-bit.
Someone sent me an assembly language attempt which unfortunately used the LOOP instruction. LOOP is an instruction that combines a decrement on ECX and a conditional branch, which sounds great except that on half the CPUs out there it's slow, including Atom. Use sub+jnz or dec+jnz instead, and toss LOOP on the pile of now useless instructions next to JECXZ and XLAT.
While looking over the instruction tables, I also noticed that BSF/BSR are also slow again. These are two frustrating instructions because they're great for bitstream parsing, but depending on the CPU they're either really fast or really slow. Atom is in the slow camp (16 clock latency). Shame... they've become more attractive now that VC++ has BitScanForward/Reverse intrinsics.
There's a new MOVBE instruction for doing an endian-swapped move. I can't find any performance information on it, though.

14 comments | Nov 02, 2009 at 17:11 | default

Current version

Navigation

Archives

¶Optimizing for the Intel Atom CPU, part 2

Comments