§ Optimizing for the Intel Atom CPU, part 2

There was a lot more feedback than I expected on the Atom post (http://virtualdub.org/blog/pivot/entry.php?id=286), so time for part 2.

First, some more explanation. This routine is one of the scan line renderers in Altirra, so it does indeed execute frequently -- 240 * 60 = 14,400 times per second at peak. At ~2000 cycles per iteration, that's just under 30 million cycles, which is about 1.9% of a 1.6GHz CPU. If you think this isn't much, you're right -- except that we're talking about a CPU designed for low-power netbook use. If it were just a question of attaining target performance this wouldn't be an issue at all, because Altirra already runs at full speed on the machine in question. However, when it comes to optimizing for netbooks, or even just laptops, power consumption and heat production come into play. In other words, I hate it when my laptop becomes hot and loud, and I want programs to use less CPU even if there's more available. Reducing CPU usage further allows the CPU to drop to lower performance states more often, and 30 million cycles matter a lot more when the CPU is running at 400MHz instead of 1.6GHz. Also, as I noted, this function was #2 on the sampling profile, so if I needed to reduce CPU usage further, this function would absolutely be on the optimization list.

The second thing I should point out is that I haven't actually committed to doing any Atom optimization for either Altirra or VirtualDub. Part of the reason I stumbled upon this was that I was just curious as to what was taking up the CPU time, and launching a sampling profile is really easy. The fact that the optimization is so annoying and most likely requires dropping to assembly is enough to give me pause, and I don't have a lot of motivation other than curiosity, since I can always use more powerful machines when I need to. This isn't to say, though, that it's not worth optimizing applications in general for Atom. John pointed out in the previous post that Adobe should optimize Flash for Atom, and I would be very surprised if they weren't already heavily looking at this. Inner rendering loops in Mozilla Firefox would be another good target, as web browsing in general is one of the main uses for netbooks. (Don't ask me to do this, though -- I already have enough projects as it is!)

Now, back to our little guinea pig routine.

I mentioned some stats last time for cycle counts, but I didn't actually include the profiles. Here are the top ten functions within Altirra.exe, with Atom first and then the Core 2, according to CodeAnalyst:

[CodeAnalyst profile - Atom]

[CodeAnalyst profile - Core 2]

(I definitely need to run the ClearType Tuner on the netbook.)

I was expecting to find some gem in one of VirtualDub's conversion blitter routines, since both Altirra and VirtualDub share the same blitter code for display purposes and Altirra is a much easier stress case to set up. Unfortunately, I had forgotten that I had added 8-bit paletted hardware conversion support to the Direct3D9 display path a while back, so all that happens is that the image gets memcpy()'d up into a managed texture. Whoops. Well, in any case, the difference in the profiles is interesting. The reason that this caught my attention is that RenderLores(), which corresponds to this routine, jumped up a lot higher in the profile even though it's much more streamlined than the rest of the code, which tends to be erratic. The ANTIC core, for instance, is structured as a Giant Switch Statement(tm). Therefore, it surprised me that a routine which had only a relatively simple loop fared relatively worse. Now that I think about it, though, it could also be that the Atom simply fares relatively better on branchy code.

As for actually optimizing the routine:

for(int i=0; i<w4; ++i) {
    dst[0] = dst[1] = colorTable[priTable[src[0]]];
    dst[2] = dst[3] = colorTable[priTable[src[1]]];
    dst[4] = dst[5] = colorTable[priTable[src[2]]];
    dst[6] = dst[7] = colorTable[priTable[src[3]]];
    src += 4;
    dst += 8;
}

I'm surprised that no one mentioned the possibility of pre-doubling the pixels in the color table, since scaled indexing is generally free on x86, i.e. [eax+ebx] is the same cost as [eax+ebx*2]. The downsides to this are the introduction of 16-bit opcodes in the instruction mix, which can sometimes be problematic due to instruction prefixes, and the increased cost of updating the color table. Doing so improves the best-case time from 1644 clocks to 1536 clocks.
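
Here's roughly what the pre-doubled version might look like -- just a sketch, with colorTable2 being a name I made up (misaligned word stores are legal on x86, so casting dst is safe, if inelegant):

uint16 colorTable2[256];

// Rebuilt whenever a color changes: each entry holds the color byte
// doubled, so a single 16-bit store writes both output pixels.
for(int i=0; i<256; ++i)
    colorTable2[i] = (uint16)(colorTable[i] * 0x0101);

uint16 *dst16 = (uint16 *)dst;
for(int i=0; i<w4; ++i) {
    // [base+index*2] scaled addressing makes the word-sized table
    // lookups no more expensive than the byte lookups they replace
    dst16[0] = colorTable2[priTable[src[0]]];
    dst16[1] = colorTable2[priTable[src[1]]];
    dst16[2] = colorTable2[priTable[src[2]]];
    dst16[3] = colorTable2[priTable[src[3]]];
    src += 4;
    dst16 += 4;
}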

Stuart noted that the compiler is likely to be constrained due to aliasing concerns. That's a valid point, especially since all tables here are the same type -- uint8 *, to be exact -- and therefore changes to *dst could hit the lookup tables. The full routine actually already has __restrict on all of the pointers to try to avoid this problem, but to be honest, I've never seen a case where it made any difference in VC++'s code generation. Several people suggested hoisting the loads into local variables to try to better control the fetch order:

uint8 s0 = src[0];
uint8 s1 = src[1];
uint8 p0 = priTable[s0];
uint8 p1 = priTable[s1];
uint8 c0 = colorTable[p0];
uint8 c1 = colorTable[p1];
dst[0] = dst[1] = c0;
dst[2] = dst[3] = c1;

However, it doesn't work. While VC++ isn't smart enough to take the hint that dst[] can't alias against the source tables, it does realize that reads without an intervening write can be reordered, and it does so -- pulling the interleaved reads back apart again. Doh. Ordinarily this is a good idea, as it allows the compiler much more freedom to rearrange calculations, but in this case it simply lets the compiler do exactly what we don't want it to do. Thus, we have the problem of a compiler that is not designed for the architecture we are targeting and is a little too smart for its own good. We're not really supposed to be mucking with the instruction scheduling because, well, that's the compiler's job.

In another valiant attempt to avoid assembly, Frank suggested doing some manual write combining:

uint32 c0 = colorTable[p0];
uint32 c1 = colorTable[p1];
*(uint32 *)&dst[0] = (c1 << 24) | (c1 << 16) | (c0 << 8) | c0;

A good try, but the disassembly once again shows sadness:

0000003C: 8B DE              mov         ebx,esi
0000003E: C1 E3 08           shl         ebx,8
00000041: 0B DE              or          ebx,esi
00000043: C1 E3 08           shl         ebx,8
00000046: 0B DA              or          ebx,edx
00000048: C1 E3 08           shl         ebx,8
0000004B: 0B DA              or          ebx,edx

Bitten again by another compiler optimization. Visual C++ recognizes strings of shifts and logical ORs and can rearrange them, which sometimes leads to powerful optimizations with bit flags. In this case, however, it ends up stringing everything into one long dependency chain on the EBX register, limited to one instruction per clock. Don't try to use intermediate variables to rearrange the order -- the compiler will detect the sequence of logical ORs and just linearize it again. Replacing the logical ORs with additions gives an even more interesting result:

*(uint32 *)&dst[0] = (c1 << 24) + (c1 << 16) + (c0 << 8) + c0;

00000039: C1 E3 10           shl         ebx,10h
0000003C: 03 DD              add         ebx,ebp
0000003E: 69 DB 01 01 00 00  imul        ebx,ebx,101h

Addition opens up a few more optimization opportunities for a few reasons. One, it's sometimes available on more execution ports -- I think one of the Pentium 4 versions could do logical OR on only one of the fast ALUs. Two, addition allows address generation logic to be leveraged, i.e. lea eax, [ebx+ecx]. And three, it allows multiplication tricks like this. Unfortunately, while this is a neat idea on a modern CPU, the Atom's 32-bit integer multiply has two-cycle throughput and five-cycle latency, which makes this not so hot.
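
To see why the imul version works, note what the compiler has computed at that point: x = c0 + (c1 << 16). Then x * 101h = x + (x << 8) = c0 + (c0 << 8) + (c1 << 16) + (c1 << 24). None of the byte-sized terms overlap, so no carries cross byte lanes, and the result is exactly the doubled-pixel dword from before -- two shifts and two adds traded for a single multiply.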

Kumar suggested folding the priority and color tables together into a single table. This is indeed possible: it merges the two into a single 256-byte table, neatly removing an entire layer of dependent lookups. While a good optimization, it does have a major tradeoff, which is that updating the table becomes more expensive, because now up to 256 entries have to be rewritten on a color change. As it turns out, while the priority table changes very rarely, the color table can change much more frequently, sometimes several times per scan line. That can shrink the rendering span length well below the size of the lookup table, at which point rebuilding the lookup table could easily dominate execution time. There are also plenty of cases where it doesn't, though, so the next level of optimization would be adaptively switching between the two algorithms depending on the rate of dynamic updates. That's a bit more than a quick hack on a low-level routine, but it's an interesting and potentially lucrative idea nevertheless.
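
Here's a sketch of what the folded version might look like (priColorTable and the rebuild hook are names I made up):

uint8 priColorTable[256];   // priColorTable[i] == colorTable[priTable[i]]

// Called whenever a color register or the priority mode changes; this
// rebuild is the cost being traded against the cheaper inner loop.
void RebuildPriColorTable() {
    for(int i=0; i<256; ++i)
        priColorTable[i] = colorTable[priTable[i]];
}

// The inner loop then drops to a single dependent load per pixel:
for(int i=0; i<w4; ++i) {
    dst[0] = dst[1] = priColorTable[src[0]];
    dst[2] = dst[3] = priColorTable[src[1]];
    dst[4] = dst[5] = priColorTable[src[2]];
    dst[6] = dst[7] = priColorTable[src[3]];
    src += 4;
    dst += 8;
}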

Several people mentioned the PSHUFB instruction. This is a 128-bit vector instruction that does a byte permute, looking up each byte in another vector and effectively giving you sixteen parallel lookups into a 16-entry byte table. Nice trick, and potentially lucrative. The problem in this case is that both tables are too big for a single PSHUFB lookup. The color table can be done via two PSHUFB lookups and a merge, but the 256-byte priority table definitely won't fit. This is a problem because byte gathering into vector registers on x86 is painful in general, and there's no good way to do it other than by essentially going through memory, which then leads us back to having to do a lot of itty bitty byte stores to construct each vector. Honestly, I can't see this winning unless the priority table can be efficiently replaced by a small number of logic operations. (The priority table derives from combinational logic in the GTIA, but it'd have to be something less than the several dozen gates that it takes on the real chip.) I haven't tried this myself, since, well, it's a bit more than a quick hack to test it.
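
For the color table half, the two-lookups-and-a-merge step might look something like this -- a sketch with SSSE3 intrinsics, assuming the color table fits in 32 entries split across tab0/tab1 (names made up), and ignoring the byte-gather problem that precedes it:

#include <tmmintrin.h>

// idx holds 16 color indices in the range 0-31. PSHUFB ignores bits 4-6
// of each index byte, so the tab1 lookup automatically wraps indices
// 16-31 into tab1's sixteen entries.
__m128i lookup32(__m128i idx, __m128i tab0, __m128i tab1) {
    __m128i r0  = _mm_shuffle_epi8(tab0, idx);             // entries 0-15
    __m128i r1  = _mm_shuffle_epi8(tab1, idx);             // entries 16-31
    __m128i sel = _mm_cmpgt_epi8(idx, _mm_set1_epi8(15));  // FFh where idx >= 16
    return _mm_or_si128(_mm_andnot_si128(sel, r0),
                        _mm_and_si128(sel, r1));
}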

John compared the optimization rules to RPG rules. I think it feels more like drawing a card from the deck in a board game: "You just read EAX after writing AL and AH. Go back three spaces and lose a turn."

Finally, "A Coder" expressed some concern (horror?) over the implications of going to assembly to this. Well, the best I can say is: you identify a nice hotspot routine for assembly optimization, you test it against the original routine, you keep the old routine around for reference and porting purposes, and then you bury it under some nice, clean interface and forget it. :)
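
In practice that interface can be as thin as a single function pointer -- something like this, with hypothetical names, sketching the pattern rather than Altirra's actual code:

typedef void (*RenderLoresFn)(uint8 *dst, const uint8 *src, int w4);

void RenderLores_Reference(uint8 *dst, const uint8 *src, int w4);        // portable C++, kept for testing and porting
extern "C" void RenderLores_Atom(uint8 *dst, const uint8 *src, int w4);  // hand-scheduled assembly

bool ShouldUseAtomPath();  // hypothetical CPU detection, e.g. a CPUID family/model check

// Selected once at startup; callers never know or care which one they get.
RenderLoresFn RenderLores = ShouldUseAtomPath() ? RenderLores_Atom
                                                : RenderLores_Reference;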

Miscellaneous other tidbits:

Someone sent me an assembly language attempt which unfortunately used the LOOP instruction. LOOP is an instruction that combines a decrement on ECX and a conditional branch, which sounds great except that on half the CPUs out there it's slow, including Atom. Use sub+jnz or dec+jnz instead, and toss LOOP on the pile of now useless instructions next to JECXZ and XLAT.

Comments

"Adobe should optimize Flash for Atom, and I would be very surprised if they weren't already heavily looking at this."

I'd be a little surprised as they seem quite a useless company in this regard. :-)

They still haven't worked out how to make an x64 build of Flash so I'm not sure they need the distraction of Atom optimization. :-)

Aren't people always/still complaining about Flash performance on OS X, too?

They still haven't taken the simple registry fix which makes their PDF preview handler work on x64 machines and integrated it into the Adobe Reader installer. Despite years of Adobe's customers complaining, and still many doing so to this day, people still have to find my webpage and make the fix by hand, even though the fix has been served up for Adobe on a plate and is beyond simple. (One or two GUIDs are set incorrectly by their installer. Something they would've noticed if they had someone spend even a little time investigating the problem.)

Maybe it's just x64 but Adobe don't seem particularly proactive about supporting new architectures with Flash or Reader.

You're right, though. There are a lot of netbooks and people use them to browse the web, so Adobe should be optimizing for Atom.

Leo Davidson (link) - 02 11 09 - 23:22


Leo, Linux has had x64 flash for quite some time.

John Lee - 03 11 09 - 03:40


John Lee: Even more mind-boggling why they still don't release a Windows x64 version then (or fix their PDF viewer registration so it works under x64 Windows for Windows Explorer and Office 2007). :)

It's like Adobe haven't noticed 64-bit Windows exists, even though it's mainstream now, apart from Photoshop CS4.

Leo Davidson (link) - 03 11 09 - 13:13


The real annoyance on the Atom is that half of my bag of tricks just no longer works.

For example, the pmaddubsw 1,-1 trick which gave such absurd performance benefits on Conroe, Penryn, and Nehalem is now a net negative because pmaddubsw takes 5 bloody cycles.

Dark Shikari (link) - 03 11 09 - 17:23


Well, to be fair, you mentioned Core 2 (Conroe), Core 2 enhanced (Penryn), and Core i7 (Nehalem) -- not exactly a very diverse group, and realistically some of the most tolerant and balanced x86 CPUs I've seen in a long time. Throw in Pentium M, Pentium 4, and Athlon X2 and you'll run into similar problems with tricks that simply aren't optimal for all CPUs. Remember P4? The CPU with the SIX CYCLE latency for XMM register moves??? No, I'm not bitter.

I have to say, though, there's something nice about dealing with an in-order architecture, even if it's slow. Took out three instructions? Oh look, the loop runs exactly three cycles faster. It's probably why I get the urge to go do 6502 for a while, which is nice and predictable... until I get frustrated at having to write my own frigging divide routine, and then I go back to x86.

Phaeron - 03 11 09 - 18:12


I think everyone including Intel tries very hard not to remember P4.
I know I'm mean, but my response to everyone with performance problems with a P4 (particularly for multimedia it is unbelievable how much it sucks) is "you get what you deserve for buying uninformed".

Reimar - 03 11 09 - 23:57


What about working with the CPU cache on Atom? Does it behave like its elder brother, the Core 2?

jakor - 04 11 09 - 11:07


I haven't experimented with Atom in a context where the cache is a big deal. In this particular case, all of the data fits into a small number of lines of the L1 cache.

From what I've read, the L1 cache is 32K, a respectable size. Beyond that, I don't know what exactly you're asking about, whether it be latency, bank conflicts, cache line size, L1/L2 transfer speed, misaligned access penalty, cache line split penalty, etc. I presume it's not as powerful as the Core 2 cache because well, just about everything in the Atom is less powerful.

Phaeron - 04 11 09 - 15:57


Out of curiosity, I also recently had a look at the Atom, for floating-point SSE, and what amazed me most was the 5:1 ratio between latency and throughput - that's a LOT!

Integer performance in SSE2 seems quite nice, even for shuffles (except pshufb, which is quite slow).

But honestly, I don't think it will be long before they introduce out-of-order execution even on Atom. It is in the new Via 3000 series, and while it may take a bit more silicon, they may have to in order to raise performance. I don't think even Intel expects Atom-specific optimizations, so I think the in-order Atom will be a thing of the past soon.

Klaus Post (link) - 04 11 09 - 19:35


The latency doesn't seem that unusual to me -- 3:1 for PPro-derived, ~4:1 or 5:1 for P4. Floating point on x86 historically has been about keeping the pipe full.

But yes, the Atom does actually look pretty decent at integer vector ops, with dual-issue of 128-bit ops. The scalar part is the Achilles' heel. Besides the killer AGI stalls, there's a lot of reliance on port 0.

It'll be interesting to see how Larrabee turns out, because in that case Intel is banking even more strongly on 4-way hardware threading being able to cover for lack of out-of-order.

Phaeron - 05 11 09 - 16:18


"Someone sent me an assembly language attempt which unfortunately used the LOOP instruction. LOOP is an instruction that combines a decrement on ECX and a conditional branch, which sounds great except that on half the CPUs out there it's slow, including Atom. Use sub+jnz or dec+jnz instead, and toss LOOP on the pile of now useless instructions next to JECXZ and XLAT."
Part of the reason, I think, is that Intel and AMD had to work around a timing loop in Windows 95.

Yuhong bao - 08 11 09 - 11:49


And in the end, I don't think fighting the old compiler is a good idea. Try a newer compiler instead, like the latest Intel C compiler or gcc.

Yuhong bao - 08 11 09 - 12:38


I wonder how much the Profile Guided Optimization flag in Visual Studio would improve Atom performance. Based on what PGO can do, probably not much, since it looks to be optimizing at a higher level than where this particular case needs to be optimized.

Derlin - 16 12 09 - 08:59


I tested SSE memory zeroing (8*movaps) and LOOP is significantly faster than DEC+JNZ/SUB+JNZ on my atom330 and atom570.

ano - 14 12 11 - 23:56
