Optimizing for the Intel Atom CPU

I recently picked up a netbook with the Intel Atom CPU in it, and was pleasantly surprised by its performance. The Atom CPU is no rocket, but it does run at 1.6GHz and it wasn't too long ago that the fastest desktop CPUs were still well below 1GHz. Yeah, it's in-order... but so was the Pentium 120 that I had when I started writing VirtualDub, so big deal. Unsurprisingly, the old MPEG-1 files I used to test with still played just fine.

Now, I was a little bit more worried about Altirra, because its system requirements are higher and it has a strict real-time requirement. I was relieved to find out that it runs in real time on the Atom at around 20% of the CPU, but what was surprising was that one particular loop in the video subsystem was taking a tremendous amount of CPU time:

for(int i=0; i<w4; ++i) {
    dst[0] = dst[1] = colorTable[priTable[src[0]]];
    dst[2] = dst[3] = colorTable[priTable[src[1]]];
    dst[4] = dst[5] = colorTable[priTable[src[2]]];
    dst[6] = dst[7] = colorTable[priTable[src[3]]];
    src += 4;
    dst += 8;
}

What this loop does is translate from raw playfield and sprite data into 8-bit pixels, first going through a priority table and then a color table. The highest dot clock on the Atari is 7MHz (one-half color clock per pixel), but this handles the low-resolution modes which can only output at 3.5MHz, so each pixel is doubled up. This routine wasn't showing up hot on the systems I had tried previously, but on the Atom-based system it was #2 on the CodeAnalyst profile, right below the CPU core.
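For anyone who wants to run the loop in isolation, here is a minimal self-contained sketch. The function name expandLoRes, the parameter types, and the argument order are assumptions made for illustration; only the loop body comes from the post.

```cpp
#include <cstdint>

// Hypothetical wrapper around the loop above; the name and signature are
// assumptions, not Altirra's actual interface.
//   src:        w4*4 bytes of raw playfield/sprite data
//   dst:        w4*8 bytes of output (each source pixel is written twice)
//   priTable:   256-entry priority table, indexed by a raw source byte
//   colorTable: color table, indexed by the priority table's output
void expandLoRes(std::uint8_t *dst, const std::uint8_t *src, int w4,
                 const std::uint8_t *priTable,
                 const std::uint8_t *colorTable) {
    for (int i = 0; i < w4; ++i) {
        // Each output value needs two dependent loads: the priority
        // lookup produces the index for the color lookup.
        dst[0] = dst[1] = colorTable[priTable[src[0]]];
        dst[2] = dst[3] = colorTable[priTable[src[1]]];
        dst[4] = dst[5] = colorTable[priTable[src[2]]];
        dst[6] = dst[7] = colorTable[priTable[src[3]]];
        src += 4;
        dst += 8;
    }
}
```

Called with w4 = 1, four source bytes become eight output bytes, with every looked-up color appearing twice in a row.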

I hadn't done any Atom optimization before, so I dug around the usual sites for information. Everyone knows the Atom is an in-order core, so lots of branching and cache misses are bad news. However, the loop above is fairly well behaved because the priority table is small (256 bytes) and the color table is even smaller (23 bytes). Looking through the Intel optimization guide, however, this caught my eye:

12.3.2.2 Address Generation

The hardware optimizes the general case of instruction ready to execute must have data ready, and address generation precedes data being ready. If address generation encounters a dependency that needs data from another instruction, this dependency in address generation will incur a delay of 3 cycles.

This has dire consequences for any routine that does heavy table lookups. Address generation interlock (AGI) stalls are a consequence of CPU pipelining setups where address generation is performed by a separate stage ahead of the main execution stage; the benefit is that address generation can overlap execution instead of extending instruction time, but the downside is that a stall has to occur if the data isn't ready in time. In IA-32, this first became a problem in the 80486, where a one-clock stall occurred if you indexed using the result of the previous instruction. AGI stalls then became slightly more serious with the Pentium, where you then had to ensure that an instruction pair didn't generate an address from the result of the previous pair, usually by putting another pair of instructions between. The Atom has a much larger window of 3 cycles to cover, which is a lot harder when you only have eight GPRs.

But it gets worse.

The Pentium has two execution pipes that run in parallel, called the U and V pipes, both of which can execute one load or store instruction per cycle. The Atom too has two integer execution units and can execute a pair of instructions per clock at peak. However, unlike the Pentium, the Atom can only execute integer loads and stores in pipe 0. This means that not only does the Atom have a huge latency window to cover when doing table lookups, but it's also bottlenecked on only one of its two execution pipes. Yuck.

How bad is it? Well, let's look at the code that Visual C++ 2005 generated for that loop:

00000021: 0F B6 11           movzx       edx,byte ptr [ecx]
00000024: 0F B6 14 32        movzx       edx,byte ptr [edx+esi]
00000028: 8A 14 3A           mov         dl,byte ptr [edx+edi]
0000002B: 88 50 01           mov         byte ptr [eax+1],dl
0000002E: 88 10              mov         byte ptr [eax],dl
00000030: 0F B6 51 01        movzx       edx,byte ptr [ecx+1]
00000034: 0F B6 14 32        movzx       edx,byte ptr [edx+esi]
00000038: 8A 14 3A           mov         dl,byte ptr [edx+edi]
0000003B: 88 50 03           mov         byte ptr [eax+3],dl
0000003E: 88 50 02           mov         byte ptr [eax+2],dl
00000041: 0F B6 51 02        movzx       edx,byte ptr [ecx+2]
00000045: 0F B6 14 32        movzx       edx,byte ptr [edx+esi]
00000049: 8A 14 3A           mov         dl,byte ptr [edx+edi]
0000004C: 88 50 05           mov         byte ptr [eax+5],dl
0000004F: 88 50 04           mov         byte ptr [eax+4],dl
00000052: 0F B6 51 03        movzx       edx,byte ptr [ecx+3]
00000056: 0F B6 14 32        movzx       edx,byte ptr [edx+esi]
0000005A: 8A 14 3A           mov         dl,byte ptr [edx+edi]
0000005D: 88 50 07           mov         byte ptr [eax+7],dl
00000060: 88 50 06           mov         byte ptr [eax+6],dl
00000063: 83 C1 04           add         ecx,4
00000066: 83 C0 08           add         eax,8
00000069: 83 EB 01           sub         ebx,1
0000006C: 75 B3              jne         00000021

Take the Atom's behavior with regard to AGI stalls and memory access pipe restrictions into account and it's not hard to see that this is very, very bad code to execute on the Atom. It's been my experience that Visual C++ tends to make little or no effort at interleaving instructions, which is rational when you consider the behavior of many PPro-derived architectures with regard to out-of-order execution, register renaming stalls, and avoiding register spills with the pathetically low register count. On the Atom, however, it leads to very poor performance in this case because nearly all of the code is serialized in one pipe and all dependent lookups are placed back-to-back for maximum stallage.

So what can we do? Well, time to dust off the old Pentium-era U/V pipe skills:

   ;eax Temp 0
   ;ebx Temp 1
   ;ecx Source
   ;edx **Unused
   ;esi Priority table
   ;edi Color table
   ;ebp Destination
   ;esp Counter
xloop:
   add   ebp, 4         ;ALU1
   movzx eax, [ecx+1]   ;ALU0
   movzx ebx, [ecx]     ;ALU0
   add   ecx, 2         ;ALU1
   movzx eax, [esi+eax] ;ALU0
   movzx ebx, [esi+ebx] ;ALU0
   mov   al, [edi+eax]  ;ALU0
   mov   bl, [edi+ebx]  ;ALU0
   mov   ah, al         ;ALU0/1
   shl   eax, 16        ;ALU0
   mov   bh, bl         ;ALU0/1
   mov   ax, bx         ;ALU0/1
   mov   [ebp-4], eax   ;ALU0
   sub   esp, 2         ;ALU0/1
   jae   xloop          ;B

Ugly? Definitely. You might have noticed that I reused the stack pointer but left EDX free. That's because I found a way to open up that register and was trying to open up another register so that I could increase the parallelism from 2x to 4x to remove the remaining AGI stalls, but I couldn't find a way to do it. One beneficial side effect to the Atom's in-order architecture that I've leveraged here is that partial register stalls largely aren't a problem. With most modern x86 CPUs, it's become regular practice to avoid merging results in low/high byte registers such as AH/AL, because of various penalties associated with doing so. At a minimum you'd end up taking a couple of clocks of penalties, and on older P6 era CPUs it would stall the pipeline. It appears that the only partial register issue in the Atom is that you can't have simultaneously executing instructions targeting different parts of the same register. That means it's open season again on combining bytes into words for free.
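For readers who don't read x86 assembly, the data flow of the routine above can be sketched in C++. This is only an illustration of the idea: the name expandLoResPairs and its signature are assumptions, the single 32-bit store assumes a little-endian target, and a compiler is free to reorder these statements, so nothing here guarantees the instruction schedule the hand-written version gets.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical C++ rendering of the hand-written loop: two independent
// lookup chains are advanced side by side, then the two color bytes are
// doubled and merged into one 32-bit store.
// Assumes a little-endian target (as on x86).
void expandLoResPairs(std::uint8_t *dst, const std::uint8_t *src, int pairs,
                      const std::uint8_t *priTable,
                      const std::uint8_t *colorTable) {
    for (int i = 0; i < pairs; ++i) {
        // Chain A and chain B do not depend on each other, so each step
        // of one can sit between dependent steps of the other.
        std::uint32_t a = src[1];
        std::uint32_t b = src[0];
        a = priTable[a];
        b = priTable[b];
        a = colorTable[a];
        b = colorTable[b];

        // Equivalent of the AH/AL and BH/BL merging: the word holds
        // b, b, a, a in increasing memory order on little-endian.
        const std::uint32_t word = (a << 24) | (a << 16) | (b << 8) | b;
        std::memcpy(dst, &word, sizeof word);

        src += 2;
        dst += 4;
    }
}
```

The output is byte-for-byte the same as the original loop; only the order of the loads and the width of the store differ.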

Anyway, benchmarking the old routine against the new routine for 228 source pixels (456 output pixels) gives 2700 clocks for the old routine and 1644 clocks for the new routine, a ~40% improvement. The CodeAnalyst profile of Altirra shows similar improvement, so this is a real gain. Unfortunately, on the Core 2 it's the opposite story with the new routine being half as fast: 775 clocks vs. 1412 clocks. This leads me to believe that optimizing for Atom largely requires new code paths instead of merely tweaking existing ones, in order to avoid regressions on faster machines.
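Since the numbers above favor opposite routines on the two CPUs (1644 vs. 2700 clocks on the Atom, 775 vs. 1412 on the Core 2), one way to keep separate code paths is to pick a routine once at startup through a function pointer. The sketch below is a minimal illustration, not how Altirra does it: ExpandFn, ExpandPaths, selectExpandPath, and the cpuIsInOrder flag are all hypothetical names, and how that flag gets set (for example from a CPUID check) is left out.

```cpp
#include <cstdint>

// Hypothetical common signature for both versions of the routine.
using ExpandFn = void (*)(std::uint8_t *dst, const std::uint8_t *src,
                          int count, const std::uint8_t *priTable,
                          const std::uint8_t *colorTable);

// The two candidate code paths: one tuned for out-of-order cores,
// one scheduled by hand for in-order cores such as the Atom.
struct ExpandPaths {
    ExpandFn outOfOrder;
    ExpandFn inOrder;
};

// Pick a path once; callers then invoke the returned pointer in the
// hot loop without re-checking the CPU type.
// cpuIsInOrder is an assumed input, detected elsewhere.
ExpandFn selectExpandPath(const ExpandPaths &paths, bool cpuIsInOrder) {
    return cpuIsInOrder ? paths.inOrder : paths.outOfOrder;
}
```

The selection cost is paid once, and faster machines keep running the routine that was already best for them.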

Is it worth optimizing for Atom? I'm not sure. Certainly there are significant gains to be made, but not all applications are suited for netbooks. An Atom-based computer would surely not be a first choice for HD video editing. Optimizing for an architecture like this in a compiler also requires a very aggressive code generator, and my experience in the Pentium era was that compilers really weren't up to the task. Current versions of Visual Studio definitely aren't; supposedly Intel C/C++ now has some support for Atom optimization, but I don't know how effective it is. There's also the question of how much multithreading can help cover for the execution delays on single-threaded code, although in some ways that feels like replacing a big problem with an even bigger problem.

For the meantime, though, it definitely seems like what's old is new again.

Comments

Comments posted:


I love using virtualdub even on my eeepc but I have no idea what you are talking about. You are way too advanced for me. Don't give yourself extra work! My eeepc is slow with virtualdub but it is ok because I was not expecting it to be fast. You rule!

Ranger - 01 11 09 - 13:52


I suppose that's one of the benefits of Atom. From a performance standpoint it sucks compared to most CPUs, but from a feature set standpoint it's pretty good: it supports up to SSSE3, which puts it at a user-space feature level of the original Core architecture. That means it's capable of running an awful lot of software, albeit slowly.

Of course, if you're a commercial software vendor, it's a bit of a nightmare to have the computing populace start moving toward CPUs that are half an order of magnitude slower, because then the spread of CPU performance you have to accommodate gets wider....

Phaeron - 01 11 09 - 15:34


I've been watching with horror as GCC has developed Atom optimizations and the open-source community has devoured them. Fedora 12 is officially rebasing from i586 to i686, because we already know i686 code is faster. (around 1% average) But then they make this decision to use Atom optimization for all i686 packages by default, (since, apparently, everyone everywhere is throwing away all their computers and buying Acer Aspire One's) and I think to myself, GREAT, my Linux desktop is going to get slower *AGAIN*. How much slower? No one knows, and no one in a position to do anything about it seems to care.

I really, really think the world would be a better place if Intel had not released the Atom.

Naptastic - 01 11 09 - 16:47


I really have to ask -- is the difference that important in your case? The difference I note here is quite dramatic, because we're going from code that is reasonably well optimized for out-of-order architectures and pessimal for Atom to the opposite case. For the vast majority of code in a typical Linux distribution, however, I would be surprised if you saw that huge of a difference. We're talking about C/C++ compiler settings, the compiler isn't going to skew the code that heavily, and the out-of-order architectures will handle Atom optimized code better than vice-versa. Any programs that require deep amounts of optimization, especially multimedia apps, are likely to have assembly language portions that are unaffected anyway.

Besides, if it matters that much to you, why not use a distro like Gentoo where you can compile with optimal settings for your system? "i686" isn't exactly a great fit, either. Pentium 4, Pentium M, and Athlon 64/X2 have some differences in optimization strategy.

Phaeron - 01 11 09 - 17:04


As a coder I learned to start simple and gain complexity by combining simple blocks. This way I can easily check that my base is flawless and move on gaining complexity without losing control.
If you look at the different modern processors, you can't really tell what your base is (Multi-Core, Single-Core, SSE1-n, MMX, Vanderpool, x64). How can anyone code something flawless on this? Somehow it looks like building skyscrapers in an earthquake-area - very arduous.

A Coder - 01 11 09 - 20:39


I use my Atom-based computer as my default computer, and I do use VirtualDub a lot. I can't really always expect great performance from it, but I don't know if it's worth the work of optimizing for it.
In the next few months, dual-core Atoms will be coming to the regular user, and even ION-based ones. So performance will catch up in the netbook world (as far as possible, anyway).
In my opinion, when coding, keep the Atom processor in mind, but don't decrease the performance of other processors that are much more important.

Pofis - 01 11 09 - 21:15


Hey there,

I find this stuff fascinating, but other than the Intel optimization manuals (and resources such as Agner Fog's docs), I know little about CPU architectures. Do you know of a good book(s) that provides a good introduction to this kind of optimization?

David

David - 02 11 09 - 00:35


Wow, in all my 20+ years of professional software engineering, I never realized the similarity between programming and playing RPGs. "If address generation encounters a dependency that needs data from another instruction, this dependency in address generation will incur a delay of 3 cycles."

John Morton - 02 11 09 - 01:15


Due to mobility reasons I was forced to switch to Atom (MSI Wind) last July from 17" with 2.0GHz Merom and GF7600. First days I thought, oh, just a toy, I just have to use that "outside". It turned out that I do almost everything on it now, save for graphics (retouching) because of small resolution screen and larger scale video work (I changed to 320GB/7200rpm, but it's still small), smaller videos are processed just fine. It has the speed of container ship - it's slow, but cruising 24/7 - so when you have it with you all the time, it can actually do a lot.

Finally I could change to new VirtualDub (I had older versions on a few machines, so for consistent results I used that old version) and Threading option (parallel encoding) is great, I have some workloads speed up like 60%. Now, if decoding was parallel, too... I have quite a lot in Lagarith and HuffYUV and sometimes it is in quite a big resolution, so decoding takes a lot of time. It could help my friends from one institute, too, AFAIK they are using VirtualDub to process (analyze) a lot of industry camera images while keeping them in some lossless compression format for disk space savings.

BTW, a question about Windows 7: I recently tested new notebook with W7HP x64 (the Acer's 3D model) and installed VirtualDub 1.9.7 on it. Then I tried installing HuffYUV, Lagarith and Panasonic DV codec, and they all failed to show in codec list in VD, but they are in System32 dir. Then I installed DivX 4.12 and it worked. Is there a limitation to codecs (32bit?) on W7 x64? Do I have to install them in some "proper" way?

And last but not least, thanks for VirtualDub, I use it for years now. You're a hero, Phaeron.

Lianna - 02 11 09 - 03:11


Lianna, are you using x64 VirtualDub with the 32bit codecs? 32-bit codecs generally aren't exposed to x64 applications.

Lianna - 02 11 09 - 05:47


Whoops, started typing in the Name field. The above comment should've been named as "Jeff".

Jeff - 02 11 09 - 05:48


Interleaved code, yay!

Another vote for "don't optimize for Atom" from me too - there doesn't seem to be much point for it. I have one of the first eeePCs Asus brought out (the surf 4g), which has an atom underclocked at 700mhz (which I clock back to 900mhz anyway) and after the obligatory trimming down (shut down themes, services, etc etc), most stuff runs surprisingly well. And I figure that the next subnotebook I buy (if that ever happens) will probably blow this one to bits.

So, don't bother :)

ggn - 02 11 09 - 06:41


No, I used 32-bit VirtualDub. What's strange is that DivX works, just HuffYUV, Lagarith and Panasonic DV codecs don't, even though they copied their files to System32.

Lianna - 02 11 09 - 10:29


I honestly don't think there's much point coding for the Atom. Most people realize (or at least are advised very quickly) that an Atom-based machine is not going to equal a normal machine in terms of performance. It's a cheap-ass internet-based machine. It's not meant for coding. If you can enhance performance on that architecture without impacting performance on modern, common cores (Athlon/Phenom or Core 2/Core i), then sure, go for it; otherwise leave Atom to die and let the ultra-low-voltage Core 2s kill it off.

Chris (link) - 02 11 09 - 10:43


Your kung fu is a lot stronger than my kung fu... hats off to you sir!

George Slavov - 02 11 09 - 11:33


I think the compiler is forced to generate the original code because it can't guarantee that the arrays don't alias each other, and it can't reorder the instructions without changing the semantics of the code. You might have better luck explicitly doing 4 loads from src into local variables, then indexing all 4 into the priority table, etc etc. And if the loop is long enough, it might be worth pre-combining colorTable and priTable into a single local array of 256 bytes... that would free up another register for your assembly loop.

Very interesting post, thanks for the info on the Atom!

Stuart - 02 11 09 - 11:35


Since colorTable and priTable are loop invariants, maybe precomputing colorTable[priTable[i]] as a new table can speed up your loop better? (disclaimer: I didn't grok your assembler optimization.)

kumar - 02 11 09 - 14:16


Adobe needs to optimize Flash for Atom ASAP. Maybe you should ask for the job.

john - 02 11 09 - 15:14


Alright, first, the quickie off-topic: One issue that I've heard of on x64 editions of Windows is that some 32-bit codecs don't install properly, because they end up writing 32-bit codec entries into the 64-bit portion of the Registry tree. If you get the codec entries into the SysWOW64 section instead, they should work.

Learning CPU architecture: The book I used in college was "Computer Architecture" by Hennessy and Patterson, and I'd recommend it. It uses MIPS as the main example, but you'll learn a lot of good fundamentals such as address generation stalls and data hazards in superscalar pipelines. It's not quite as good for out-of-order renaming architectures; for that, Agner Fog's literature is the best I've found so far.

More responses later....

Phaeron - 02 11 09 - 16:24


Phaeron: thanks for the response. Added Computer Architecture to my Amazon list. :)

David - 02 11 09 - 23:44


"Well, let's look at the code that Visual C++ 2005 generated for that loop:"
I am not surprised; don't forget that VS2005 predates the Atom, which was released after the release of VS2008.

Yuhong Bao - 08 11 09 - 11:33


"From a performance standpoint it sucks compared to most CPUs, but from a feature set standpoint it's pretty good: it supports up to SSSE3, which puts it at a user-space feature level of the original Core architecture. That means it's capable of running an awful lot of software, albeit slowly."
Yep, some versions even support x64, which would have made this situation much better because of its support for 16 registers and the ability to access the low byte of all of them directly (SIL, DIL, BPL, SPL, R8L, ...). The downside is that AH/BH/CH/DH cannot be specified with a REX prefix.

Yuhong bao - 08 11 09 - 12:21


> I am not surprised; don't forget that VS2005 predates the Atom, which was released after the release of VS2008.

Uh, care to tell which version does differently? I'm not aware of any version of Visual Studio that currently has an optimization mode for Atom.

Phaeron - 08 11 09 - 12:36


"Uh, care to tell which version does differently? I'm not aware of any version of Visual Studio that currently has an optimization mode for Atom."
VS ditched the /G? options long ago, so it's best to try VC2008 SP1, which was released after the Atom.

Yuhong Bao - 08 11 09 - 12:44


I'm asking you, because you decided to comment on it as part of your usual burst of stream of consciousness comments. If you don't mind, I'd rather you didn't treat my comment form as a Twitter feed.

Phaeron - 08 11 09 - 12:56


Luckily, I have VC2008 Express SP1 installed on the Windows side of the dual-boot, so I can try it myself, but I am running the Linux side right now, so this will take some time.

Yuhong Bao - 08 11 09 - 13:25


No better, unfortunately, with VC2008 Express SP1.

Yuhong Bao - 08 11 09 - 13:48
