§ CPUs and floating-point math

(Now playing: Rumbling Hearts, Kimi Ga Nozomu Eien game OST)

Some questions were asked in comments in the previous article, and I decided it would be easier to answer here instead of in comments. The two questions were about the origin of the Intel Pentium-M and about AMD's chips and 3DNow! in general.

I get the feeling that I should probably be posting links to reputable hardware sites instead of pulling this info out of my (null), but it's either this or I lament about how I wish I had eight hours a day and infinite patience to try Final Fantasy XI Online again. Let me know if I've made dumb errors in the writeup.

About the Pentium-M:

IIRC, the original Pentium-M design (Banias) was not acquired by Intel; it was created by an Intel design team in Israel based on the Pentium III design. Among the improvements were slowing down parts of the chip that were faster than they needed to be, which saved gates and power, and a quadrant scheme for a power-efficient L2 cache. Banias also got SSE2 support and an upgraded decoder with micro-ops fusion, which as I understand it means that the D1 and D2 decoders, which could previously only decode 1 uop instructions, can now decode load+store ops. As I said, I don't have a Pentium-M to play with, but everything I've seen so far indicates that the P-M team is kicking major @&(#$*, which explains why they are being given the reins from the Pentium 4 team. Props also have to be given to the original Pentium Pro designers, whose basic design still lives on!

About AMD chips:

I completely skipped the Athlon and Athlon XP series of chips; they looked interesting, but what really put me off was the bad series of support chipsets, most notably the VIA north/southbridges. Stability problems are hellish to diagnose and the last thing I needed or wanted was to put up with hardware conflicts. Also, I've never been much of a high-end gamer and the 3D card I had at the time was an NVIDIA TNT2 Ultra, so I didn't really need the CPU speed anyway.

I did used to have an AMD K6 233 (that was underclocked to 200MHz) in my Linux server. Tough chip. One day the server started crashing, and I thought it was my packet shaper changes, so I tried recompiling them out of the kernel, and the sucker kept sig11ing in gcc. I ran memtest86 for a while to no avail, until I realized the top of the case was rather warm... and found out the CPU fan had stopped turning. Whoops. A new CPU fan later, the system still works to this day.

Now, about floating point and 3DNow!:

(The reference for this section is Paul Hsieh's 6th generation x86 CPU Comparisons.)

Roll back in time to the days of around 300MHz. There was no question that in the x86 world, Intel was blowing everyone else away in FPU performance. The Pentium and Pentium II FPUs were fully pipelined and could pump out results at a rate of one per clock; K6 was further back at one per two clocks, and Cyrix was a distant third with somewhere between 4 and 8 clocks per result. Part of the problem was the annoying x87 stack architecture, which required (requires) you to write optimized code like this:

    fld   x0          ;x0
    fld   y0          ;y0 x0
    fld   z0          ;z0 y0 x0
    fld   w0          ;w0 z0 y0 x0
    fld   mat00       ;mat00 w0 z0 y0 x0
    fmul  st, st(4)   ;(mat00*x0) w0 z0 y0 x0
    fld   mat01       ;mat01 (mat00*x0) w0 z0 y0 x0
    fmul  st, st(4)   ;(mat01*y0) (mat00*x0) w0 z0 y0 x0
    fld   mat02       ;mat02 (mat01*y0) (mat00*x0) w0 z0 y0 x0
    fmul  st, st(4)   ;(mat02*z0) (mat01*y0) (mat00*x0) w0 z0 y0 x0
    fld   mat03       ;mat03 (mat02*z0) (mat01*y0) (mat00*x0) w0 z0 y0 x0
    fmul  st, st(4)   ;(mat03*w0) (mat02*z0) (mat01*y0) (mat00*x0) w0 z0 y0 x0
    fxch  st(2)       ;(mat01*y0) (mat02*z0) (mat03*w0) (mat00*x0) w0 z0 y0 x0
    faddp st(3), st   ;(mat02*z0) (mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
    fadd              ;(mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
    fld   mat10       ;mat10 (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
    fmul  st, st(6)   ;(mat10*x0) (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
    fld   mat11       ;mat11 (mat10*x0) (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
    fmul  st, st(6)   ;(mat11*y0) (mat10*x0) (mat02*z0+mat03*w0) (mat00*x0+mat01*y0) w0 z0 y0 x0
    fxch  st(2)       ;(mat02*z0+mat03*w0) (mat10*x0) (mat11*y0) (mat00*x0+mat01*y0) w0 z0 y0 x0
    faddp st(3), st   ;(mat10*x0) (mat11*y0) x_result w0 z0 y0 x0
    fld   mat12       ;mat12 (mat10*x0) (mat11*y0) x_result w0 z0 y0 x0
    fmul  st, st(5)   ;(mat12*z0) (mat10*x0) (mat11*y0) x_result w0 z0 y0 x0
    fxch  st(2)       ;(mat11*y0) (mat10*x0) (mat12*z0) x_result w0 z0 y0 x0
    fadd              ;(mat10*x0+mat11*y0) (mat12*z0) x_result w0 z0 y0 x0
    fld   mat13       ;mat13 (mat10*x0+mat11*y0) (mat12*z0) x_result w0 z0 y0 x0
    fmul  st, st(4)   ;(mat13*w0) (mat10*x0+mat11*y0) (mat12*z0) x_result w0 z0 y0 x0
    fxch  st(3)       ;x_result (mat10*x0+mat11*y0) (mat12*z0) (mat13*w0) w0 z0 y0 x0
    fstp  x_result    ;(mat10*x0+mat11*y0) (mat12*z0) (mat13*w0) w0 z0 y0 x0

You're probably cringing upon seeing this, and rightfully so -- the stack-oriented code was difficult for CPUs to execute and for compilers to schedule, and error-prone to write by hand.
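
For comparison, here is a rough sketch in plain C of the computation the x87 listing is working through. It assumes mat00 through mat33 are the entries of a 4x4 matrix stored row by row and that x0, y0, z0 and w0 form the input vector; the function name and the flat array layout are made up for illustration and are not taken from the original code. The additions are parenthesized to pair up the same way the listing does for its first row.

```c
#include <stddef.h>

/* Multiply a 4x4 matrix (assumed row-major, 16 floats) by a 4-component
   vector. transform_vec4 is a hypothetical name used only for this sketch. */
void transform_vec4(const float mat[16], const float in[4], float out[4])
{
    for (size_t row = 0; row < 4; ++row) {
        const float *m = &mat[row * 4];

        /* Same pairing as the x87 listing: (m0*x + m1*y) + (m2*z + m3*w). */
        out[row] = (m[0] * in[0] + m[1] * in[1])
                 + (m[2] * in[2] + m[3] * in[3]);
    }
}
```

Four lines of arithmetic in C turn into dozens of carefully ordered loads, multiplies and exchanges once the stack has to be juggled by hand.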

Enter 3DNow!.

3DNow! was first available on AMD K6-2 CPUs and overlaid 2-vector, single-precision floating-point operations on top of the MMX registers. This had several advantages:

* No stupid stack, just flat registers.
* One-clock throughput instead of two, *AND* two results at a time, for a 4x peak improvement.
* Really fast reciprocal and reciprocal square root approximations, and horizontal add instructions.
* Mixed vector integer (MMX) and floating-point (3DNow!) instructions. Hello, software texture mapping and goofy float bit pattern tricks!
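
To make the "two results at a time" point a bit more concrete, here is a scalar C model of how flat 2-vector registers could handle two matrix rows at once. The vec2f type and the transform_two_rows function are illustrative names only; pfmul, pfadd and pfacc are named after the 3DNow! instructions they are meant to mimic, but this is ordinary C and not actual 3DNow! code.

```c
/* Scalar model of one 64-bit register holding two single-precision floats. */
typedef struct { float lo, hi; } vec2f;

/* Models PFMUL: element-wise multiply. */
static vec2f pfmul(vec2f a, vec2f b)
{
    vec2f r = { a.lo * b.lo, a.hi * b.hi };
    return r;
}

/* Models PFADD: element-wise add. */
static vec2f pfadd(vec2f a, vec2f b)
{
    vec2f r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Models PFACC (horizontal add): the low result is the sum of dst's two
   halves, the high result is the sum of src's two halves. */
static vec2f pfacc(vec2f dst, vec2f src)
{
    vec2f r = { dst.lo + dst.hi, src.lo + src.hi };
    return r;
}

/* Dot two matrix rows with the same 4-vector; returns (row0.v, row1.v). */
vec2f transform_two_rows(const float row0[4], const float row1[4],
                         const float v[4])
{
    vec2f v_xy  = { v[0], v[1] },       v_zw  = { v[2], v[3] };
    vec2f r0_xy = { row0[0], row0[1] }, r0_zw = { row0[2], row0[3] };
    vec2f r1_xy = { row1[0], row1[1] }, r1_zw = { row1[2], row1[3] };

    /* Each partial holds (m0*x + m2*z, m1*y + m3*w) for one row. */
    vec2f p0 = pfadd(pfmul(r0_xy, v_xy), pfmul(r0_zw, v_zw));
    vec2f p1 = pfadd(pfmul(r1_xy, v_xy), pfmul(r1_zw, v_zw));

    /* One horizontal add finishes both dot products. */
    return pfacc(p0, p1);
}
```

There is no stack bookkeeping anywhere: every value has a name and stays where it was put.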

This meant that a K6-2 could now outrun the Pentium II in floating-point performance, given appropriately optimized code. There was just one big problem.

Intel CPUs didn't support 3DNow!.

Now, I own an Athlon 64 and admire the performance of AMD architectures, but let's be clear. There were, and are, a lot more Intel CPUs out there than AMD CPUs. This meant that not only was there the problem of adoption lag -- it took a while for MMX and SSE to be used, as well -- but most of the CPUs out there didn't have 3DNow! at all. This, combined with the fact that using 3DNow! meant writing yet another CPU-specific path (there were no compiler intrinsics for it at the time), didn't bode well for its adoption. Also, much like SSE, MMX held a 4:1 throughput advantage over 3DNow!, so it wasn't worth using if you didn't need floating-point range or accuracy.

I've written some sound code in 3DNow! before, and it's pretty nice -- the 2-vector form is just right for stereo audio. The main problem is that you tend to run out of registers pretty quickly because audio filters are generally longer than video filters. With some creative register juggling you can compute a 12-point IIR filter entirely in SSE registers, but there just isn't enough register space with 3DNow!.
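
As a small illustration of why the 2-vector form suits stereo, here is a sketch in plain C of a one-pole IIR low-pass run over interleaved stereo frames, where each left/right pair is exactly one 2-vector. This is not the sound code mentioned above; the stereo_frame type, the function name and the choice of filter are all assumptions made for the example.

```c
#include <stddef.h>

/* One interleaved stereo sample pair; maps naturally onto a 2-vector. */
typedef struct { float left, right; } stereo_frame;

/* One-pole low-pass, y[n] = y[n-1] + coeff * (x[n] - y[n-1]), applied in
   place to both channels. 'state' carries y[n-1] between calls.
   Hypothetical helper for illustration only. */
void one_pole_lowpass_stereo(stereo_frame *frames, size_t count,
                             float coeff, stereo_frame *state)
{
    stereo_frame y = *state;

    for (size_t i = 0; i < count; ++i) {
        /* Both channels do identical math, so a packed 2-vector
           subtract, multiply and add would cover this whole body. */
        y.left  += coeff * (frames[i].left  - y.left);
        y.right += coeff * (frames[i].right - y.right);
        frames[i] = y;
    }

    *state = y;
}
```

A filter this short needs only a couple of registers for state and coefficients. Longer filters need one register's worth of history per tap, which is where the register pressure comes from.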

AMD eventually got tired of their chip sucking at 3D and came out with the Athlon, which, unlike the K6, was a monster in floating-point: instead of one result per two cycles, the Athlon could produce two results per cycle... on scalar x87 code. I suspect that at this point the temptation to use 3DNow! simply drained away, because it was a lot easier to use the Athlon's muscle on the x87 unit, where you didn't have to worry about CPU-specific code or FPU/MMX register file switches.

Floating-point support is still quite annoying in the x86 CPU world due to the various mismatches. The Intel Pentium III supports SSE but not 3DNow!, much to the annoyance of AMD. AMD's original Athlon is available up to 1.4GHz and supports 3DNow! but not SSE, much to the annoyance of Intel. The Pentium 4 supports SSE but is slower at scalar SSE operations than x87, much to the annoyance of everyone else. As a result most programs simply use standard x87 and don't use the optimized FP instructions of any CPU.
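
To show what "CPU-specific paths" cost in practice, here is a minimal sketch in C of the kind of dispatch a program ends up carrying around. The feature flags, function names and the scaling routine are all hypothetical, and the "optimized" variants are stand-ins that just call the generic code; a real program would detect features via CPUID and supply hand-written routines for each path.

```c
#include <stddef.h>

/* Hypothetical feature bits; a real program would fill these from CPUID. */
enum { CPU_HAS_SSE = 1 << 0, CPU_HAS_3DNOW = 1 << 1 };

typedef void (*scale_fn)(float *data, size_t count, float gain);

/* Portable path: works everywhere, uses whatever the compiler emits. */
static void scale_generic(float *data, size_t count, float gain)
{
    for (size_t i = 0; i < count; ++i)
        data[i] *= gain;
}

/* Stand-ins for hand-optimized paths. Each one is extra code to write,
   test and keep in sync with the others. */
static void scale_sse(float *data, size_t count, float gain)
{
    scale_generic(data, count, gain);
}

static void scale_3dnow(float *data, size_t count, float gain)
{
    scale_generic(data, count, gain);
}

/* Pick the best available routine once, at startup. */
scale_fn select_scale(unsigned cpu_flags)
{
    if (cpu_flags & CPU_HAS_SSE)
        return scale_sse;
    if (cpu_flags & CPU_HAS_3DNOW)
        return scale_3dnow;
    return scale_generic;
}
```

Every routine worth optimizing needs its own set of variants like this, which goes a long way toward explaining why sticking with x87 was the path of least resistance.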

The situation is a lot cleaner on AMD64 (x64), where the Intel Xeons with EM64T and the Athlon 64 both support SSE/SSE2, which is the standard for floating-point on that platform, and Microsoft has effectively banned the use of MMX and x87 by threatening not to save/restore the FPU register file.


Comments posted:

This leads me to a question: What CPU class should a program like VirtualDub be optimized primarily for? New CPUs are powerful enough, and very old ones are already rare. Are the PIII and Athlon-before-XP the right candidates to optimize programs for? Note ffdshow: there are now noSSE/SSE/SSE2 optimisations available, allowing PII owners to play DivX on their slow machines. Priceless!!!

TomK - 02 11 04 - 03:32

Will VirtualDub ever be able to handle MPEG-2 files? I need software that will strip audio from MPEG-2, convert it to MP3 and put it back to make the file smaller. VirtualDub does that well for MPEG-1 files.

Donn - 04 11 04 - 21:27

It is too dangerous to include mp2 capabilities in VirtualDub, because it can be sued and shut down for enabling piracy. Search for something called VirtualDubMod, which basically takes the latest version of VirtualDub and adds the controversial features back into it.

Pierce - 05 11 04 - 18:54


1: Be careful: "MP2" means "MPEG-1 layer 2", not "MPEG-2".

2: You, uh, "pulled that out of your (null)". :) The "risk" of MPEG-2 support has absolutely nothing to do with piracy; it's simply the fact that some parts of MPEG-2 are under actively-enforced patents.

Glenn Maynard - 07 11 04 - 03:17

Er, the comment formatting is confusing. s/Donn/Pierce/ above.

Glenn Maynard - 07 11 04 - 03:18

"I suspect that at this point the temptation to use 3DNow! simply drained away, because it was a lot easier to use the Athlon's muscle on the x87 unit, where you didn't have to worry about CPU-specific code or FPU/MMX register file switches."
There was still the advantage of not having to do MMX/x87 register file switches, however.

Yuhong Bao - 07 09 08 - 00:27
