Why not just go to floating point?
I've been thinking of putting together a new desktop machine -- to replace the ancient Socket 754 AMD64 machine that is currently serving as a door stop -- and most likely it'd be Sandy Bridge based. The nice thing about this is that I'd then be able to experiment with Advanced Vector Extensions (AVX). Currently my main machine is a laptop with a Core i7, so the highest CPU insn set I have available is SSE 4.2. Of course, when I actually looked at AVX again, I found out to my disappointment that it's floating point only like SSE was, and the AVX2 integer version won't arrive until a future chip, which pretty much torpedoed most of the ideas I had for using it.
Why not just switch to floating point?
Well, the main reason is that it would nuke the benefit of trying to use AVX in the first place, which is higher parallelism. AVX uses 256-bit vectors instead of 128-bit vectors, so it can process twice the number of elements per operation and thus get double the throughput. However, most of the data I work with is in bytes, so going to 32-bit floats means dividing throughput by four. Multiplying by two and dividing by four doesn't work in your favor. Then there are other reasons:
- There generally isn't much flexibility in conversion to and from narrow integer formats. SSE, for instance, only really wants to convert floats to/from signed 32-bit integers, and anything else is slow to deal with.
- Floating-point operations often have higher latency.
- It takes more memory and thus more memory bandwidth.
- It's harder to safely manipulate addresses in floating-point math. (Not impossible, but the ice is thinner.)
- You can't use algorithms that require addition and subtraction to cancel exactly, like a moving average (where each step adds the newest sample and subtracts the oldest).
- No free saturate on add/subtract operations.
- No cheap average operation. (Hey, it's very important for some applications!)
- You have to worry about NaN disease.
It's definitely not just a question of switching to vector float types. That isn't to say there aren't advantages to going FP, of course:
- Vector divide and square root. (I once thought about implementing Photoshop's Soft Light blending mode in fixed point, but I took one look at the blending equation and said screw it, floats it is.)
- Automatic decent rounding on every operation. Managing error and rounding is a headache in most fixed point routines.
- Gradual degradation with extreme inputs, instead of catastrophic wrapping or clamping.
- Generally easier and more straightforward implementation of algorithms.
AVX does appear to have some niceties for integer routines, like the 3-argument syntax, but truth be told, I haven't had too many problems with excess register moves lately. It's a bit of a bummer to go from "yeah, this would probably run much faster with 256-bit vectors" to "hmm, I'd have to convert this to floats and then it would probably run slower." :-/
Sounds like the inventor of the product does not use it himself.
tobi - 08 07 11 - 22:43
AMD has some new integer instructions with Bulldozer:
Rumbah - 09 07 11 - 01:39
"going to 32-bit floats means dividing throughput by four" - AVX has 256-bit registers, but as far as I know, byte vector arithmetic instructions are limited to 128-bit half (16 bytes only). So you are dividing your throughput by two.
Paul Jurczak - 09 07 11 - 05:52
> AMD has some new integer instructions with Bulldozer:
The SSE5/XOP instructions look far more interesting for what I do than AVX/AVX2, particularly since many operations take bytes or words. Unfortunately, being AMD only pretty much dooms them to obscurity. The same thing happened with 3DNow!, which had some useful instructions that Intel never replicated (pmulhrw, pi2fd, pf2id). :(
We also don't know how fast they will be. It looks like they haven't been extended to 256-bit, which already puts them at a throughput disadvantage. If they end up having too high a latency or too low a throughput, AVX2 might win when it comes out. Intel ran into this same problem with SSE 4.1; I'm told that a bunch of the new instructions, like the unpacked moves, aren't any faster than the old ways.
> "going to 32-bit floats means dividing throughput by four" - AVX has 256-bit registers, but as far as I know, byte vector arithmetic instructions are limited to 128-bit half (16 bytes only). So you are dividing your throughput by two.
You're jumping to that conclusion prematurely. Converting to floats by itself gives quarter throughput at the same register size.
Phaeron - 09 07 11 - 07:43
Avery, you should wait for Ivy Bridge CPUs if you want to use integer SIMD with the AVX vector size. They are not that far away.
Regarding 1/4 of throughput in going from byte to float -- that is based on the assumption that you already had the ability to process 16-byte vectors without losing throughput which I sincerely doubt is possible.
Igor Levicki (link) - 10 07 11 - 08:31
> Regarding 1/4 of throughput in going from byte to float -- that is based on the assumption that you already had the ability to process 16-byte vectors without losing throughput which I sincerely doubt is possible.
Why do you say that? If anything it's actually easier to keep calculations moving with the integer vectors as there are fewer scheduling bottlenecks.
Phaeron - 10 07 11 - 08:45
Just checked roadmaps... if I'm not mistaken, AVX2 isn't due to come out with Ivy Bridge, but the rev after that (Haswell).
Phaeron - 10 07 11 - 08:53
For what I work with, AVX has one significant advantage - porting SSE intrinsics code is as simple as recompiling with AVX support to get the boost from the three-operand instructions.
Of course, using the full 256-bit registers will require rewriting code, but it is nice to get a small speed boost just from recompiling.
I find that when working with floats, having min/max functions available makes things significantly easier. Also, multiplication may not always be easily doable in integer, since some of the combinations are only available with SSSE3 or SSE4.1.
Klaus Post (link) - 10 07 11 - 20:00
Anyone who tries to update his code to AVX, thinking it might get faster somehow, is going to see how unusable the instruction set turned out to be. Without the integer "promotion", as they call it, there are walls everywhere.
For image processing, floats are useless anyway; 16-bit integer is all you need. SSE2 can process 8 color components at once, so there's hardly any reason to do that with AVX and floats.
There is also a bug in VS2010 if you use the intrinsics:
Gabest - 13 07 11 - 11:11
Bleargh... I just looked at the AVX intrinsics. The existing intrinsics are bad enough to use, but with new ones like _mm256_castps128_ps256 they've managed to make intrinsics-based code even uglier. It's sad when the asm is more readable. :(
Floats aren't *quite* useless -- like I said earlier, you're going to have a hard time writing a decently accurate and performant fixed point version of an algorithm that has divides and square roots in it. At least in MMX I usually ended up going through a lookup table stage to do the divide and took a hit in accuracy. I suppose some neat lookup tricks might be possible in SSSE3... PSHUFB is kind of useful.
Phaeron - 13 07 11 - 15:53