## Why not just go to floating point?

I've been thinking of putting together a new desktop machine -- to replace the ancient Socket 754 AMD64 machine that is currently serving as a door stop -- and most likely it'd be Sandy Bridge based. The nice thing about this is that I'd then be able to experiment with Advanced Vector Extensions (AVX). Currently my main machine is a laptop with a Core i7, so the highest CPU instruction set I have available is SSE4.2. Of course, when I actually looked at AVX again, I found out to my disappointment that it's floating-point only, like the original SSE was, and the AVX2 integer version won't arrive until a future chip, which pretty much torpedoed most of the ideas I had for using it.

Why not just switch to floating point?

Well, the main reason is that it would nuke the benefit of trying to use AVX in the first place, which is higher parallelism. AVX uses 256-bit vectors instead of 128-bit vectors, so it can process twice the number of elements per operation and thus, ideally, get double the throughput. However, most of the data I work with is in bytes, so going to 32-bit floats means each vector holds a quarter as many elements, dividing throughput by four. Multiplying by two and dividing by four doesn't work in your favor. Then there are other reasons:
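To put numbers on that, here's a trivial sketch of the lane arithmetic. The `lanes` helper is a made-up name for illustration, and it only counts elements per register; real throughput also depends on how many of each operation the chip can issue per clock.

```c
#include <assert.h>

/* Number of elements that fit in one vector register.
   (Hypothetical helper, purely for illustration.) */
static unsigned lanes(unsigned vector_bits, unsigned element_bits) {
    assert(element_bits != 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}
```

`lanes(128, 8)` gives 16 bytes per SSE register, while `lanes(256, 32)` gives only 8 floats per AVX register, so the wider vector still handles half as many pixels' worth of data per operation.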

- There generally isn't much flexibility in conversion to and from narrow integer formats. SSE, for instance, only really wants to convert floats to/from signed 32-bit integers, and anything else is slow to deal with.
- Floating-point operations often have higher latency.
- It takes more memory and thus more memory bandwidth.
- It's harder to safely manipulate addresses in floating-point math. (Not impossible, but the ice is thinner.)
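- You can't use algorithms that require addition and subtraction to be associative, like a moving average that adds the incoming sample and subtracts the outgoing one. (Floating-point addition is commutative, but rounding means it isn't associative, so the running sum drifts.)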
- No free saturate on add/subtract operations.
- No cheap average operation. (Hey, it's very important for some applications!)
- You have to worry about NaN disease.
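The moving average point is probably the least obvious one in that list, so here's a minimal sketch of it in scalar C. Both function names are hypothetical, and the code only maintains the window sum (divide by the window size to get the average). The integer version stays exact because unsigned arithmetic is associative even when it wraps; the float version rounds on every add and subtract.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of the last `window` samples, maintained incrementally:
   add the incoming sample, subtract the one that just left the window. */
static uint32_t window_sum_int(const uint32_t *x, size_t n, size_t window) {
    assert(window > 0 && window <= n);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
        if (i >= window)
            sum -= x[i - window];
    }
    return sum;
}

/* Same algorithm with a float accumulator. Each += and -= rounds,
   so whatever was lost while a large sample was in the window never comes back. */
static float window_sum_float(const float *x, size_t n, size_t window) {
    assert(window > 0 && window <= n);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
        if (i >= window)
            sum -= x[i - window];
    }
    return sum;
}
```

Feed both the samples 100000000, 1, 1, 1, 1 with a window of 2. The integer version ends at 2, as it should. With plain IEEE single-precision arithmetic the float version swallows the 1s while the big sample is in the window, and it stays wrong after that sample has left.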

It's definitely not just a question of switching to vector float types. That isn't to say there aren't advantages to going FP, of course:

- Vector divide and square root. (I once thought about implementing Photoshop's Soft Light blending mode in fixed point, but I took one look at the blending equation and said screw it, floats it is.)
- Automatic decent rounding on every operation. Managing error and rounding is a headache in most fixed-point routines.
- Gradual degradation with extreme inputs, instead of catastrophic wrapping or clamping.
- Generally easier and more straightforward implementation of algorithms.
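As an illustration of the rounding headache, here's a sketch of multiplying two 8-bit values that represent fractions of 255, which is the bread and butter of blending code. The function names are made up, and the fixed-point version uses a widely known shift-and-add idiom for a rounded divide by 255, not anything specific to my code.

```c
#include <stdint.h>

/* Naive fixed point: shift instead of dividing by 255, no rounding. */
static uint8_t mul8_trunc(uint8_t a, uint8_t b) {
    return (uint8_t)(((uint32_t)a * b) >> 8);
}

/* Fixed point with explicit rounding: add a bias of 128, then fold the
   high byte back in so the pair of shifts acts as a rounded divide by 255. */
static uint8_t mul8_round(uint8_t a, uint8_t b) {
    uint32_t t = (uint32_t)a * b + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Float version: the divide rounds by itself; the + 0.5f is only there
   for the final truncating conversion back to an integer. */
static uint8_t mul8_float(uint8_t a, uint8_t b) {
    return (uint8_t)((float)a * (float)b / 255.0f + 0.5f);
}

/* Counts the (a, b) pairs where the rounded fixed-point and float results differ. */
static unsigned count_mismatches(void) {
    unsigned bad = 0;
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 0; b < 256; ++b)
            if (mul8_round((uint8_t)a, (uint8_t)b) != mul8_float((uint8_t)a, (uint8_t)b))
                ++bad;
    return bad;
}
```

The naive version turns white times white (255 × 255) into 254, which is exactly the kind of error you have to design out by hand in fixed point. The float version is the one you'd write without thinking about it.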

AVX does appear to have some niceties for integer routines, like 3-argument syntax, but truth be told, I haven't had too many problems with excess register moves lately. It's a bit of a bummer to go from "yeah, this would probably run much faster with 256-bit vectors" to "hmm, I'd have to convert this to floats and then it would probably run slower." :-/