¶Making a hash of floating point numbers
I've always thought that hash tables were well named, because often when you see how people have used them you wonder what they were smoking at the time. Often the problem revolves around a mistaken notion that switching a binary search tree for a hashed container bestows some sort of magical constant time lookup behavior, but sometimes it's more subtle. One case has to deal with the choice of hash function.
The hash function for a hashed container converts a key to a bucket index with the intent of trying to distribute data items as evenly as possible. Given a decent distribution for input values, the hash function for an integral key can be as simple as just using the integer value itself, with the container then applying a modulus operation to wrap it within the bucket count. Anyone who's gone down this route, however, then discovers the problem of trying to do this for a key that is of floating point type. Usually the first thing they try is something like this:
size_t operator()(float value) const {
return (size_t)(value * 100);
}
This is unfortunately usually fairly slow due to poor performance in the float-to-int conversion. There's also the matter of slightly worse behavior around zero due to truncation toward zero instead of negative infinity.
At this point, the inclination is probably to just give up and either deal with it or use a different container. Others go "aha!" and use this hack instead:
size_t operator()(float value) const {
return (size_t)*(const unsigned int *)&value;
}
This code uses the bit pattern of the float as the hash value. Yeah, it's non-portable. It's also got problems with the aliasing rules of the C language. In the not so unusual case of being able to depend on a 32-bit integral type and IEEE single precision floating point, though, it's a really neat and fast trick. And, sadly, it's also wrong. If you've done this or thought about it, don't feel bad. The .NET Framework team almost made this mistake, too.
Reason after the jump.
Negative zero is the problem. For IEEE floats, positive zero and negative zero differ by only the sign bit in bit pattern, but compare equal as floats. Using the bit pattern as the hash code therefore results in an inconsistency in the hashed container where two values compare equal but have different hash codes. I've also seen people burned by this when trying to compare floats quickly with integer compares (-0 < +0) and when implementing fast float-to-int functions (floorToInt(-0) = -1).