Bit patterns of `float` – Arthur O'Dwyer – Stuff mostly about C++

Bit patterns of float

I’ve spent too many hours repeatedly trying to find this information on the Web. Time to write it down.

On the vast majority of computers, float is 32 bits, and it follows the IEEE 754 standard (also known historically as IEC 559). If you look at reinterpret_cast<int&>(myfloat) — or, in C++20, std::bit_cast<int>(myfloat) — you’ll find that the bits go in this order:

Field:  Sign | Biased exponent | Mantissa
Bits:   1    | 8               | 23

For example, 314.0f reinterprets into int(0x439d0000):

01000011100111010000000000000000

The biased exponent bits are 10000111, or 127+8. The mantissa bits are 00111010000000000000000. Stick the implicit leading 1 on the front and we get $1.0011101_2 \times 2^{8}$, or $100111010_2$, which is binary for 314.
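Here is a minimal sketch of that decomposition, assuming a 32-bit IEEE 754 float. The names `FloatFields`, `bits_of`, and `decompose` are hypothetical, introduced only for illustration. The sketch uses std::memcpy so that it also compiles before C++20; std::bit_cast would do the same job.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type for this sketch (not from any library).
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

// Copy the float's bytes into a 32-bit unsigned integer.
// In C++20, std::bit_cast<std::uint32_t>(f) is equivalent.
inline std::uint32_t bits_of(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Split the 32 bits into the three fields: 1 + 8 + 23.
inline FloatFields decompose(float f) {
    const std::uint32_t u = bits_of(f);
    return FloatFields{u >> 31, (u >> 23) & 0xFFu, u & 0x7FFFFFu};
}
```

For 314.0f, `decompose` should yield sign 0, biased exponent 135 (that is, 127+8), and mantissa 0x1d0000.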

Notice that this is independent of byte endianness, as long as your computer uses the same endianness for both ints and floats. If you treat 314.0f as an array of bytes, you might find that it’s 43 9d 00 00 on a big-endian machine and 00 00 9d 43 on a little-endian machine; but the float’s sign bit will always correspond to the int’s sign bit, and so on.
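To see the endianness point concretely, here is a small sketch that compares the in-memory bytes of a float with those of a 32-bit integer. It assumes a 32-bit float and a machine that uses the same byte order for both types; `Bytes`, `bytes_of_float`, `bytes_of_u32`, and `same_layout` are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using Bytes = std::array<unsigned char, 4>;  // hypothetical alias for this sketch

// Raw bytes of a float, in memory order.
inline Bytes bytes_of_float(float f) {
    static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");
    Bytes b{};
    std::memcpy(b.data(), &f, b.size());
    return b;
}

// Raw bytes of a 32-bit unsigned integer, in memory order.
inline Bytes bytes_of_u32(std::uint32_t u) {
    Bytes b{};
    std::memcpy(b.data(), &u, b.size());
    return b;
}

// True when the float and the integer occupy memory identically.
inline bool same_layout(float f, std::uint32_t u) {
    return bytes_of_float(f) == bytes_of_u32(u);
}
```

On either kind of machine, `same_layout(314.0f, 0x439d0000u)` should hold, even though the byte sequence itself is 43 9d 00 00 in one case and 00 00 9d 43 in the other.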

Now for the parts I always have trouble finding on the Web.

Zero is all-bits-zero. Its biased exponent is 00000000, or 127-127.

00000000000000000000000000000000

Any non-zero number with an all-bits-zero biased exponent is “denormal” or “subnormal”; its mantissa does not have an implicit leading 1 bit, and its effective exponent “sticks” at $2^{-126}$. Here are three denormals. The middle one’s value is $0.1_2 \times 2^{-126}$, or approximately 5.88e-39.

00000000000000000000000000000001
00000000010000000000000000000000
00000000011111111111111111111111
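The denormal rule can be checked with a short sketch: with no implicit leading 1 and the exponent stuck at −126, the value is the 23-bit mantissa, read as an integer, times $2^{-149}$. This assumes a 32-bit IEEE 754 float and that denormals are not flushed to zero (no fast-math style flags); `float_from_bits`, `subnormal_value`, and `is_subnormal_pattern` are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret 32 bits as a float (std::bit_cast<float>(u) in C++20).
inline float float_from_bits(std::uint32_t u) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Value of a denormal: 0.mantissa (binary) times 2^-126,
// i.e. the mantissa as an integer scaled by 2^-(126+23).
inline float subnormal_value(std::uint32_t mantissa) {
    return std::ldexp(static_cast<float>(mantissa), -126 - 23);
}

// A bit pattern is denormal when the exponent field is zero
// and the mantissa field is not.
inline bool is_subnormal_pattern(std::uint32_t u) {
    return ((u >> 23) & 0xFFu) == 0u && (u & 0x7FFFFFu) != 0u;
}
```

For the middle pattern, `float_from_bits(0x00400000u)` should equal `subnormal_value(0x400000u)`, which is $2^{-127}$.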

FLT_MIN, a.k.a. std::numeric_limits<float>::min(), is approximately 1.18e-38. Its biased exponent is 00000001, or 127-126, and its mantissa is all-bits-zero, so its value is $1.0_2 \times 2^{-126}$.

00000000100000000000000000000000

Next come all the “normal” numbers. For example, the value 1.0 is represented as $1.0_2 \times 2^{127-127}$, for a bit pattern of 3f800000:

00111111100000000000000000000000

and 2.0 is represented as $1.0_2 \times 2^{128-127}$, for a bit pattern of 40000000:

01000000000000000000000000000000
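As a sketch of how a normal number's value follows from its fields: prepend the implicit 1 to the 23-bit mantissa to get a 24-bit integer, then scale by 2 to the power (biased exponent − 127 − 23). The function name `normal_value` is hypothetical, and the sketch assumes the biased exponent is in the normal range 1 through 254.

```cpp
#include <cmath>
#include <cstdint>

// Rebuild a normal float's value from its exponent and mantissa fields.
// Assumes 1 <= biased_exponent <= 254 and mantissa < 2^23.
inline float normal_value(std::uint32_t biased_exponent, std::uint32_t mantissa) {
    // Prepend the implicit leading 1: a 24-bit integer, exactly representable as a float.
    const std::uint32_t significand = 0x800000u | mantissa;
    // Remove the bias (127) and account for the 23 fractional mantissa bits.
    const int shift = static_cast<int>(biased_exponent) - 127 - 23;
    return std::ldexp(static_cast<float>(significand), shift);
}
```

With these definitions, `normal_value(127, 0)` should be 1.0f, `normal_value(128, 0)` should be 2.0f, and `normal_value(135, 0x1d0000)` should be 314.0f.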

FLT_MAX, a.k.a. std::numeric_limits<float>::max(), is approximately 3.4e+38. Its biased exponent is 11111110, or 127+127.

01111111011111111111111111111111

HUGE_VALF, a.k.a. std::numeric_limits<float>::infinity(), looks like this. Its biased exponent is all-bits-one, and its mantissa is all-bits-zero.

01111111100000000000000000000000
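The landmark values so far can be cross-checked against std::numeric_limits with a sketch like the following, assuming a 32-bit IEEE 754 float. `bits_of` and `landmarks_match` are hypothetical names. The check of denorm_min() is my addition: it is the smallest positive denormal, the first of the three denormal patterns shown earlier.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Copy the float's bytes into a 32-bit unsigned integer
// (std::bit_cast<std::uint32_t>(f) in C++20).
inline std::uint32_t bits_of(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Returns true when every landmark has the expected bit pattern.
inline bool landmarks_match() {
    using L = std::numeric_limits<float>;
    return bits_of(0.0f)            == 0x00000000u   // zero
        && bits_of(L::denorm_min()) == 0x00000001u   // smallest denormal (assumption: denormals supported)
        && bits_of(L::min())        == 0x00800000u   // FLT_MIN
        && bits_of(1.0f)            == 0x3f800000u
        && bits_of(2.0f)            == 0x40000000u
        && bits_of(L::max())        == 0x7f7fffffu   // FLT_MAX
        && bits_of(L::infinity())   == 0x7f800000u;  // HUGE_VALF
}
```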

The following three bit-patterns are all signaling NaNs. std::numeric_limits<float>::signaling_NaN() is the middle one. A signaling NaN’s biased exponent is all-bits-one and its mantissa’s top bit is 0. The remaining 22 mantissa bits are “payload.” They can be anything except all-bits-zero (because if the mantissa were all-bits-zero, it’d be HUGE_VALF instead).

01111111100000000000000000000001
01111111101000000000000000000000
01111111101111111111111111111111

The following two bit-patterns are both quiet NaNs. NAN, a.k.a. std::numeric_limits<float>::quiet_NaN(), is the first one. A quiet NaN’s biased exponent is all-bits-one and its mantissa’s top bit is 1. The remaining 22 mantissa bits are “payload.” They can be anything.

01111111110000000000000000000000
01111111111111111111111111111111
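The signaling-versus-quiet rule can be written as a classifier over raw bit patterns. This is a sketch under the convention described above (mantissa top bit 1 means quiet); `NanKind` and `classify_nan` are hypothetical names. It works on integers rather than floats on purpose, since some platforms may quiet a signaling NaN when it passes through a floating-point register.

```cpp
#include <cstdint>

enum class NanKind { NotNan, Signaling, Quiet };  // hypothetical, for this sketch

// Classify a 32-bit float bit pattern without ever loading it as a float.
inline NanKind classify_nan(std::uint32_t u) {
    const std::uint32_t exponent = (u >> 23) & 0xFFu;
    const std::uint32_t mantissa = u & 0x7FFFFFu;
    if (exponent != 0xFFu || mantissa == 0u) {
        return NanKind::NotNan;  // finite number, or infinity when the mantissa is zero
    }
    // The top mantissa bit distinguishes quiet (1) from signaling (0).
    return (mantissa & 0x400000u) ? NanKind::Quiet : NanKind::Signaling;
}
```

Because the sign bit is ignored, the same function also classifies the NaNs with negative sign bits.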

Flip the sign bit on any of these bit-patterns and you get negative versions of all the preceding floats.

Negative zero:

10000000000000000000000000000000

Negative denormals:

10000000000000000000000000000001
10000000011111111111111111111111

-FLT_MIN, a.k.a. -std::numeric_limits<float>::min(), approximately -1.18e-38:

10000000100000000000000000000000

-FLT_MAX, a.k.a. std::numeric_limits<float>::lowest(), approximately -3.4e+38:

11111111011111111111111111111111

-HUGE_VALF, a.k.a. -std::numeric_limits<float>::infinity():

11111111100000000000000000000000

Signaling NaNs with negative sign bits:

11111111100000000000000000000001
11111111101111111111111111111111

Quiet NaNs with negative sign bits:

11111111110000000000000000000000
11111111111111111111111111111111
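Flipping the sign bit amounts to XORing the bit pattern with 0x80000000. A minimal sketch, assuming a 32-bit IEEE 754 float; `bits_of`, `float_from_bits`, and `flip_sign` are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>

// Copy the float's bytes into a 32-bit unsigned integer.
inline std::uint32_t bits_of(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Reinterpret 32 bits as a float.
inline float float_from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Toggle only the top (sign) bit; exponent and mantissa are untouched.
inline float flip_sign(float f) {
    return float_from_bits(bits_of(f) ^ 0x80000000u);
}
```

Note that `flip_sign(0.0f)` has the bit pattern 80000000 yet still compares equal to 0.0f with `==`, which is why negative zero has to be detected by looking at the bits.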

Implementing IEEE 754’s `totalOrder`

IEEE 754 specifies a totalOrder predicate on floats (standardized as std::strong_order in C++20) which orders the floats like this:

Negative quiet NaNs, ordered by payload bits.
Negative signaling NaNs, ordered by payload bits.
Negative infinity.
Negative normal and denormal numbers.
Negative zero.
Positive zero.
Positive normal and denormal numbers.
Positive infinity.
Positive signaling NaNs, ordered by payload bits.
Positive quiet NaNs, ordered by payload bits.

According to Stack Overflow, this is equivalent to comparing the bit patterns as if they were sign-magnitude integers (note: not ordinary two’s-complement integers), with the caveat that negative zero should be ordered less-than positive zero. So if the sign bit is set, you should convert the sign-magnitude value to two’s complement and then subtract 1 before comparing. I believe this can be implemented by the following C++20 algorithm:

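#include <bit>      // std::bit_cast
#include <climits>  // INT_MIN
#include <compare>  // std::strong_ordering

constexpr std::strong_ordering totalOrder(float x, float y)
{
    int rx = std::bit_cast<int>(x);
    int ry = std::bit_cast<int>(y);
    // Sign bit set: negate the magnitude, then subtract 1 so that -0.0 sorts below +0.0.
    rx = (rx < 0) ? (INT_MIN - rx - 1) : rx;
    ry = (ry < 0) ? (INT_MIN - ry - 1) : ry;
    return rx <=> ry;
}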
