Floating Point Numbers are Great

28/04/2025 · 10 min read

IEEE 754 makes so much sense.

Introduction

Before learning about floating point representation, we should review how integers are stored in binary. You probably know about the binary system and all that good stuff, but here is a quick refresher.

The rightmost bit has a value of 1, and each bit to its left is worth twice as much: 2, 4, 8, etc. This is the same idea as the decimal system's place values of 1, 10, 100, etc., but in base 2:

| 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|-----|----|----|----|---|---|---|---|

For example, the number 5 in decimal is represented by the binary number 101, because $5 = 4 + 1 = 1(2^2) + 0(2^1) + 1(2^0)$:

| 8 | 4 | 2 | 1 |
|---|---|---|---|
| 0 | 1 | 0 | 1 | = 1*4 + 1*1 = 5
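If you want to poke at this yourself, here is a quick Python sketch (my own illustration, not from the original post) that recovers a number from its bits:

```python
n = 5

# Extract the 8 bits of n, most significant first.
bits = [(n >> i) & 1 for i in range(7, -1, -1)]
place_values = [2**i for i in range(7, -1, -1)]  # 128, 64, ..., 2, 1

print(bits)  # [0, 0, 0, 0, 0, 1, 0, 1]
print(sum(b * v for b, v in zip(bits, place_values)))  # 5
```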

Fixed Point Numbers

We can extend the idea of how we represent integers to represent fractional numbers as well. Fixed point numbers are one way of doing this. After the decimal point, we continue dividing by 2: $1/2$, $1/4$, $1/8$, etc.

For example, we can have an 8 bit fixed point number, with 4 bits dedicated to the fractional part like this:

| 8 | 4 | 2 | 1 | . | 1/2 | 1/4 | 1/8 | 1/16 |
|---|---|---|---|---|-----|-----|-----|------|

A number like 6.75 can be represented as $0110.1100$ in this case:

| 8 | 4 | 2 | 1 | . | 1/2 | 1/4 | 1/8 | 1/16 |
|---|---|---|---|---|-----|-----|-----|------|
| 0 | 1 | 1 | 0 | . | 1 | 1 | 0 | 0 | = 4 + 2 + 0.5 + 0.25 = 6.75
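As a rough sketch of how such a format could be decoded (the helper name fixed_to_float is just something I made up for illustration):

```python
def fixed_to_float(bits: int, frac_bits: int = 4) -> float:
    # An 8-bit pattern with `frac_bits` bits after the point
    # encodes the value bits / 2**frac_bits.
    return bits / 2**frac_bits

print(fixed_to_float(0b01101100))  # 6.75  (the 0110.1100 example above)
```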

Side Note

You might also notice that some numbers can’t be represented exactly. This isn’t just a limitation of the binary system. The problem is present in every number system when we have a limited number of digits.

Even our decimal system has this problem. Try to write the fraction $\frac{1}{3}$ in the decimal system. You’ll get $0.333333\dots$. We can never exactly write this value unless we go on forever.

This is called a rounding error.

The Problem

Fixed point numbers are a good first step, but they are very limited in terms of what we can do with them. Firstly, notice how we can only represent numbers from $0$ to $15.9375$ (since $\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = 0.9375$). Remember that with integers, 8 bits can represent numbers from $0$ to $255$. That’s a huge loss of range.

To compensate, we can dedicate more bits to the integer part and fewer bits to the fractional part, like so:

| 32 | 16 | 8 | 4 | 2 | 1 | . | 1/2 | 1/4 |
|----|----|---|---|---|---|---|-----|-----|

Now we can go from $0$ to $63.75$. That’s a wider range, but do you notice something? Now the fractional part can only be $0.25$, $0.5$, or $0.75$ (or zero). We gained range, but lost precision.

This is why fixed point numbers are limited. The decimal point is “fixed” in place. We can either have a large range of numbers but lose precision, or have a very precise fractional part but only a small range of numbers.
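We can quantify this tradeoff with a tiny sketch. Assuming an unsigned fixed point format (and a helper name I invented), the largest value and the gap between neighboring values fall directly out of the bit split:

```python
def fixed_point_stats(total_bits: int, frac_bits: int):
    # Smallest step between representable values, and the largest value.
    step = 1 / 2**frac_bits
    largest = (2**total_bits - 1) * step
    return largest, step

print(fixed_point_stats(8, 4))  # (15.9375, 0.0625): precise, but small range
print(fixed_point_stats(8, 2))  # (63.75, 0.25):     wide range, coarse steps
```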

If the dot could move (or float…) around dynamically, we could get the best of both worlds: range and precision.

Floating Point Numbers

Floating point numbers solve the tradeoff problem that fixed point numbers have. We still have a limited number of bits, but now we can place the decimal point anywhere we want.

Let’s see it with a decimal example before we move on to binary. Imagine we only have 3 digits of space given to us. We can write any digit in those slots, and we are allowed to place the decimal point wherever we want. Imagine all the numbers you can write.

  • If the integer part is zero, the fractional part can have three digits of precision: $.123$, $.999$, $.001$, etc.
  • If the integer part has one digit, the fractional part can have two digits of precision: $7.77$, $1.69$, etc.
  • If the integer part has two digits, the fractional part can have one digit of precision: $18.3$, $12.9$, etc.
  • If the integer part has three digits, the fractional part can have no digits of precision: $420$, $999$, etc.

This is why floating point numbers are useful. We can represent large numbers with low precision, or small numbers with high precision.

This idea can be formalized further using scientific notation.

Scientific Notation

Scientific notation is a standard way of writing numbers. It lets us write very large or very small numbers in a compact, readable form.

A number can be represented in scientific notation with a mantissa and an exponent.

$$420 = \overbrace{4.20}^{\text{mantissa}} \times 10^{\overbrace{2}^{\text{exp}}}$$

The mantissa should only have a single nonzero digit in the integer part; the fractional part can be anything. The exponent is used to “shift” the decimal point around. For example, $10^2$ means shift by 2 digits; that’s how $420$ is equal to $4.20 \times 10^2$.

Info

This property of the mantissa having a single nonzero digit in the integer part is called normalization. Numbers that don’t have this property are called denormalized or subnormal numbers. For example, $1.23 \times 10^2$ is normalized while $12.3 \times 10^1$ is denormalized.

Notice that a number like $0.12 \times 10$ is also denormalized because the integer part has to be nonzero in normalized numbers.

We can easily see how this applies to the decimal numbers we talked about earlier. All the possibilities for the decimal point can be turned into a normalized scientific form:

  • $.123 \to 1.23 \times 10^{-1}$
  • $1.69 \to 1.69 \times 10^0$ (already normalized)
  • $18.3 \to 1.83 \times 10^1$
  • $420 \to 4.20 \times 10^2$
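Here is a minimal sketch of base-10 normalization in Python (assuming a positive, nonzero input; normalize_base10 is a name I made up):

```python
def normalize_base10(x: float) -> tuple[float, int]:
    # Shift the decimal point until the mantissa lands in [1, 10),
    # counting the shifts as the exponent. Assumes x > 0.
    exp = 0
    while x >= 10:
        x /= 10
        exp += 1
    while x < 1:
        x *= 10
        exp -= 1
    return x, exp

print(normalize_base10(420))    # (4.2, 2)
print(normalize_base10(0.123))  # (1.23, -1)
```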

With this knowledge of how scientific notation works, we can understand the IEEE 754 standard.

The IEEE 754 Standard

IEEE 754 defines a standard format for how floating point numbers should be represented in binary. It’s based on the concepts we’ve already learned: mantissa and exponent.

The 32 bit standard allocates 1 bit to the sign, 8 bits to the exponent, and 23 bits to the mantissa.

The sign bit is for whether the number is positive or negative. 1 means negative, 0 means positive.

The next 8 bits, the exponent, store the power of 2 that the mantissa is scaled by. The stored value isn’t the actual exponent of the number; it’s shifted up by 127. This offset is called the bias, and we’ll see how it works in the example conversions. It’s done so that we can represent both positive and negative exponents.

The last 23 bits are allocated to the mantissa of the number. We actually omit the integer part and only store the fractional part. That’s because in base 2, the only digit that can be in the integer part of a normalized number is 1.

Tip

To understand the mantissa part a bit more, think about how it works in base 10. The normalized mantissa has to have a single nonzero digit in the integer part. That means the digits $1, 2, 3, \dots, 9$ are allowed.

But in base 2, the only nonzero digit we have is 1. So a normalized base 2 number always starts with $1$, and we can ignore it while storing the number to save a bit.
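To see these three fields on a real number, here is a short Python sketch (my own, using the standard struct module) that reinterprets a float’s 32 bits as an integer and masks out each field:

```python
import struct

def float_fields(x: float):
    # Pack as a big-endian binary32, then reinterpret the same
    # 4 bytes as an unsigned integer to get at the raw bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = bits >> 31           # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits
    mantissa = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, mantissa

s, e, m = float_fields(6.75)
print(s, f"{e:08b}", f"{m:023b}")  # 0 10000001 10110000000000000000000
```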

Conversion Examples

Question

Convert $-37.8125$ to IEEE 754 format.

Let’s start with the easiest part, the sign. The number is negative, so the sign bit will be 1. That’s all.

$$\text{sign} = 1$$

Now, we should convert the number into binary. We can convert the integer and fractional parts separately like so:

  • $37 = 32 + 4 + 1 = 100101_2$
  • $0.8125 = 1/2 + 1/4 + 1/16 = 0.1101_2$

So, our number is $100101.1101_2$. Now we need to shift the point until we only have a single digit in the integer part. Remember that we’re in base 2, so we multiply by a power of 2 for each digit shifted.

$$100101.1101_2 = 1.001011101_2 \times 2^5$$

Since the integer part is always 1, we drop that, and get our mantissa. We can add padding to the end to complete it to 23 bits if necessary.

$$\text{mantissa} = 001011101~00000000000000$$

We found the exponent of the number to be 5, but that’s not what we write in the IEEE 754 representation. Instead we add what we call the bias, which is 127 in the 32 bit standard. That means our exponent will actually be $5 + 127 = 132$. Then we can convert this to binary and use it as the exponent.

$$5 + 127 = 132 = 10000100_2$$
$$\text{exponent} = 10000100$$

About the Bias

This bias is used so that we can represent both positive and negative exponents. We could’ve used something like a two’s complement system, but we use a bias instead because it preserves the order between numbers. In two’s complement, comparing numbers is not as straightforward; with a bias, we can compare the exponents as if they were unsigned integers. Two’s complement is great when we are doing arithmetic with numbers, but it’s not necessary here.

Putting it all together, our number in the IEEE 754 format (32 bit) is:

$$\text{sign + exponent + mantissa} = 1~10000100~001011101~00000000000000$$
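As a sanity check (my addition, not part of the original worked example), we can ask Python for the same bit pattern:

```python
import struct

# Reinterpret -37.8125's binary32 bytes as an unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", -37.8125))[0]
print(f"{bits:032b}")
# 11000010000101110100000000000000
# = 1 | 10000100 | 00101110100000000000000, matching the hand conversion.
```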

Try it Yourself

Click on the bits to toggle them, and see how the values and the final number change. You might want to scroll horizontally to see the whole thing.

Special Values

Some bit combinations in the exponent and mantissa are special. These are used to represent values like infinity, NaN (not a number) and zero.

  • If the exponent and mantissa are all 0s, then the value is 0.
  • If the exponent is all 0s but the mantissa is nonzero, then it’s a subnormal value.
  • If the exponent is all 1s and the mantissa is all 0s, then the value is infinity.
  • If the exponent is all 1s and the mantissa is nonzero, then it’s NaN (not a number).
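These patterns are easy to verify by constructing the bits directly. A sketch (from_bits is a helper I wrote for illustration):

```python
import struct

def from_bits(bits: int) -> float:
    # Reinterpret a 32-bit pattern as a binary32 float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(from_bits(0b0_00000000_00000000000000000000000))  # 0.0
print(from_bits(0b0_11111111_00000000000000000000000))  # inf
print(from_bits(0b1_11111111_00000000000000000000000))  # -inf
print(from_bits(0b0_11111111_00000000000000000000001))  # nan
```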

Subnormal Numbers

In the simulation above, if you make the exponent all 0s and change the mantissa to a nonzero value, you’ll notice the number we get is different from what we expect.

This special case is called subnormal (denormalized) numbers. These are used to represent even smaller numbers than what the normalized format allows.

The value calculation is still simple. You just change a few things:

  • We don’t add the leading 1 to the mantissa. If the mantissa is $101_2$, the value is just $0.101_2$ instead of $1.101_2$.

  • Normally, the real exponent of a number is calculated as $\text{exponent} - \text{bias}$. So if the exponent is all 0s, the real exponent should be $-127$. But subnormal numbers are a special case. Since we don’t add the leading 1 to the mantissa, we also need to adjust the exponent to make up for this. So instead, the real exponent is calculated as $1 - \text{bias}$, which becomes $1 - 127 = -126$ in this case.

These two changes allow us to represent subnormal numbers. Play around with the simulation to understand it better.
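For instance, the smallest positive subnormal has a single 1 in the lowest mantissa bit, giving $2^{-23} \times 2^{-126} = 2^{-149}$. A quick check in Python (reusing the same bit-reinterpretation trick as above):

```python
import struct

def from_bits(bits: int) -> float:
    # Reinterpret a 32-bit pattern as a binary32 float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Exponent field all 0s, mantissa = 1: value is 2**-23 * 2**-126 = 2**-149.
print(from_bits(0b0_00000000_00000000000000000000001))  # ~1.4e-45
print(2.0**-149)                                        # the same value

# Compare with the smallest *normal* number: 1.0 * 2**-126.
print(from_bits(0b0_00000001_00000000000000000000000))  # ~1.18e-38
```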

Other Formats of IEEE 754

We mainly talked about the 32 bit standard here, but there are other formats as well. Here is a list of the three most common:

  • binary32: The 32 bit standard.
    • 1 + 8 + 23 = 32
    • Bias = 127
    • Called the single-precision format.
    • The float data type in C and C-like languages.
  • binary64: The 64 bit standard
    • 1 + 11 + 52 = 64
    • Bias = 1023
    • Called the double-precision format.
    • The double data type in C and C-like languages.
  • binary128: The 128 bit standard
    • 1 + 15 + 112 = 128
    • Bias = 16383
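The practical difference shows up in how closely each format approximates a value. For example, 0.1 has no exact binary representation, and binary32 and binary64 round it to different nearby values. A quick Python comparison (Python’s own float is binary64):

```python
import struct

# Round-trip 0.1 through binary32, then print both approximations.
as_binary32 = struct.unpack(">f", struct.pack(">f", 0.1))[0]

print(f"{as_binary32:.20f}")  # 0.10000000149011611938 (binary32)
print(f"{0.1:.20f}")          # 0.10000000000000000555 (binary64)
```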

Conclusion

This is it. A way to represent a large range of numbers with a limited number of bits. This is basically how all computers represent non-integer numbers. Maybe you’ll never need this knowledge during programming (other than the fact that you can’t blame the 0.1 + 0.2 !== 0.3 “bug” on JavaScript now), but it’s really interesting to see how they came up with this standard, and how you might have even invented it yourself.

As a final challenge, try to convert the number $16777217_{10}$ to a 32-bit IEEE 754 format. You might discover something interesting.
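If you want to check your answer afterwards (or spoil the surprise), a round trip through binary32 in Python shows what happens:

```python
import struct

n = 16777217  # 2**24 + 1
round_tripped = struct.unpack(">f", struct.pack(">f", n))[0]

print(round_tripped)       # 16777216.0
print(round_tripped == n)  # False: 24 significand bits can't hold 2**24 + 1
```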