3.1 Why Floating Point?
A fixed-point format of practical word size cannot hold both 0.000004 and 4 × 10¹² with useful precision. Floating point stores a number as ± mantissa × 2^exponent, sacrificing uniform precision for enormous range. IEEE 754 standardised the encoding and the rounding of the basic operations, so conforming machines produce the same results for them.
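The trade of uniform precision for range can be observed directly. The sketch below assumes Python 3.9 or later (for `math.ulp`) and uses Python's built-in float, which is an IEEE 754 double; `spacing` is a hypothetical helper name. Note that `math.frexp` normalizes the mantissa to [0.5, 1) rather than the 1.f form used by IEEE 754, so its exponent is one larger.

```python
import math

def spacing(x: float) -> float:
    """Gap between x and the next representable double (one unit in the last place)."""
    return math.ulp(x)

for x in (0.000004, 1.0, 4e12):
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    print(f"{x:>10g} = {m:.6f} * 2^{e}, spacing = {spacing(x):.3e}")
```

Both extremes are representable, but the gap between neighbouring values grows with the magnitude: precision is relative, not absolute.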
3.2 The IEEE 754 Formats
| Field | Single (32-bit) | Double (64-bit) |
|---|---|---|
| Sign (s) | 1 bit | 1 bit |
| Exponent (e) | 8 bits, bias 127 | 11 bits, bias 1023 |
| Fraction (f) | 23 bits | 52 bits |
| Value (normalized) | (−1)ˢ × 1.f × 2^(e−127) | (−1)ˢ × 1.f × 2^(e−1023) |
Two ideas carry all the marks:
- Hidden bit: a normalized binary number always starts "1." — so the leading 1 is not stored, buying one free bit of precision.
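- Biased exponent: the field stores e = E + 127, where E is the true exponent, instead of a signed E. For floats of the same sign the raw bit patterns then sort in numeric order, so comparison can reuse an integer comparator (the sign bit is handled separately). That is why bias beats 2's complement here, a classic "why" question.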
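Both ideas can be checked on real bit patterns. The following is a minimal sketch using Python's standard `struct` module; `float_to_bits` and `split_fields` are hypothetical helper names, and Python 3.9 or later is assumed for the type hints.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def split_fields(bits: int) -> tuple[int, int, int]:
    """Split a 32-bit pattern into (sign, stored exponent, fraction)."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

# 1.0 = 1.0 x 2^0: true exponent 0 is stored as 127, and the leading 1 is not stored.
print(split_fields(float_to_bits(1.0)))   # (0, 127, 0)

# For non-negative floats, integer order of the patterns matches numeric order.
values = [0.0, 1e-3, 0.5, 1.0, 1.5, 2.0, 1e10]
patterns = [float_to_bits(v) for v in values]
print(patterns == sorted(patterns))       # True
```

The ordering property holds only within one sign: IEEE 754 uses sign-magnitude, so negative patterns sort in reverse.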
3.3 Worked Encoding: −13.25 → Single Precision
- Binary: 13 = 1101₂ and 0.25 = 0.01₂, so 13.25 = 1101.01₂.
- Normalize: 1101.01 = 1.10101 × 2³ (moved the point 3 places left).
- Sign: negative → s = 1.
- Exponent: e = 3 + 127 = 130 = 1000 0010₂.
- Fraction: drop the hidden 1 → f = 10101 000…0 (padded to 23 bits).
- Assemble and group into nibbles:
1 10000010 10101000000000000000000
= 1100 0001 0101 0100 0000 0000 0000 0000
= C 1 5 4 0 0 0 0
−13.25 = 0xC1540000. ✓
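The hand encoding above can be mirrored step by step in code. This is a sketch under stated assumptions: `encode_single` is a hypothetical helper that handles only nonzero finite values that fit exactly in a normalized single (it truncates instead of rounding), and `struct` is used only as an independent cross-check.

```python
import math
import struct

def encode_single(x: float) -> int:
    """Encode a nonzero finite value that fits exactly in a normalized single."""
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))                  # abs(x) == m * 2**e with 0.5 <= m < 1
    significand = m * 2                        # rescale to 1.f form
    biased = (e - 1) + 127                     # true exponent plus bias
    fraction = int((significand - 1) * 2**23)  # drop the hidden 1, keep 23 bits
    return (sign << 31) | (biased << 23) | fraction

bits = encode_single(-13.25)
print(f"{bits:#010x}")                         # 0xc1540000
print(struct.pack(">f", -13.25).hex())         # c1540000 (library cross-check)
```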
3.4 Worked Decoding: 0x41C80000
- Binary: 0x41C80000 = 0100 0001 1100 1000 0000 … 0000₂.
- Sign: s = 0 → positive.
- Exponent: e = 1000 0011₂ = 131, so the true exponent is 131 − 127 = 4.
- Fraction: f = 1001 0…0, so the mantissa is 1.1001₂.
- Value: 1.1001₂ × 2⁴ = 11001₂ = 25.0. ✓
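Decoding is the same field arithmetic in reverse. The sketch below assumes a hypothetical helper `decode_single` that handles normalized patterns only (stored exponent 1 to 254) and rejects everything else.

```python
import struct

def decode_single(bits: int) -> float:
    """Decode a normalized single-precision pattern (stored exponent 1..254)."""
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if not 1 <= biased <= 254:
        raise ValueError("only normalized patterns are handled in this sketch")
    significand = 1 + fraction / 2**23          # restore the hidden 1
    return (-1) ** sign * significand * 2.0 ** (biased - 127)

print(decode_single(0x41C80000))                              # 25.0
print(struct.unpack(">f", bytes.fromhex("41C80000"))[0])      # 25.0 (library cross-check)
```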
3.5 Special Values (Memorise the Table)
| e field | f field | Meaning |
|---|---|---|
| all 0s | 0 | ±0 (signed zero) |
| all 0s | ≠ 0 | denormal (no hidden bit, gradual underflow) |
| 1–254 (single), 1–2046 (double) | any | normalized number |
| all 1s | 0 | ±∞ (e.g., 1.0/0.0) |
| all 1s | ≠ 0 | NaN (e.g., 0/0, √−1) |
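
The table translates directly into a classifier over the exponent and fraction fields. This is a minimal sketch for single precision; `classify_single` and `bits_of` are hypothetical helper names, and the test patterns are obtained from Python values via `struct` rather than typed in by hand.

```python
import struct

def bits_of(x: float) -> int:
    """Return the 32-bit single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def classify_single(bits: int) -> str:
    """Name the class of a 32-bit single-precision pattern."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return f"{sign}zero" if fraction == 0 else "denormal"
    if exponent == 0xFF:
        return f"{sign}infinity" if fraction == 0 else "NaN"
    return "normalized"

for x in (0.0, -0.0, 1e-40, 1.0, float("inf"), float("-inf"), float("nan")):
    print(f"{x!r:>8} -> {classify_single(bits_of(x))}")
```

Python itself raises `ZeroDivisionError` for `1.0/0.0`, so the sketch builds infinities and NaN with `float("inf")` and `float("nan")` instead of dividing.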
Ranges: single-precision normalized magnitudes span ≈1.18 × 10⁻³⁸ to ≈3.4 × 10³⁸ with ~7 decimal digits of precision; double gives ~15–16 digits.
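These limits follow from the extreme normalized bit patterns, and the digit counts from the significand width (24 bits for single and 53 for double, counting the hidden bit). A short sketch, with `bits_to_float` as a hypothetical helper:

```python
import math
import struct

def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = bits_to_float(0x00800000)  # e = 1, f = 0: 2^-126, about 1.18e-38
largest_finite = bits_to_float(0x7F7FFFFF)   # e = 254, f all 1s: about 3.4e38

single_digits = 24 * math.log10(2)           # about 7.2 decimal digits
double_digits = 53 * math.log10(2)           # about 15.95 decimal digits

print(smallest_normal, largest_finite)
print(round(single_digits, 2), round(double_digits, 2))
```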
3.6 Floating-Point Addition, Step by Step
Algorithm: (1) align: shift the mantissa of the operand with the smaller exponent right, raising its exponent by one per shift, until the exponents match; (2) add/subtract mantissas; (3) normalize the result (shift + adjust exponent); (4) round.
Example: 2.5 + 4.75.
2.5 = 10.1 = 1.01 x 2^1
4.75 = 100.11 = 1.0011 x 2^2
Align: 1.01 x 2^1 -> 0.1010 x 2^2
Add: 0.1010 + 1.0011 = 1.1101
Result: 1.1101 x 2^2 = 111.01 = 7.25 (already normalized)
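The same steps can be sketched on integer significands. Assumptions, all hypothetical choices for illustration: operands are positive and normalized, a significand is the 1.f value scaled by 2^23, and bits shifted out are truncated rather than rounded, so step (4) is omitted. Python 3.9 or later is assumed for the type hints.

```python
import math

FRAC_BITS = 23  # fraction width of single precision

def to_pair(x: float) -> tuple[int, int]:
    """Positive x -> (integer significand, true exponent)."""
    m, e = math.frexp(x)                      # x == m * 2**e with 0.5 <= m < 1
    return int(m * 2 ** (FRAC_BITS + 1)), e - 1

def from_pair(sig: int, exp: int) -> float:
    """Inverse of to_pair."""
    return sig / 2 ** FRAC_BITS * 2.0 ** exp

def fp_add(sig_a: int, exp_a: int, sig_b: int, exp_b: int) -> tuple[int, int]:
    """Add two positive normalized values given as (significand, exponent)."""
    # 1. Align: make a the operand with the larger exponent, shift b right.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b >>= exp_a - exp_b
    # 2. Add the aligned significands.
    total = sig_a + sig_b
    exponent = exp_a
    # 3. Normalize: a carry out of the 1.f range needs one right shift.
    if total >= 1 << (FRAC_BITS + 1):
        total >>= 1
        exponent += 1
    return total, exponent

print(from_pair(*fp_add(*to_pair(2.5), *to_pair(4.75))))   # 7.25
```

For 2.5 + 4.75 the sum stays below 2.0 after alignment, so the normalize branch is not taken; 1.5 + 1.5 is a small case that does trigger it.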
**Why alignment shifts the smaller number:** shifting the larger one left would push bits out of its integer part (overflow); shifting the smaller right merely risks losing low-order bits, which rounding then handles. Multiplication is simpler: XOR the signs, add the stored exponent fields and subtract one bias (each field already contains a bias, so the raw sum contains two), multiply the mantissas, then normalize and round. Forgetting to subtract the extra bias is the standard trap.
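The bias bookkeeping in multiplication is easiest to see with stored exponent fields. The sketch below is an illustration under assumptions: positive normalized single-precision operands, significands given as the 1.f value scaled by 2^23, and truncation in place of rounding. `fp_mul` and `pack` are hypothetical helper names.

```python
import struct

BIAS = 127
FRAC_BITS = 23

def fp_mul(sig_a: int, biased_a: int, sig_b: int, biased_b: int) -> tuple[int, int]:
    """Multiply two positive normalized values given as (significand, stored exponent)."""
    biased = biased_a + biased_b - BIAS       # the raw sum holds two biases; keep one
    product = (sig_a * sig_b) >> FRAC_BITS    # rescale to FRAC_BITS fraction bits
    if product >= 1 << (FRAC_BITS + 1):       # [1, 2) x [1, 2) lies in [1, 4)
        product >>= 1
        biased += 1
    return product, biased

def pack(sig: int, biased: int) -> int:
    """Assemble a positive single-precision pattern, dropping the hidden 1."""
    return (biased << FRAC_BITS) | (sig - (1 << FRAC_BITS))

# 2.5 = 1.25 x 2^1 (stored exponent 128), 4.75 = 1.1875 x 2^2 (stored exponent 129)
sig, biased = fp_mul(int(1.25 * 2**23), 128, int(1.1875 * 2**23), 129)
print(f"{pack(sig, biased):#010x}")           # 0x413e0000
print(struct.pack(">f", 2.5 * 4.75).hex())    # 413e0000 (library cross-check)
```

Without the `- BIAS` term the stored exponent would come out as 257, which does not even fit in the 8-bit field.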
3.7 Rounding
IEEE 754 hardware typically keeps three extra bits (guard, round and sticky) beyond the mantissa during arithmetic, then applies a rounding mode:
| Mode | Rule |
|---|---|
| Round to nearest, ties to even (default) | choose the nearer value; on an exact tie pick the one whose LSB is 0, which avoids statistical drift |
| Round toward 0 | truncate |
| Round toward +∞ / −∞ | directed (used in interval arithmetic) |
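
The default mode can be sketched on an integer significand that carries extra low-order bits. `round_nearest_even` is a hypothetical helper; with `extra_bits=3` the dropped bits play the role of guard, round and sticky. Real hardware folds all lower bits into the sticky bit, whereas this sketch simply keeps them as part of the integer.

```python
import struct

def round_nearest_even(value: int, extra_bits: int) -> int:
    """Drop the low extra_bits of value, rounding to nearest with ties to even."""
    if extra_bits == 0:
        return value
    kept = value >> extra_bits
    remainder = value & ((1 << extra_bits) - 1)
    half = 1 << (extra_bits - 1)
    if remainder > half or (remainder == half and kept & 1):
        kept += 1   # may carry out of the mantissa; the caller must renormalize
    return kept

print(bin(round_nearest_even(0b1010_100, 3)))   # 0b1010: tie, kept value already even
print(bin(round_nearest_even(0b1011_100, 3)))   # 0b1100: tie, round up to even
print(bin(round_nearest_even(0b1010_101, 3)))   # 0b1011: above half, round up

# The same rule in a real conversion: 2^24 + 1 lies exactly between two singles.
print(struct.unpack(">f", struct.pack(">f", 16777217.0))[0])   # 16777216.0
```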
🎯 Exam Focus
- Encode −0.375 and 18.625 in IEEE 754 single precision (show every step and the final hex).
- Decode the single-precision pattern 0xC2290000 to decimal.
- Why does IEEE 754 use a biased exponent and a hidden mantissa bit? Give both hardware justifications.
- List the bit patterns for +0, −∞ and NaN in single precision.
- Add 9.5 and 0.6875 using binary floating-point addition; show align/add/normalize/round.
- Differentiate overflow, underflow and denormal numbers in IEEE 754.