Floating Point

Review scientific notation
- [+|-]d.d*[e[+|-]dd*]
- Normalized
- 3.45x10⁴ + 4.56x10¹
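The pattern above can be checked mechanically. The following is a minimal sketch, assuming the pattern is read as: optional sign, one digit, a point, zero or more digits, then an optional exponent part. The names SCI, is_sci and is_normalized are made up for illustration.

```python
import re

# Assumed reading of [+|-]d.d*[e[+|-]dd*]:
# optional sign, one digit, a point, any digits, optional exponent.
SCI = re.compile(r"[+-]?\d\.\d*(e[+-]?\d+)?")

def is_sci(text):
    """Return True if the whole string matches the pattern."""
    return SCI.fullmatch(text) is not None

def is_normalized(text):
    """Normalized is taken to mean: the digit before the point is nonzero."""
    return is_sci(text) and text.lstrip("+-")[0] != "0"

for s in ["3.45e4", "-4.56e+1", "0.345e5", "34.5e3"]:
    print(s, is_sci(s), is_normalized(s))
```

Here 0.345e5 matches the pattern but is not normalized, and 34.5e3 does not match because it has two digits before the point.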
A simple representation
- 1 sign bit (s)
- 5 bits for exponent, excess 16 (e): 00000 = -16, 10000 = 0, 11111 = 15
- 8 bits for mantissa (m)
- seeeeemmmmmmmm
- (-1)^s x m.mmmmmmm x 2^(eeeee - 16)
- 4.125
- What is the smallest number in this system?
- What is the largest?
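A minimal sketch of a decoder for this 14-bit layout, assuming the mantissa is read as m.mmmmmmm (one integer bit, seven fraction bits) and the stored exponent is excess 16. The function name decode_simple is hypothetical. It can be used to try out candidate bit patterns for the two questions above.

```python
BIAS = 16  # excess-16 exponent

def decode_simple(bits):
    """Decode a 14-character string 'seeeeemmmmmmmm' (explicit leading mantissa bit)."""
    assert len(bits) == 14 and set(bits) <= {"0", "1"}
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:6], 2) - BIAS
    mantissa = int(bits[6:], 2) / 2**7  # m.mmmmmmm: 7 bits after the point
    return sign * mantissa * 2.0**exponent

# 4.125 = 100.001 in binary = 1.0000100 x 2^2
# stored exponent = 2 + 16 = 18 = 10010
print(decode_simple("0" + "10010" + "10000100"))

# Candidate patterns to experiment with
print(decode_simple("0" + "11111" + "11111111"))
print(decode_simple("0" + "00000" + "00000001"))
```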
But when we normalize
- It is always in the form 1.b* x 2^e
- So there is no need to store the 1
- New representation (letters as above)
- (-1)^s x 1.mmmmmmmm x 2^(eeeee - 16)
- 4.125
- Look at biggest and smallest numbers again
- Look at the gap
- What did we gain? What did we lose?
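A minimal sketch of a decoder for the hidden-bit version, assuming all eight stored bits are fraction bits after an implied leading 1 and that no bit patterns are reserved for special values. The name decode_hidden is hypothetical.

```python
BIAS = 16  # excess-16 exponent

def decode_hidden(bits):
    """Decode 'seeeeemmmmmmmm' where the leading 1 of the mantissa is implied."""
    assert len(bits) == 14 and set(bits) <= {"0", "1"}
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:6], 2) - BIAS
    fraction = int(bits[6:], 2) / 2**8  # 8 bits after the point
    return sign * (1 + fraction) * 2.0**exponent

# 4.125 = 1.00001 x 2^2 -> stored fraction bits 00001000
print(decode_hidden("0" + "10010" + "00001000"))

# Under these assumptions the all-zero pattern is 1.0 x 2^-16, not 0
print(decode_hidden("0" * 14))
```

Comparing the values just above the all-zero pattern with the distance from that pattern down to zero is one way to look at the gap.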
IEEE 754 floating point format
- Single Precision
  - 32 bits
  - 1 sign bit
  - 8 bit excess 127 exponent
  - 23 bit mantissa
  - exponent = 11111111 => infinity (m all 0s) or NaN (m not all 0s)
  - exponent = 00000000 => zero (m all 0s) or denormals (m not all 0s), (-1)^s x 0.m x 2^-126
  - See some code
  - The denormals give us a large set of numbers very close to zero
  - NaNs are the result of an illegal operation, e.g. sqrt(-5)
- Double Precision
  - 64 bits
  - 1 sign bit
  - 11 bit excess 1023 exponent
  - 52 bit mantissa
- Quad Precision
  - 128 bits
  - 1 sign bit
  - 15 bit excess 16383 exponent
  - 112 bit mantissa
- And some more code
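One way to inspect the single precision fields described above is to reinterpret the 32 bits as an integer. This is a sketch using Python's standard struct module. The names float32_fields and classify are illustrative, not part of any library.

```python
import struct

def float32_fields(x):
    """Round x to single precision and return (sign, stored exponent, mantissa)."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31
    exponent = (word >> 23) & 0xFF  # 8 bits, excess 127
    mantissa = word & 0x7FFFFF      # 23 bits
    return sign, exponent, mantissa

def classify(x):
    """Name the category implied by the exponent and mantissa fields."""
    _, exponent, mantissa = float32_fields(x)
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "NaN"
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormal"
    return "normal"

for value in [4.125, 0.0, 1e-40, float("inf"), float("nan")]:
    s, e, m = float32_fields(value)
    print(f"{value!r:>8}  s={s} e={e:08b} m={m:023b}  {classify(value)}")
```

For 4.125 the stored exponent is 2 + 127 = 129, and 1e-40 is small enough to land in the denormal range.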
Adding floating point numbers
- Find the larger exponent
- Shift the smaller in terms of the larger
- Add the mantissas
- Normalize if required.
- 3.26 x 10² + 6.04 x 10⁻¹
- Same for our format
- Add 6.5 to 12.3125 (in 14 bit fp format)
- Add 6 1/32 to 12.3125. Notice that here we have an error: the exact sum needs more mantissa bits than the format stores.
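The steps above can be sketched for the 14-bit hidden-bit format as follows. This is an illustration under stated assumptions: positive operands only, exponent range not checked, and bits that do not fit are truncated (the notes do not name a rounding rule). The names to_fp, from_fp and fp_add are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 8  # stored mantissa bits; the leading 1 is implied

def to_fp(value):
    """Return (exponent, significand), significand being the 9-bit integer 1mmmmmmmm."""
    value = Fraction(value)
    assert value > 0
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    return exponent, int(value * 2**FRACTION_BITS)  # int() truncates extra bits

def from_fp(exponent, significand):
    """Exact value of an (exponent, significand) pair."""
    return Fraction(significand, 2**FRACTION_BITS) * Fraction(2) ** exponent

def fp_add(a, b):
    (ea, sa), (eb, sb) = to_fp(a), to_fp(b)
    if ea < eb:                       # find the larger exponent
        ea, sa, eb, sb = eb, sb, ea, sa
    sb >>= ea - eb                    # shift the smaller; shifted-out bits are lost
    total = sa + sb                   # add the mantissas
    while total >= 2 ** (FRACTION_BITS + 1):  # normalize if required
        total >>= 1
        ea += 1
    return from_fp(ea, total)

print(float(fp_add(6.5, 12.3125)))               # 18.8125, exact
print(float(fp_add(Fraction(193, 32), 12.3125))) # 18.3125, exact sum is 18.34375
```

With truncation the second sum is short by 1/32. A round-to-nearest rule would give a different result, but it would still not be exact.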