Floating Point
- Review scientific notation
- [+|-]d.d*[e[+|-]dd*]
- Normalized
- 3.45 x 10^4 + 4.56 x 10^1
- A simple representation
- 1 sign bit (s)
- 5 bits for exponent, excess 16 (e) (00000 -> 0 - 16 = -16, 10000 -> 16 - 16 = 0, 11111 -> 31 - 16 = 15)
- 8 bits for mantissa (m)
- seeeeemmmmmmmm
- (-1)^s x m.mmmmmmm x 2^(eeeee - 16)
- 4.125
- What is the smallest number in this system?
- What is the largest?
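
To make the simple representation concrete, here is a minimal Python sketch of a decoder for it. The function name `decode_simple` is an illustration and not part of the notes. It assumes the layout above: 1 sign bit, 5 exponent bits in excess 16, and 8 mantissa bits read as m.mmmmmmm, with one bit before the binary point.

```python
def decode_simple(bits: int) -> float:
    """Decode a 14-bit pattern s eeeee mmmmmmmm (explicit leading mantissa bit)."""
    sign = (bits >> 13) & 0x1             # top bit
    exponent = ((bits >> 8) & 0x1F) - 16  # 5 bits, excess 16
    mantissa = (bits & 0xFF) / 2**7       # m.mmmmmmm: 7 bits after the point
    return (-1) ** sign * mantissa * 2.0 ** exponent


# 4.125 = 100.001b = 1.0000100b x 2^2, so e = 2 + 16 = 10010b
print(decode_simple(0b0_10010_10000100))  # 4.125
# Smallest positive value: mantissa 0.0000001b, exponent -16
print(decode_simple(0b0_00000_00000001) == 2.0 ** -23)  # True
# Largest value: mantissa 1.1111111b, exponent 15
print(decode_simple(0b0_11111_11111111))  # 65280.0
```

Note that 4.125 has more than one encoding in this format (for example 0.1000010b x 2^3), which is one reason to normalize.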
- But when we normalize
- It is always in the form 1.b* x 2^e
- So there is no need to store the 1
- New representation (letters as above)
- (-1)^s x 1.mmmmmmmm x 2^(eeeee - 16)
- 4.125
- Look at biggest and smallest numbers again
- Look at the gap
- What did we gain? What did we lose?
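
The hidden-bit version can be sketched the same way. The name `decode_hidden` is again only an illustration. It assumes the same 14-bit layout, but with all 8 stored mantissa bits after the binary point and an implied leading 1.

```python
def decode_hidden(bits: int) -> float:
    """Decode a 14-bit pattern s eeeee mmmmmmmm with an implied leading 1."""
    sign = (bits >> 13) & 0x1
    exponent = ((bits >> 8) & 0x1F) - 16    # 5 bits, excess 16
    significand = 1 + (bits & 0xFF) / 2**8  # 1.mmmmmmmm: 8 stored fraction bits
    return (-1) ** sign * significand * 2.0 ** exponent


# 4.125 = 1.00001b x 2^2: stored mantissa 00001000, exponent 10010
print(decode_hidden(0b0_10010_00001000))  # 4.125
# The all-zero pattern is no longer zero: it decodes to 1.0 x 2^-16
print(decode_hidden(0b0_00000_00000000) == 2.0 ** -16)  # True
# Largest value: 1.11111111b x 2^15
print(decode_hidden(0b0_11111_11111111))  # 65408.0
```

Under these assumptions the format gains one bit of precision, but it can no longer represent zero or anything between 0 and 2^-16. That is the gap the notes ask about.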
- IEEE 754 floating point format
- Single Precision
- 32 bits
- 1 sign bit
- 8 bit excess 127 exponent
- 23 bit mantissa
- exponent = 11111111 => infinity (m all 0s) or NaN (m not all 0s)
- exponent = 00000000 => denormals, (-1)^s x 0.m x 2^-126
- See some code
- The denormals give us a large set of numbers very close to zero
- NaNs are the result of an illegal operation, such as sqrt(-5)
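
One way to inspect the single precision fields is with Python's standard `struct` module, as sketched below. The helper names `float32_fields` and `float32_from_bits` are illustrative only. Python's `math.sqrt(-5)` raises an exception instead of returning a NaN, so this sketch produces a NaN with `inf - inf`.

```python
import math
import struct


def float32_fields(x):
    """Split x, rounded to single precision, into (sign, exponent, mantissa) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def float32_from_bits(bits):
    """Interpret a 32-bit integer as an IEEE 754 single precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value


print(float32_fields(4.125))     # (0, 129, 262144): 1.00001b x 2^(129 - 127)
print(float32_fields(math.inf))  # (0, 255, 0): all-ones exponent, zero mantissa

# inf - inf is an invalid operation; the result is a NaN
_, exponent, mantissa = float32_fields(math.inf - math.inf)
print(exponent == 255 and mantissa != 0)  # True

# Exponent field 0: denormals, 0.m x 2^-126
print(float32_from_bits(0x00000001) == 2.0 ** -149)  # True: smallest denormal
print(float32_from_bits(0x00800000) == 2.0 ** -126)  # True: smallest normalized value
```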
- Double Precision
- 64 bits
- 1 sign bit
- 11 bit excess 1023 exponent
- 52 bit mantissa
- Quad Precision
- 128 bits
- 1 sign bit
- 15 bit excess 16383 exponent
- 112 bit mantissa
- And some more code
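
The three IEEE formats follow one pattern: with k exponent bits, the excess (bias) is 2^(k-1) - 1. The sketch below tabulates this; the `FORMATS` table and the `bias` helper are illustrative names. This rule applies to the IEEE formats only, since the 14-bit teaching format above uses excess 16, not 15. The last line assumes Python's `float` is an IEEE double, which holds on common platforms.

```python
import sys

# name: (total bits, exponent bits, stored mantissa bits)
FORMATS = {
    "single": (32, 8, 23),
    "double": (64, 11, 52),
    "quad": (128, 15, 112),
}


def bias(exponent_bits: int) -> int:
    """Excess value used by an IEEE 754 binary format with this many exponent bits."""
    return 2 ** (exponent_bits - 1) - 1


for name, (total, e_bits, m_bits) in FORMATS.items():
    assert 1 + e_bits + m_bits == total  # sign + exponent + mantissa
    print(f"{name}: excess {bias(e_bits)}, {m_bits + 1} significant bits")

# 53 on IEEE double platforms: 52 stored bits plus the hidden bit
print(sys.float_info.mant_dig)
```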
- Adding floating point numbers
- Find the larger exponent
- Shift the smaller in terms of the larger
- Add the mantissas
- Normalize if required.
- 3.26 x 10^2 + 6.04 x 10^-1
- Same for our format
- Add 6.5 to 12.3125 (in 14 bit fp format)
- Add 6 1/32 (6.03125) to 12.3125; notice that here we get a rounding error.
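
The four addition steps can be sketched for the 14-bit hidden-bit format as follows. The names `split_positive` and `fp_add` are hypothetical. The sketch handles positive inputs only, and it truncates bits that do not fit. Truncation is an assumption, since the notes do not specify a rounding mode and real hardware uses extra guard bits and rounding.

```python
import math


def split_positive(x: float):
    """Return (significand, exponent) with x == significand / 256 * 2**exponent.

    The significand is the 9-bit integer 1mmmmmmmm (hidden bit included).
    Assumes x > 0 and that x is exactly representable in the 14-bit format.
    """
    m, e = math.frexp(x)       # x == m * 2**e with 0.5 <= m < 1
    return int(m * 512), e - 1


def fp_add(a: float, b: float) -> float:
    sig_a, exp_a = split_positive(a)
    sig_b, exp_b = split_positive(b)
    # Step 1: find the larger exponent (keep it in a)
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    # Step 2: shift the smaller operand so both use the larger exponent
    sig_b >>= exp_a - exp_b
    # Step 3: add the mantissas
    total, exponent = sig_a + sig_b, exp_a
    # Step 4: normalize; the bit shifted out here is lost (truncation)
    while total >= 512:
        total >>= 1
        exponent += 1
    return total / 256 * 2.0 ** exponent


print(fp_add(6.5, 12.3125))      # 18.8125, exact
print(fp_add(6.03125, 12.3125))  # 18.3125, but the true sum is 18.34375
```

The second sum, 18.34375 = 1.001001011b x 2^4, needs 9 bits after the binary point. Only 8 are stored, so the last bit is dropped.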