Floating point system

Floating point system#

import sys
print(sys.float_info)

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

print('Smallest number =', sys.float_info.min)
print('Largest  number =', sys.float_info.max)
print('2 * Unit round  =', sys.float_info.epsilon)

Smallest number = 2.2250738585072014e-308
Largest  number = 1.7976931348623157e+308
2 * Unit round  = 2.220446049250313e-16

Demonstration of unit round#

By default, Python computes in double precision where the unit round is

\[ u = 2^{-53} \]

one = 1.0
u = 2.0**(-53)

x1 = one + u
x2 = one + 1.000001*u
x3 = one + 2.0*u

print("u                = %50.40e" % u)
print("one              = %46.40f" % one)
print("one + u          = %46.40f" % x1)
print("one + 1.000001*u = %46.40f" % x2)
print("one + 2*u        = %46.40f" % x3)

u                =     1.1102230246251565404236316680908203125000e-16
one              =     1.0000000000000000000000000000000000000000
one + u          =     1.0000000000000000000000000000000000000000
one + 1.000001*u =     1.0000000000000002220446049250313080847263
one + 2*u        =     1.0000000000000002220446049250313080847263

We see that \(1+u\) is rounded down to \(1\), while \(1 + 1.000001 u\) is rounded up to the next floating point number which is \(1 + 2u\).

Compute unit round#

one   = 1.0
delta = 1.0

x = one + delta
while x > one:
    delta /= 2
    x = one + delta

print("Computed    unit round = %20.14e" % delta)
print("Theoretical unit round = %20.14e" % 2.0**(-53))
print("From python sys        = %20.14e" % (sys.float_info.epsilon/2))

Computed    unit round = 1.11022302462516e-16
Theoretical unit round = 1.11022302462516e-16
From python sys        = 1.11022302462516e-16

Representing integers in floating point system#

Not all integers can be exactly represented in floating point system. For example, in single precision, the integer

\[ 2^{23+1} + 1 = 16777217 \]

cannot be exactly represented.

from numpy import float32
print("16777215 is ", float32(16777215.0))
print("16777216 is ", float32(16777216.0))
print("16777217 is ", float32(16777217.0))
print("16777218 is ", float32(16777218.0))
print("16777219 is ", float32(16777219.0))

16777215 is  16777215.0
16777216 is  16777216.0
16777217 is  16777216.0
16777218 is  16777218.0
16777219 is  16777220.0

print("%30.25e" % float32(16777217.0))

1.6777216000000000000000000e+07

Associative law does not always hold#

In exact arithmetic

\[ (x + y) + z = x + (y + z) = (x + z) + y \]

but this does not hold in floating point arithmetic due to round-off errors.

print(1.0   + 1e-16 - 1.0)
print(1e-16 + 1.0   - 1.0)
print(1.0   - 1.0   + 1e-16)

0.0
0.0
1e-16

Only the last one is correct. The operations are performed left to right. Thus

1     + 1e-16 - 1     = (1     + 1e-16) - 1     = 1 - 1     = 0
1e-16 + 1     - 1     = (1e-16 + 1)     - 1     = 1 - 1     = 0
1     - 1     + 1e-16 = (1     - 1)     + 1e-16 = 0 + 1e-16 = 1e-16

Since 1e-16 is smaller than the unit round, 1 + 1e-16 is rounded down to 1.

More examples#

Also see examples here: github/cpraveen/na/computer_arithmetic