Computer representation of numbers
#%config InlineBackend.figure_format = 'svg'
from pylab import *
We are used to the decimal system for counting. In general we can use any base $\beta$ for the number system. On most modern computers the base is $\beta = 2$, i.e., we use the binary number system. Any non-zero number $x$ is written as
$$
x = s \cdot (.a_1 a_2 \ldots a_t)_\beta \cdot \beta^e = \textrm{(sign)} \cdot \textrm{(fractional part)} \cdot \textrm{(exponent)}
$$

where

$$
\begin{aligned}
s &= -1 \textrm{ or } +1 \\
0 &\le a_i \le \beta - 1 \\
L &\le e \le U \textrm{ is an integer}
\end{aligned}
$$

In base $\beta$, the fractional part has the value

$$
(.a_1 a_2 \ldots a_t)_\beta = \frac{a_1}{\beta} + \frac{a_2}{\beta^2} + \ldots + \frac{a_t}{\beta^t}
$$

We will assume that $a_1 \ne 0$, which is called a normalized floating point system, and then

$$
1 \le a_1 \le \beta - 1
$$

$(\beta, t, L, U)$ specifies the arithmetic characteristics of the floating point system.
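For $\beta = 2$, this normalized form can be inspected directly in Python. As a small illustration (not part of the original notes), the standard math.frexp returns a fraction in $[1/2, 1)$ and an exponent, which is exactly the $(.a_1 a_2 \ldots)_2 \cdot 2^e$ decomposition above.

import math
x = 6.25
m, e = math.frexp(x)   # x = m * 2**e with 0.5 <= m < 1, i.e. a_1 != 0
print(m, e)            # prints 0.78125 3, since 6.25 = 0.78125 * 2**3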
1 Chopping and rounding

Most real numbers cannot be exactly represented on a computer since we have finite precision. Hence they must be represented by a nearby number in the floating point system. This requires us to truncate a real number to fit into the floating point system, which can be done in two ways: chopping and rounding.
Consider a real number
$$
x = s \cdot (.a_1 a_2 \ldots a_t a_{t+1} \ldots )_\beta \cdot \beta^e, \qquad a_1 \ne 0
$$

Let

$$
\fl{x} = \textrm{approximation of $x$ in the floating point system}
$$

chopped machine representation

$$
\fl{x} = s \cdot (.a_1 a_2 \ldots a_t)_\beta \cdot \beta^e
$$

We throw away all digits in the fractional part that do not fit into the floating-point system.
rounded machine representation
$$
\fl{x} = \begin{cases}
s \cdot (.a_1 a_2 \ldots a_t)_\beta \cdot \beta^e & 0 \le a_{t+1} < \half\beta \\
s \cdot [(.a_1 a_2 \ldots a_t)_\beta + (.0\ldots01)_\beta ]\cdot \beta^e & \half\beta \le a_{t+1} < \beta
\end{cases}
$$

We approximate by the nearest floating point number.
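The two rules can be made concrete in a toy normalized system $(\beta, t)$. The sketch below is only illustrative: the function name fl, the defaults $\beta = 10$, $t = 4$ and the mode argument are choices made here, not part of any library.

import math

def fl(x, beta=10, t=4, mode="round"):
    # Represent x as s * (.a1 a2 ... at)_beta * beta**e with a1 != 0
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    m = abs(x)
    e = math.floor(math.log(m, beta)) + 1    # exponent so that beta**(e-1) <= m < beta**e
    digits = m / beta**e * beta**t           # mantissa scaled to t base-beta digits
    if mode == "chop":
        digits = math.floor(digits)          # throw away digits beyond position t
    else:
        digits = math.floor(digits + 0.5)    # round to the nearest t-digit number
    return s * digits * beta**(e - t)

print(fl(math.pi, mode="chop"), fl(math.pi, mode="round"))   # 3.141 3.142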
2 Error in computer representation

For most real numbers $x \ne \fl{x}$. Define the relative error

$$
\epsilon = \epsilon(x) = \frac{\fl{x} - x}{x}
$$

We have the following bounds on the relative error, for chopping and rounding respectively:

$$
-\beta^{-t + 1} \le \epsilon \le 0, \qquad
-\half \beta^{-t + 1} \le \epsilon \le \half \beta^{-t+1}
$$

Thus the floating point approximation $\fl{x}$ of $x \in \re$ satisfies

$$
\fl{x} = x (1 + \epsilon)
$$

This type of analysis was introduced by Wilkinson (1963). It allows us to deal precisely with the effects of rounding/chopping in computer arithmetic operations.
3 Proof for chopping

Take the sign $s=+1$. Then $\fl{x} \le x$ and

$$
x - \fl{x} = (.00\ldots 0a_{t+1}\ldots)_\beta \cdot \beta^e
$$

Let $\gamma = \beta-1$ (the largest digit in base $\beta$). Then

$$
\begin{align*}
0 & \le x - \fl{x} \\
& \le (.00\ldots 0 \gamma\gamma\ldots)_\beta \cdot \beta^e \\
& = \gamma \left[ \frac{1}{\beta^{t+1}} + \frac{1}{\beta^{t+2}} + \ldots \right] \beta^e \\
& = \gamma \frac{\beta^{-t-1}}{1- \beta^{-1}} \beta^e \\
& = \beta^{-t + e}
\end{align*}
$$

The relative error is

$$
\begin{align*}
0 &\le \frac{x - \fl{x}}{x} \\
&\le \frac{\beta^{-t+e}}{(.a_1a_2\ldots)_\beta \cdot \beta^e} \\
&\le \frac{\beta^{-t}}{(.100\ldots)_\beta} \\
&= \beta^{-t+1}
\end{align*}
$$

4 Accuracy of floating point numbers: Unit Round

The unit round $\uround$ is a positive floating point number. It is the smallest such number for which
$$
\fl{1 + \uround} > 1
$$

This means that for any $\delta < \uround$,

$$
\fl{1 + \delta} = 1
$$

i.e., $1$ and $1 + \delta$ are identical within the computer arithmetic. Thus the unit round characterizes the accuracy of arithmetic computations. We have
$$
\uround = \begin{cases}
\beta^{-t+1} & \textrm{for chopping} \\
\half\beta^{-t+1} & \textrm{for rounding}
\end{cases}
$$

Hence we get the important relation for floating point numbers

$$
\boxed{\fl{x} = x(1 + \epsilon), \qquad |\epsilon| \le \uround}
$$

With rounding on a binary computer, show that $\uround = 2^{-t}$.
We must show that
$$
\fl{1 + 2^{-t}} > 1
$$

Firstly, $2^{-t} = (.10\ldots 0)_2 \cdot 2^{-t+1}$ is a floating point number. Moreover,

$$
\begin{align*}
1 + 2^{-t} &= [ (.10\ldots 0)_2 + (.00\ldots 010\ldots)_2] \cdot 2^1 \\
&= (.10\ldots 01)_2 \cdot 2^1 \qquad \textrm{(1 at $(t+1)$'th position)}
\end{align*}
$$

Hence rounding gives

$$
\begin{align*}
\fl{1 + 2^{-t}} &= (.10\ldots 10)_2 \cdot 2^1 \qquad \textrm{(rounding gives 1 at $t$'th position)} \\
&= 1 + 2^{-t+1} \\
&> 1
\end{align*}
$$

If $\delta < \uround$, then
$$
\begin{align*}
1 + \delta &= [ (.10\ldots 0)_2 + (.00\ldots 00\ldots)_2] \cdot 2^1 \qquad \textrm{(zero in $(t+1)$'th position)} \\
&= (.10\ldots 00\ldots)_2 \cdot 2^1
\end{align*}
$$

Hence, rounding to $t$ places gives

$$
\fl{1+\delta} = (.10\ldots 0)_2 \cdot 2^1 = 1
$$

The following code finds the unit round by finding the smallest $i$ such that $1 + 2^{-i} = 1$.
i, x = 0, 1.0
while 1.0 + x > 1.0:   # stop when 1 + x rounds to 1
    x = x / 2
    i = i + 1
print(i)
This shows that in double precision, $1 + 2^{-53} = 1$, and thus the unit round is $\uround = 2^{-53}$.
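numpy reports the closely related machine epsilon (the gap between 1 and the next larger float, i.e. $2\uround$ with the convention used here), and the bound $|\fl{x} - x| \le \uround |x|$ can be checked empirically by rounding double precision numbers to single precision, whose unit round is $2^{-24}$ (see the IEEE section below). A small sketch, using arbitrarily chosen random numbers:

import numpy as np

print(np.finfo(np.float64).eps, 2.0**-52)       # machine epsilon = 2u for double precision

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, 100000)              # double precision values, treated as exact
x32 = x.astype(np.float32).astype(np.float64)   # fl(x) in single precision
print(np.abs((x32 - x)/x).max(), 2.0**-24)      # max relative error vs u = 2**-24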
5 Underflow and overflow

The smallest positive floating point number in the system $(\beta,t,L,U)$ is

$$
x_L = (.10\ldots 0)_\beta \cdot \beta^L = \beta^{L-1}
$$

and the largest positive floating point number is

$$
x_U = (.\gamma\gamma\ldots\gamma)_\beta \cdot \beta^U = (1-\beta^{-t}) \beta^U
$$

The range of numbers that are representable in the floating-point system is

$$
x_L \le |x| \le x_U
$$

Obviously, we cannot represent a number which is outside this range.
- Overflow error: $|x| > x_U$ leads to a fatal error in the computation; the program may terminate.
- Underflow error: $|x| < x_L$ may not be a fatal error; $\fl{x}$ is set to zero and computations can continue. However, in some cases, accuracy may be severely lost.

6 IEEE floating point system

The IEEE floating point system was developed under the leadership of Prof. William Kahan of Univ. of California, Berkeley. Project 754 was formed in 1970 to produce the best possible definition of floating-point arithmetic. The IEEE 754 Standard gives
- definition of floating-point numbers
- rounding operations
- format conversion
- exception rules (invalid operation, division by zero, overflow, underflow)

6.1 IEEE single precision

Fortran77: real*4, Fortran90: real, C/C++: float

- 32 bit word
- 1 bit for the sign
- 8 bits for the biased exponent $(L=-126, U=127)$; $2^8 = 256 = 126 + 1 + 127 + 1 + 1$, the last two are for Inf and NaN
- 23 bits for the fractional part (mantissa)
- Largest number $= (2 - 2^{-23}) \cdot 2^{127} \approx 3.4028 \times 10^{38}$
- Smallest positive normalized number $= 2^{-126} \approx 1.1755 \times 10^{-38}$
- Unit roundoff $\uround = 2^{-24} \approx 5.96 \times 10^{-8}$, which corresponds to about seven decimal places
6.2 IEEE double precision

Fortran77: real*8, Fortran90: double precision, C/C++: double

- Based on two 32 bit words or one 64 bit word
- 1 bit for the sign
- 11 bits for the biased exponent $(L=-1022, U=1023)$
- 52 bits for the fractional part
- Largest number $= (2 - 2^{-52}) \cdot 2^{1023} \approx 1.7977 \times 10^{308}$
- Smallest positive normalized number $= 2^{-1022} \approx 2.2251 \times 10^{-308}$
- Unit round $\uround = 2^{-53} \approx 1.11 \times 10^{-16}$
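These limits can be checked from numpy; the following small sketch prints the smallest normalized number, the largest number and the machine epsilon for single and double precision (numpy is already available through pylab, but the import is repeated so the snippet is self-contained).

import numpy as np

for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    # tiny = smallest positive normalized number, eps = gap between 1 and the next float
    print(dtype.__name__, info.tiny, info.max, info.eps)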
7 Precision of computer arithmetic and physics

With IEEE double precision, the interval $[1,2]$ is approximated by about $10^{16}$ floating point numbers. In a handful of solid or liquid, or a balloon of gas, the number of atoms/molecules in a line from one point to another is of the order of $10^8$ (the cube root of the Avogadro number). Such a system behaves like a continuum, enabling us to define physical quantities like density, pressure, stress, strain and temperature. Computer arithmetic is more than a million times finer than this.

The fundamental constants of physics are known to only a few decimal places.
- gravitational constant $G$: 4 digits
- Planck's constant $h$: 7 digits
- elementary charge $e$: 7 digits
- ratio of magnetic moment of electron to the Bohr magneton $\mu_e/\mu_B$: 12 digits

Nothing in physics is known to more than 12 or 13 digits of accuracy. IEEE numbers are orders of magnitude more precise than any number in science. Mathematical quantities like $\pi$ are of course known to more accuracy.
In the early days of numerical analysis, computers were not very precise, and numerical algorithms, especially conditioning and stability, were not well understood. Hence there was too much emphasis on studying rounding errors. Today, computer arithmetic is more precise and well understood, and we can control rounding errors through the stability of numerical algorithms.
The main business of numerical analysis is designing algorithms that converge quickly; rounding error analysis is rarely the central issue. If rounding errors vanished, 99% of numerical analysis would remain. - Lloyd N. Trefethen
8 Propagation of errors

Let us consider the basic arithmetic operations $+, -, *, /$. The computer versions of these operations are approximations due to rounding.
$$
\op: \textrm{exact operation} \qquad \ap: \textrm{computer approximation}
$$

Let

$$
x, y : \textrm{real numbers}, \qquad \hatx, \haty : \textrm{computer representations}
$$

where

$$
\hatx = \fl{x} = x(1+\epsilon_x), \qquad \haty = \fl{y} = y (1 + \epsilon_y)
$$

Let us write this as

$$
x = \hatx + \epsilon, \qquad y = \haty + \eta
$$

Also

$$
x \op y : \textrm{exact value}, \qquad \hatx \ap \haty : \textrm{computed value}
$$

where

$$
\hatx \ap \haty = \fl{\hatx \op \haty}
$$

The operation $\ap$ arises because $\hatx \op \haty$ may not be representable in the floating point system and has to be rounded. The error is
$$
x \op y - \hatx \ap \haty = \underbrace{[x \op y - \hatx \op \haty]}_{\textrm{propagated error}} + \underbrace{[\hatx \op \haty - \hatx \ap \haty]}_{\textrm{rounding error}}
$$

The rounding error is bounded by the unit round,

$$
|\hatx \op \haty - \hatx \ap \haty| \le |\hatx \op \haty| \half \beta^{-t+1}
$$

and the only way to control this is to use higher precision. Let us study the propagated error.
8.1 Multiplication

$$
xy - \hatx\haty = xy - (x-\epsilon)(y-\eta) = x\eta + y\epsilon - \epsilon\eta
$$

The relative error is

$$
\rel{\hatx\haty} := \frac{xy - \hatx\haty}{xy} = \frac{\epsilon}{x} + \frac{\eta}{y} - \frac{\epsilon}{x} \frac{\eta}{y} = \rel{\hatx} + \rel{\haty} - \rel{\hatx} \rel{\haty}
$$

Hence

$$
\rel{\hatx\haty} \approx \rel{\hatx} + \rel{\haty}
$$

8.2 Division

Consider the division of two numbers:
$$
\rel{\frac{\hatx}{\haty}} = \frac{\rel{\hatx} - \rel{\haty}}{1 - \rel{\haty}}
$$

If $\rel{\haty} \ll 1$ then

$$
\rel{\frac{\hatx}{\haty}} \approx \rel{\hatx} - \rel{\haty}
$$

For multiplication and division, relative errors do not propagate rapidly.
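As a quick numerical check of the multiplication rule, the sketch below compares the relative errors; the values $1/3$ and $1/7$ are arbitrary, and the float32 representations are multiplied in double precision so that only the propagated error is visible.

import numpy as np

x, y = 1.0/3.0, 1.0/7.0                  # double precision values, treated as exact
xh, yh = np.float32(x), np.float32(y)    # single precision representations

rel = lambda exact, approx: (exact - approx)/exact   # RE as defined above

re_x, re_y = rel(x, xh), rel(y, yh)
re_xy = rel(x*y, float(xh)*float(yh))    # propagated error of the product
print(re_x + re_y, re_xy)                # the two numbers are nearly equal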
8.3 Addition/subtraction

Consider adding or subtracting two numbers:

$$
(x \pm y) - (\hatx \pm \haty) = (x - \hatx) \pm (y - \haty) = \epsilon \pm \eta
$$

The absolute error is

$$
\err{\hatx \pm \haty} = \err{\hatx} \pm \err{\haty}
$$

The absolute error is well-behaved, but relative errors can be large if $x$ and $y$ are nearly equal and we are doing subtraction.
Example 1 (Round-off errors)
Evaluate
$$
f(x) = 1 - \cos x
$$

for $x = 10^{-8}$. This yields $f = 0.0$ even in double precision.
from math import cos
x = 1.0e-8
y = 1.0 - cos(x)
print("%24.14e" % y)
An equivalent expression is
$$
f(x) = 2 \sin^2(x/2)
$$

which yields $f = 5 \times 10^{-17}$.
from math import sin
z = 2.0*sin(0.5*x)**2
print("%24.14e" % z)
The first form suffers from roundoff errors while the second one is more accurate. This may or may not be a problem. A situation where this is problematic is if you want to compute

$$
f(x) = \frac{1 - \cos(x)}{x^2}
$$

y = (1.0 - cos(x))/x**2
print("%24.14e" % y)

We will get an answer of zero, but the correct value is closer to $\half$.
z = 2.0*sin(0.5*x)**2 / x**2
print("%24.14e" % z)
Example 3 (polynomial evaluation)
Let us evaluate
$$
y = x^3 - 3 x^2 + 3 x - 1
$$

in single precision. Note that it can also be written as

$$
y = (x-1)^3
$$

x = linspace(0.99,1.01,50,dtype=float32)
y = x**3 - 3.0*x**2 + 3.0*x - 1.0
plot(x,y,'-o',x,(x-1)**3)
xlabel('x'), ylabel('y')
legend(('Single precision','Exact'));
Now, let us do it in double precision.
x = linspace(0.99,1.01,50,dtype=float64)
y = x**3 - 3.0*x**2 + 3.0*x - 1.0
plot(x,y,'-o',x,(x-1)**3)
xlabel('x'), ylabel('y')
legend(('Double precision','Exact'));
9 Errors in summation

Suppose we want to compute the sum

$$
s_n = \sum_{i=1}^n x_i
$$

In a computer program, we would perform a loop and accumulate the sum sequentially. E.g., in Fortran, the code may look like this:
s = 0.0
do i=1,n
s = s + x(i)
enddo
Let us assume that all the $x_i$ are floating point numbers. Let $s_1 = x_1$ and compute

$$
s_{k+1} = s_k + x_{k+1}, \qquad k=1,2,\ldots,n-1
$$

But a computer will perform roundoff in each step, so we actually compute

$$
\hats_1 = x_1, \qquad \hats_{k+1} = \fl{\hats_k + x_{k+1}} = (\hats_k + x_{k+1})(1 + \epsilon_{k+1})
$$

where each $|\epsilon_k| \le \uround$. Define the relative error
$$
\rho_k = \frac{\hats_k - s_k}{s_k}
$$

Then

$$
\begin{align*}
\rho_{k+1} &= \frac{\hats_{k+1} - s_{k+1}}{s_{k+1}} \\
&= \frac{(\hats_k + x_{k+1})(1+\epsilon_{k+1}) - (s_k + x_{k+1})}{s_{k+1}} \\
&= \frac{[s_k(1+\rho_k) + x_{k+1}](1+\epsilon_{k+1}) - (s_k + x_{k+1})}{s_{k+1}} \\
&= \epsilon_{k+1} + \rho_k \frac{s_k}{s_{k+1}}(1 + \epsilon_{k+1})
\end{align*}
$$

Let us assume that each $x_i \ge 0$ so that $s_k \le s_{k+1}$. Hence
$$
|\rho_{k+1}| \le \uround + |\rho_k| (1 + \uround) = \uround + |\rho_k| \theta, \qquad \theta := 1 + \uround
$$

and so

$$
\begin{align*}
|\rho_1| &= 0 \\
|\rho_2| &\le \uround \\
|\rho_3| &\le \uround + \uround\theta = (1 + \theta) \uround \\
|\rho_4| &\le \uround + (\uround + \uround\theta)\theta = (1 + \theta + \theta^2) \uround \\
&\vdots
\end{align*}
$$

The relative error in the sum can be bounded as
$$
\begin{align*}
|\rho_n| &\le (1 + \theta + \ldots + \theta^{n-2}) \uround \\
&= \left[ \frac{\theta^{n-1} - 1}{\theta-1} \right] \uround \\
&= (1 + \uround)^{n-1} - 1 \\
&\approx (n-1) \uround
\end{align*}
$$

We have $(n-1)$ additions, and the above estimate says that the relative error is proportional to the number of additions. Usually this is a pessimistic estimate since we took absolute values; in actual computations, the successive errors can cancel, leading to a much smaller total error in the sum.
9.1 A closer look

Let us estimate some of the partial sums.
$$
\begin{align*}
\hats_1 &= x_1 \\
\hats_2 &= (\hats_1 + x_2)(1 + \epsilon_2) = s_2 + s_2 \epsilon_2 \\
\hats_3 &= (\hats_2 + x_3)(1 + \epsilon_3) = (s_3 + s_2 \epsilon_2)(1+\epsilon_3) = s_3 + s_2 \epsilon_2 + s_3 \epsilon_3 + O(\uround^2) \\
\hats_4 &= (\hats_3 + x_4)(1+\epsilon_4) = (s_4 + s_2 \epsilon_2 + s_3 \epsilon_3 + O(\uround^2))(1 + \epsilon_4) \\
&= s_4 + s_2 \epsilon_2 + s_3 \epsilon_3 + s_4 \epsilon_4 + O(\uround^2) \\
&\vdots
\end{align*}
$$

Hence the error in the sum is
$$
\begin{align*}
\hats_n - s_n &= s_2 \epsilon_2 + \ldots + s_n \epsilon_n + O(\uround^2) \\
&= x_1 (\epsilon_2 + \ldots + \epsilon_n) + x_2 (\epsilon_2 + \ldots + \epsilon_n) + x_3 (\epsilon_3 + \ldots + \epsilon_n) + \ldots + x_n \epsilon_n
\end{align*}
$$

Note that the coefficients of $x_i$ decrease with increasing $i$. If we want to minimize the error, we can sort the numbers in increasing order $x_1 \le x_2 \le \ldots \le x_n$ and then sum them, as in the sketch below. In practice, the $\epsilon_i$ can be positive or negative, so the precise behaviour can be case dependent.
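A small experiment illustrating the effect of the ordering; the series $\sum_k 1/k^2$ and $n = 10^5$ are arbitrary choices, and the accumulation is done in single precision so that the difference is visible.

import numpy as np

n = 10**5
terms = 1.0/np.arange(1, n + 1, dtype=np.float64)**2   # decreasing terms
exact = terms.sum()                                    # double precision reference

def sum32(t):
    # accumulate sequentially in single precision
    s = np.float32(0.0)
    for v in t:
        s = s + np.float32(v)
    return s

print(abs(sum32(terms) - exact))         # largest terms first
print(abs(sum32(terms[::-1]) - exact))   # smallest terms first: usually more accurate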
10 Function evaluation

Suppose we want to compute a function value

$$
y = f(x), \qquad x \in \re
$$

There are two types of errors we have to deal with on a computer. Firstly, the real number has to be approximated by a floating point number

$$
\hatx = \fl{x}
$$

Secondly, the function $f$ may be computed only approximately as $\hatf$. The only computations a computer can do exactly are addition and multiplication, and even here we have roundoff errors. Any other type of computation, even division, can only be performed approximately. E.g., if $f(x) = \sin(x)$, a computer will evaluate it approximately by some truncated Taylor approximation of $\sin(x)$. So what we actually evaluate on a computer is

$$
\haty = \hatf(\hatx)
$$

The absolute error is
$$
\begin{align*}
\err{\haty} &= |y - \haty| \\
&= |f(x) - \hatf(x) + \hatf(x) - \hatf(\hatx)| \\
&\le |f(x) - \hatf(x)| + |\hatf(x) - \hatf(\hatx)|
\end{align*}
$$

We have two sources of error.
- $|f(x) - \hatf(x)|$: numerical discretization error, which can be controlled by using an accurate algorithm.
- $|\hatf(x) - \hatf(\hatx)|$: transmission of round-off error by the numerical scheme, which can be controlled by the stability of the numerical scheme and by using enough precision so that $x - \hatx$ is small.

The function

$$
f(x) = \coth(x) - \frac{1}{x}
$$

is finite at $x=0$ but the two terms individually are not. Note that

$$
f(x) = \frac{x}{3} - \frac{x^3}{45} + \frac{2 x^5}{945} + \ldots
$$

Approximate this in two parts, $[0,\delta]$ and $[\delta,\infty)$. How should we choose $\delta$?
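A hedged sketch of such a two-branch evaluation is given below; the switch point delta used here is only a placeholder, and choosing it well (balancing the truncation error of the series against the cancellation in $\coth(x) - 1/x$) is precisely the question posed above.

import numpy as np

def f(x, delta=1e-3):
    # Evaluate coth(x) - 1/x, switching to the series x/3 - x^3/45 + 2x^5/945
    # for |x| < delta to avoid the cancellation near x = 0.
    # The value of delta is an arbitrary placeholder, not a recommended choice.
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    small = np.abs(x) < delta
    xs = x[small]
    out[small] = xs/3.0 - xs**3/45.0 + 2.0*xs**5/945.0
    xl = x[~small]
    out[~small] = 1.0/np.tanh(xl) - 1.0/xl
    return out

print(f([1.0e-10, 1.0e-4, 0.1, 1.0]))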