
Computer representation of numbers

#%config InlineBackend.figure_format = 'svg'
from pylab import *

We are used to the decimal system for counting. In general, we can use any base $\beta$ for the number system. On most modern computers the base is $\beta = 2$, i.e., we use the binary number system. Any non-zero number $x$ is written as

x = s \cdot (.a_1 a_2 \ldots a_t)_\beta \cdot \beta^e = \textrm{(sign)} \cdot \textrm{(fractional part)} \cdot \textrm{(exponent)}

where

\begin{aligned}
s = -1 \textrm{ or } +1 \\
0 \le a_i \le \beta - 1 \\
L \le e \le U \textrm{ is an integer}
\end{aligned}

Written out in base 10, the fractional part has the value

(.a_1 a_2 \ldots a_t)_\beta = \frac{a_1}{\beta} + \frac{a_2}{\beta^2} + \ldots + \frac{a_t}{\beta^t}

We will assume that $a_1 \ne 0$, which is called a normalized floating point system, and then

1 \le a_1 \le \beta-1, \qquad 0 \le a_i \le \beta - 1, \quad i=2,\ldots,t

$(\beta, t, L, U)$ specifies the arithmetic characteristics of the floating point system.

1 Chopping and rounding

Most real numbers cannot be exactly represented on a computer since we have finite precision. Hence they must be represented by a nearby number in the floating point system. This requires us to truncate a real number to fit into the floating point system, which can be done in two ways: chopping and rounding.

Consider a real number

x = s \cdot (.a_1 a_2 \ldots a_t a_{t+1} \ldots )_\beta \cdot \beta^e, \qquad a_1 \ne 0

Let

\fl{x} = \textrm{approximation of $x$ in the floating point system}

chopped machine representation

\fl{x} = s \cdot (.a_1 a_2 \ldots a_t)_\beta \cdot \beta^e

We throw away all digits in the fractional part that do not fit into the floating-point system.

rounded machine representation

\fl{x} = \begin{cases}
s \cdot (.a_1 a_2 \ldots a_t)_\beta \cdot \beta^e & 0 \le a_{t+1} < \half\beta \\
s \cdot [(.a_1 a_2 \ldots a_t)_\beta + (.0\ldots01)_\beta ]\cdot \beta^e & \half\beta \le a_{t+1} < \beta
\end{cases}

We approximate by the nearest floating point number.
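To make the two rules concrete, here is a small Python sketch of chopping and rounding in a toy decimal system ($\beta = 10$) with $t$ digits. The helper names normalize, chop and round_fl are illustrative, not library functions; the sketch itself runs in binary double precision, so it only approximates exact base-10 arithmetic, and it assumes $x > 0$.

from math import floor, log10

def normalize(x):
    # write x = f * 10^e with 0.1 <= f < 1 (assumes x > 0)
    e = floor(log10(x)) + 1
    return x / 10**e, e

def chop(x, t):
    # keep only the first t digits of the fraction, discard the rest
    f, e = normalize(x)
    return floor(f * 10**t) / 10**t * 10**e

def round_fl(x, t):
    # round digit t using digit t+1
    f, e = normalize(x)
    return floor(f * 10**t + 0.5) / 10**t * 10**e

x = 2.0 / 3.0
print(chop(x, 4), round_fl(x, 4))    # 0.6666 0.6667 (up to double precision effects)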

2 Error in computer representation

For most real numbers, $x \ne \fl{x}$. Define the relative error

\epsilon = \epsilon(x) = \frac{\fl{x} - x}{x}

We have the following bounds on the relative error:

  • Chopping
    -\beta^{-t + 1} \le \epsilon \le 0
  • Rounding
    -\half \beta^{-t + 1} \le \epsilon \le \half \beta^{-t+1}

Thus the floating point approximation $\fl{x}$ of $x \in \re$ satisfies

\fl{x} = x (1 + \epsilon)

This type of analysis was introduced by Wilkinson (1963). It allows us to deal precisely with the effects of rounding/chopping in computer arithmetic operations.
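As a quick check of this bound in IEEE double precision ($\beta = 2$ with rounding, for which the unit round $\uround = 2^{-53}$ is derived below), the following sketch uses Python's Fraction to measure the representation error exactly; the sample fractions are arbitrary.

from fractions import Fraction

u = Fraction(1, 2**53)                   # unit round of IEEE double precision
for p, q in [(1, 3), (1, 10), (2, 7)]:
    x   = Fraction(p, q)                 # exact value
    flx = Fraction(p / q)                # the double produced by p/q, converted exactly
    eps = abs(flx - x) / x
    print(f"{p}/{q}: |eps| = {float(eps):.3e},  u = {float(u):.3e}")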

3 Proof for chopping

Take the sign $s=+1$. Then $\fl{x} \le x$ and

x - \fl{x} = (.00\ldots 0a_{t+1}\ldots)_\beta \cdot \beta^e

Let $\gamma=\beta-1$ (largest digit in base $\beta$). Then

\begin{align*}
0 & \le x - \fl{x} \\
& \le (.00\ldots 0 \gamma\gamma\ldots)_\beta \cdot \beta^e \\
& = \gamma \left[ \frac{1}{\beta^{t+1}} + \frac{1}{\beta^{t+2}} + \ldots \right] \beta^e \\
& = \gamma \frac{\beta^{-t-1}}{1- \beta^{-1}} \beta^e \\
& = \beta^{-t + e}
\end{align*}

The relative error is

\begin{align*}
0 &\le \frac{x - \fl{x}}{x} \\
&\le \frac{\beta^{-t+e}}{(.a_1a_2\ldots)_\beta \cdot \beta^e} \\
&\le \frac{\beta^{-t}}{(.100\ldots)_\beta} \\
&= \beta^{-t+1}
\end{align*}

4 Accuracy of floating point numbers: Unit Round

The unit round $\uround$

  • It is a positive floating point number.
  • It is the smallest such number for which
    $\fl{1 + \uround} > 1$

This means that for any $\delta < \uround$,

\fl{1 + \delta} = 1

$1$ and $1 + \delta$ are identical within the computer arithmetic. Thus the unit round characterizes the accuracy of arithmetic computations. We have

\uround = \begin{cases}
\beta^{-t+1} & \textrm{for chopping} \\
\half\beta^{-t+1} & \textrm{for rounding}
\end{cases}

Hence we get the important relation for floating point numbers

\boxed{\fl{x} = x(1 + \epsilon), \qquad |\epsilon| \le \uround}

With rounding on a binary computer, show that $\uround = 2^{-t}$.

We must show that

\fl{1 + 2^{-t}} > 1

Firstly $2^{-t} = (.10\ldots 0)_2 \cdot 2^{-t+1}$ is a floating point number.

\begin{align*}
1 + 2^{-t} &= [ (.10\ldots 0)_2 + (.00\ldots 010\ldots)_2] \cdot 2^1 \\
&= (.10\ldots 01)_2 \cdot 2^1 \qquad \textrm{(1 at $(t+1)$'th position)}
\end{align*}

Hence rounding gives

\begin{align*}
\fl{1 + 2^{-t}} =& (.10\ldots 01)_2 \cdot 2^1 \qquad \textrm{(rounding gives 1 at $t$'th position)} \\
=& 1 + 2^{-t+1} \\
>& 1
\end{align*}

If $\delta < \uround$, then

\begin{align*}
1 + \delta =& [ (.10\ldots 0)_2 + (.00\ldots 00\ldots)_2] \cdot 2^1 \qquad \textrm{(zero in $(t+1)$'th position)} \\
=& (.10\ldots 00\ldots)_2 \cdot 2^1
\end{align*}

Hence, rounding to $t$ places gives

\fl{1+\delta} = (.10\ldots 0)_2 \cdot 2^1 = 1

The following code finds the unit round by finding the smallest $i$ such that $1 + 2^{-i} = 1$.

i, x = 0, 1.0
# Halve x until adding it to 1 no longer changes the result in double precision
while 1.0 + x > 1.0:
    x = x / 2
    i = i + 1
print(i)   # smallest i with fl(1 + 2^-i) = 1
53

This shows that in double precision, $1 + 2^{-53} = 1$, and thus the unit round is $\uround = 2^{-53}$.
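This can be checked directly in Python: IEEE double precision rounds to nearest (ties to even), and the machine epsilon reported by NumPy is the spacing between 1 and the next floating point number, i.e. $2\uround$.

import numpy as np

print(1.0 + 2.0**-53 == 1.0)       # True: 2^-53 is rounded away
print(1.0 + 2.0**-52 > 1.0)        # True: 2^-52 is the gap between 1 and the next double
print(np.finfo(np.float64).eps)    # 2.220446049250313e-16 = 2^-52 = 2u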

5 Underflow and overflow

The smallest positive floating point number in the system $(\beta,t,L,U)$ is

x_L = (.10\ldots 0)_\beta \cdot \beta^L = \beta^{L-1}

and the largest positive floating point number is

x_U = (.\gamma\gamma\ldots\gamma)_\beta \cdot \beta^U = (1-\beta^{-t}) \beta^U

The range of numbers that are representable in the floating-point system is

x_L \le |x| \le x_U

Obviously, we cannot represent a number which is outside this range.

  • Overflow error: $|x| > x_U$
    • leads to fatal error in computation
    • program may terminate
  • Underflow error: $|x| < x_L$
    • may not be a fatal error
    • $\fl{x}$ is set to zero and computations can continue
    • However, in some cases, accuracy may be severely lost (see the sketch below).
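A short Python sketch of both situations in double precision; sys.float_info provides the limits $x_U$ and $x_L$ quoted for IEEE double precision below.

import sys

x_max = sys.float_info.max          # largest double, about 1.7977e308
x_min = sys.float_info.min          # smallest positive normalized double, about 2.2251e-308

print(x_max * 2.0)                  # inf: overflow
print(x_min / 2.0)                  # a subnormal number: gradual underflow, reduced accuracy
print(x_min / 2.0**60)              # 0.0: underflow all the way to zero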

6 IEEE floating point system

The IEEE floating point system was developed under the leadership of Prof. William Kahan of the University of California, Berkeley. Project 754 was formed in 1977 to produce the best possible definition of floating-point arithmetic. The IEEE 754 Standard gives

  • definition of floating-point numbers
  • rounding operations
  • format conversion
  • exception rules (invalid operation, division by zero, overflow, underflow)

6.1 IEEE single precision

Fortran77: real*4, Fortran90: real, C/C++: float

  • 32 bit word
  • 1 bit for sign
  • 8 bits for biased exponent $(L=-126, U=127)$; $2^8 = 256 = 126 + 1 + 127 + 1 + 1$, the last two are for Inf and NaN
  • 23 bits for fractional part (mantissa)
  • Largest number = $(2 - 2^{-23}) \cdot 2^{127} \approx 3.4028 \times 10^{38}$
  • Smallest positive normalized number = $2^{-126} \approx 1.1755 \times 10^{-38}$
  • Unit roundoff, $\uround = 2^{-24} \approx 5.96 \times 10^{-8}$, corresponds to about seven decimal places

6.2 IEEE double precision

Fortran77: real*8, Fortran90: double precision, C/C++: double

  • Based on two 32 bit words or one 64 bit word
  • 1 bit for the sign
  • 11 bits for the biased exponent $(L=-1022, U=1023)$
  • 52 bits for the fractional part
  • Largest number = $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.7977 \times 10^{308}$
  • Smallest positive normalized number = $2^{-1022} \approx 2.2251 \times 10^{-308}$
  • Unit round $\uround = 2^{-53} \approx 1.11 \times 10^{-16}$ (see the check below)
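The values quoted in the two lists above can be read off from NumPy: finfo reports the largest number, the smallest positive normalized number, and eps, the spacing between 1 and the next float, so that eps/2 is the unit round under rounding.

import numpy as np

for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.max, info.tiny, info.eps / 2)
# float32 3.4028235e+38 1.1754944e-38 5.96e-08 (approximately)
# float64 1.7976931348623157e+308 2.2250738585072014e-308 1.11e-16 (approximately)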

7 Precision of computer arithmetic and physics

With IEEE double precision, the interval $[1,2]$ is approximated by about $10^{16}$ floating point numbers. In a handful of solid or liquid, or a balloon of gas, the number of atoms/molecules along a line from one point to another is of the order of $10^8$ (the cube root of the Avogadro number). Such a system behaves like a continuum, enabling us to define physical quantities like density, pressure, stress, strain and temperature. Computer arithmetic is more than a million times finer than this.

The fundamental constants of physics are known to only a few decimal places.

  • gravitational constant $G$ - 4 digits
  • Planck's constant $h$ - 7 digits
  • elementary charge $e$ - 7 digits
  • ratio of the magnetic moment of the electron to the Bohr magneton $\mu_e/\mu_B$ - 12 digits

Nothing in physics is known to more than 12 or 13 digits of accuracy. IEEE numbers are orders of magnitude more precise than any number in science. Mathematical quantities like π are of course known to more accuracy.

In the early days of numerical analysis, computers were not very precise, and numerical algorithms, especially conditioning and stability, were not well understood. Hence there was too much emphasis on studying rounding errors. Today, computer arithmetic is more precise and well understood, and we can control rounding errors through the stability of numerical algorithms.

The main business of numerical analysis is designing algorithms that converge quickly; rounding error analysis is rarely the central issue. If rounding errors vanished, 99% of numerical analysis would remain. - Lloyd N. Trefethen

8 Propagation of errors

Let us consider the basic arithmetic operations: $+, -, *, /$. The computer version of these operations is an approximation due to rounding.

\op: \textrm{exact operation} \qquad \ap: \textrm{computer approximation}

Let

x, y : \textrm{real numbers}, \qquad \hatx, \haty : \textrm{computer representation}

where

\hatx = \fl{x} = x(1+\epsilon_x), \qquad \haty = \fl{y} = y (1 + \epsilon_y)

Let us write this as

x = \hatx + \epsilon, \qquad y = \haty + \eta

Also

x \op y : \textrm{exact value}, \qquad \hatx \ap \haty : \textrm{computed value}

where

\hatx \ap \haty = \fl{\hatx \op \haty}

$\ap$ comes about because $\hatx \op \haty$ may not be representable in the floating point system and has to be rounded. The error is

x \op y - \hatx \ap \haty = \underbrace{[x \op y - \hatx \op \haty]}_{\textrm{propagated error}} + \underbrace{[\hatx \op \haty - \hatx \ap \haty]}_{\textrm{rounding error}}

The rounding error is bounded by the unit round

|\hatx \op \haty - \hatx \ap \haty| \le |\hatx \op \haty| \cdot \half \beta^{-t+1}

and the only way to control this is to use higher precision. Let us study the propagated error.

8.1 Multiplication

xy - \hatx\haty = xy - (x-\epsilon)(y-\eta) = x\eta + y\epsilon - \epsilon\eta

The relative error is

\rel{\hatx\haty} := \frac{xy - \hatx\haty}{xy} = \frac{\epsilon}{x} + \frac{\eta}{y} - \frac{\epsilon}{x} \frac{\eta}{y} = \rel{\hatx} + \rel{\haty} - \rel{\hatx} \rel{\haty}

Hence

\rel{\hatx\haty} \approx \rel{\hatx} + \rel{\haty}

8.2 Division

Consider division of two numbers

\rel{\frac{\hatx}{\haty}} = \frac{\rel{\hatx} - \rel{\haty}}{1 - \rel{\haty}}

If $\rel{\haty} \ll 1$ then

\rel{\frac{\hatx}{\haty}} \approx \rel{\hatx} - \rel{\haty}

For multiplication and division, relative errors do not propagate rapidly.
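This can be checked numerically: impose known relative errors on $x$ and $y$ and compare the relative error of the product with their sum. The values and perturbations below are arbitrary choices for illustration.

x, y   = 1.234567, 9.876543
ex, ey = 3.0e-9, -1.0e-9                 # imposed relative errors
xh, yh = x * (1.0 - ex), y * (1.0 - ey)  # perturbed values playing the role of x_hat, y_hat

re_prod = (x * y - xh * yh) / (x * y)
print(re_prod, ex + ey)                  # nearly equal, as predicted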

8.3 Addition/subtraction

Consider adding or subtracting two numbers

(x \pm y) - (\hatx \pm \haty) = (x - \hatx) \pm (y - \haty) = \epsilon \pm \eta

The absolute error is

\err{\hatx \pm \haty} = \err{\hatx} \pm \err{\haty}

The absolute error is well-behaved, but relative errors can be large if $x$ and $y$ are nearly equal and we are doing subtraction.
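The classic illustration is subtracting two nearly equal numbers in double precision: the absolute error stays at the level of $\uround$, but the relative error of the difference is hugely amplified.

x = 1.0 + 1.0e-12          # fl(1 + 10^-12), stored with a relative error of order u
y = 1.0
diff  = x - y              # this particular subtraction is exact
exact = 1.0e-12

print(diff)                          # approximately 1.00008890058e-12, not 1e-12
print(abs(diff - exact) / exact)     # relative error ~ 1e-4, far larger than u ~ 1e-16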

9 Errors in summation

Suppose we want to compute the sum

s_n = \sum_{i=1}^n x_i

In a computer program, we would perform a loop and accumulate the sum sequentially. E.g., in Fortran, the code may look like this:

s = 0.0
do i=1,n
   s = s + x(i)
enddo

Let us assume that all the $x_i$ are floating point numbers. Let $s_1 = x_1$ and compute

s_{k+1} = s_k + x_{k+1}, \qquad k=1,2,\ldots,n-1

But a computer will perform roundoff in each step, so we actually compute

\hats_1 = x_1, \qquad \hats_{k+1} = \fl{\hats_k + x_{k+1}} = (\hats_k + x_{k+1})(1 + \epsilon_{k+1})

where each $|\epsilon_k| \le \uround$. Define the relative error

\rho_k = \frac{\hats_k - s_k}{s_k}

Then

\begin{align*}
\rho_{k+1} =& \frac{\hats_{k+1} - s_{k+1}}{s_{k+1}} \\
=& \frac{(\hats_k + x_{k+1})(1+\epsilon_{k+1}) - (s_k + x_{k+1})}{s_{k+1}} \\
=& \frac{[s_k(1+\rho_k) + x_{k+1}](1+\epsilon_{k+1}) - (s_k + x_{k+1})}{s_{k+1}} \\
=& \epsilon_{k+1} + \rho_k \frac{s_k}{s_{k+1}}(1 + \epsilon_{k+1})
\end{align*}

Let us assume that each $x_i \ge 0$ so that $s_k \le s_{k+1}$. Hence

|\rho_{k+1}| \le \uround + |\rho_k| (1 + \uround) = \uround + |\rho_k| \theta, \qquad \theta := 1 + \uround

and so

\begin{align*}
|\rho_1| =& 0 \\
|\rho_2| \le& \uround \\
|\rho_3| \le& \uround + \uround\theta = (1 + \theta) \uround \\
|\rho_4| \le& \uround + (\uround + \uround\theta)\theta = (1 + \theta + \theta^2) \uround \\
\vdots&
\end{align*}

The relative error in the sum can be bounded as

\begin{align*}
|\rho_n| \le& (1 + \theta + \ldots + \theta^{n-2}) \uround \\
=& \left[ \frac{\theta^{n-1} - 1}{\theta-1} \right] \uround \\
=& (1 + \uround)^{n-1} - 1 \\
\approx& (n-1) \uround
\end{align*}

We have $(n-1)$ additions, and the above estimate says that the relative error is proportional to the number of additions. Usually this is a pessimistic estimate since we took absolute values; in actual computations, the successive errors can cancel, leading to a much smaller total error in the sum.

9.1 A closer look

Let us estimate some of the partial sums.

\begin{align*}
\hats_1 =& x_1 \\
\hats_2 =& (\hats_1 + x_2)(1 + \epsilon_2) = s_2 + s_2 \epsilon_2 \\
\hats_3 =& (\hats_2 + x_3)(1 + \epsilon_3) = (s_3 + s_2 \epsilon_2)(1+\epsilon_3) = s_3 + s_2 \epsilon_2 + s_3 \epsilon_3 + O(\uround^2) \\
\hats_4 =& (\hats_3 + x_4)(1+\epsilon_4) = (s_4 + s_2 \epsilon_2 + s_3 \epsilon_3 + O(\uround^2))(1 + \epsilon_4) \\
=& s_4 + s_2 \epsilon_2 + s_3 \epsilon_3 + s_4 \epsilon_4 + O(\uround^2) \\
\vdots &
\end{align*}

Hence the error in the sum is

\begin{align*}
\hats_n - s_n =& s_2 \epsilon_2 + \ldots + s_n \epsilon_n + O(\uround^2) \\
=& x_1 (\epsilon_2 + \ldots + \epsilon_n) + x_2 (\epsilon_2 + \ldots + \epsilon_n) + x_3 (\epsilon_3 + \ldots + \epsilon_n) + \ldots + x_n \epsilon_n
\end{align*}

Note that the coefficients of the $x_i$ decrease with increasing $i$. If we want to minimize the error, we can sort the numbers in increasing order, $x_1 \le x_2 \le \ldots \le x_n$, and then sum them, as the experiment below illustrates. In practice, the $\epsilon_i$ can be positive or negative, so the precise behaviour can be case dependent.
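The effect of the ordering is easiest to see in single precision. The experiment below (an illustrative sketch, not part of the algorithm above) adds one large number and many small ones: with the large number first, every small term is rounded away; with the small terms first, their contribution survives.

import numpy as np

x = np.concatenate(([1.0], np.full(10**5, 1.0e-8))).astype(np.float32)
exact = 1.0 + 10**5 * 1.0e-8             # = 1.001

s_big_first = np.float32(0.0)
for v in x:                              # 1.0 first, then the small terms
    s_big_first += v                     # each 1e-8 is below u*1 and is rounded away

s_small_first = np.float32(0.0)
for v in x[::-1]:                        # small terms first, the large one last
    s_small_first += v

print(s_big_first, s_small_first, exact) # about 1.0, about 1.001, 1.001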

10 Function evaluation

Suppose we want to compute a function value

y = f(x), \qquad x \in \re

There are two types of errors we have to deal with on a computer. Firstly, the real number has to be approximated by a floating point number

\hatx = \fl{x}

Secondly, the function $f$ may be computed only approximately as $\hatf$. The only computations a computer can do exactly are addition and multiplication, and even here we have roundoff errors. Any other type of computation, even division, can only be performed approximately. E.g., if $f(x) = \sin(x)$, a computer will evaluate it approximately by some truncated Taylor approximation of $\sin(x)$. So what we actually evaluate on a computer is

\haty = \hatf(\hatx)

The absolute error is

\begin{align*}
\err{\haty} =& |y - \haty| \\
=& |f(x) - \hatf(x) + \hatf(x) - \hatf(\hatx)| \\
\le& |f(x) - \hatf(x)| + |\hatf(x) - \hatf(\hatx)|
\end{align*}

We have two sources of error.

  1. $|f(x) - \hatf(x)|$: numerical discretization error; it can be controlled by using an accurate algorithm.
  2. $|\hatf(x) - \hatf(\hatx)|$: transmission of round-off error by the numerical scheme; it can be controlled by the stability of the numerical scheme and by using enough precision so that $x - \hatx$ is small. A small sketch of the two contributions follows.
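As a sketch of the two contributions, assume purely for illustration that $\hatf$ is a three-term truncated Taylor series for $\sin$; the code compares the discretization error $|f(x) - \hatf(x)|$ with the transmitted round-off $|\hatf(x) - \hatf(\hatx)|$, where $\hatx$ is $x$ perturbed at the level of the unit round.

import math

def f_hat(x):
    # truncated Taylor series: sin(x) ~ x - x^3/6 + x^5/120
    return x - x**3 / 6.0 + x**5 / 120.0

x  = 0.3
xh = x * (1.0 + 2.0**-52)                      # x perturbed by about one unit round

discretization = abs(math.sin(x) - f_hat(x))   # |f(x) - f_hat(x)|: set by the formula used
transmission   = abs(f_hat(x) - f_hat(xh))     # |f_hat(x) - f_hat(xh)|: set by the precision
print(discretization, transmission)            # roughly 4e-8 and 5e-17 for this x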