## 1 Quasi-Newton methods

Newton method requires the knowledge of the gradient. In many situations, it may not be possible to find the gradient. E.g., if $f(x)$ is available only as a computer code, then it may not be possible to compute the gradient[1]. In such situations we can try to approximate the gradient, and methods which make use of approximate derivatives are called *quasi-Newton methods*. For a scalar function $f(x)$, we can approximate the derivative using a finite difference approximation, e.g.,
$$
f'(x_k) \approx \frac{f(x_k+h) - f(x_k)}{h}, \qquad 0 < h \ll 1
$$

Since we want to generate a sequence of approximations to the root, we can use two successive values of $x_k$ to estimate the derivative
$$
f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}
$$

Using this derivative in place of the exact derivative in Newton method leads to the *secant method*.
## 2 Secant method

Given two starting values $x_0$, $x_1$ which are sufficiently close to the root $\alpha$, we construct a straight line approximation to $f(x)$ passing through the points $(x_0, f(x_0))$ and $(x_1, f(x_1))$. We find the zero of this straight line, which forms the next approximation $x_2$ to the root and is given by
$$
x_2 = x_1 - f(x_1) \frac{x_1 - x_0}{f(x_1) - f(x_0)}
$$

Inductively, we obtain
$$
x_{k+1} = x_k - f(x_k) \frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}
$$

Note that this is just the Newton iteration formula where the derivative has been approximated by the finite difference formula.
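The following is a minimal sketch of the secant iteration (the function name and tolerance are illustrative, not from the original notes); it uses the relative stopping test derived in the error estimate section below.

```python
# Minimal secant method sketch. Assumes f is continuous and the starting
# values x0, x1 are sufficiently close to a simple root.
def secant(f, x0, x1, eps=1.0e-12, maxiter=100):
    for k in range(maxiter):
        f0, f1 = f(x0), f(x1)
        x2 = x1 - f1*(x1 - x0)/(f1 - f0)   # secant update
        if abs(x2 - x1) < eps*abs(x2):     # relative change is small
            return x2
        x0, x1 = x1, x2
    return x1

# Example: root of cos(x) - x, which lies near 0.739
from math import cos
print(secant(lambda x: cos(x) - x, 0.5, 1.0))
```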
## 3 Convergence analysis

From the iteration formula, subtracting both sides from $\alpha$, we get
$$
\alpha - x_{k+1} = \alpha - x_k + f(x_k) \frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}
$$

Define the Newton divided differences
$$
f[x_{k-1},x_k] := \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}, \qquad
f[x_{k-1},x_k,\alpha] := \frac{f[x_k,\alpha] - f[x_{k-1},x_k]}{\alpha - x_{k-1}}
$$

Then it is easy to show that
$$
f[x_{k-1},x_k] = f'(y_k), \qquad \textrm{for some $y_k$ between $x_{k-1}, x_k$}
$$

and
$$
f[x_{k-1},x_k,\alpha] = \half f''(z_k), \qquad \textrm{for some $z_k$ between $x_{k-1}, x_k, \alpha$}
$$

Hence we get the error formula
$$
\alpha - x_{k+1} = -(\alpha - x_{k-1})(\alpha - x_k) \frac{f[x_{k-1},x_k,\alpha]}{f[x_{k-1},x_k]} = -(\alpha - x_{k-1})(\alpha - x_k) \frac{f''(z_k)}{2 f'(y_k)}
$$

If we can bound the last factor,
$$
\left| \frac{f''(z_k)}{2 f'(y_k)} \right| \le M < \infty
$$

then we get a bound on the error $e_k = |\alpha - x_k|$,
$$
e_{k+1} \le M e_{k-1} e_k
$$

Assume that $f(x)$, $f'(x)$, $f''(x)$ are continuous for all values of $x$ in some interval around $\alpha$, and that $f(\alpha) = 0$, $f'(\alpha) \ne 0$. If the initial guesses $x_0$, $x_1$ are sufficiently close to $\alpha$, then the iterates of the secant method converge to $\alpha$, and the order of convergence is
$$
p = \half (1 + \sqrt{5}) \approx 1.62
$$

*Proof.* (1) Pick a sufficiently small interval $I = [\alpha-\delta, \alpha+\delta]$ on which $f'(x) \ne 0$. This is possible to do since $f'(\alpha) \ne 0$ and $f'$ is continuous. Then define
$$
M = \frac{\max_{x \in I} |f''(x)|}{2 \min_{x \in I} |f'(x)|} < \infty
$$

For any $x_0, x_1 \in I$, the errors in the secant method satisfy
$$
\begin{aligned}
e_2 &\le e_1 \cdot M e_0 \\
M e_2 &\le M e_1 \cdot M e_0
\end{aligned}
$$

Further assume that $x_0$, $x_1$ are chosen such that
$$
\Delta = \max\{ M e_0, M e_1 \} < 1
$$

Then
$$
M e_2 \le \Delta^2 < 1
$$

Also
$$
M e_2 \le \Delta^2 < \Delta
$$

$$
\begin{aligned}
& \implies e_2 < \frac{\Delta}{M} = \max\{e_0, e_1\} \le \delta \\
& \implies x_2 \in I
\end{aligned}
$$

Applying the above argument inductively, we get
$$
\left.
\begin{aligned}
x_k &\in I \\
M e_k &\le \Delta < 1
\end{aligned}
\right\} \quad \textrm{for } k = 0, 1, 2, 3, \ldots
$$

To prove convergence, we apply the error inequality recursively.
$$
\begin{aligned}
M e_0 &\le \Delta \\
M e_1 &\le \Delta \\
M e_2 &\le M e_1 \cdot M e_0 \le \Delta^2 \\
M e_3 &\le M e_2 \cdot M e_1 \le \Delta^3 \\
M e_4 &\le M e_3 \cdot M e_2 \le \Delta^5 \\
M e_5 &\le M e_4 \cdot M e_3 \le \Delta^8
\end{aligned}
$$

In general,
$$
M e_k \le \Delta^{q_k}
$$

with
$$
q_0 = q_1 = 1, \qquad q_k = q_{k-1} + q_{k-2}
$$

We see that $q_k$ is the Fibonacci sequence, whose $k$'th term can be written as
$$
q_k = \frac{1}{\sqrt{5}} [ r_0^{k+1} - r_1^{k+1} ], \qquad k \ge 0
$$

where
$$
r_0 = \half (1 + \sqrt{5}) \approx 1.618, \qquad r_1 = \half (1 - \sqrt{5}) \approx -0.618
$$

Since $\Delta < 1$ and $q_k \to \infty$ as $k \to \infty$, we get

$$
\lim_{k \to \infty} e_k \le \lim_{k \to \infty} \frac{1}{M} \Delta^{q_k} = 0
$$

Hence the secant method converges to the root $\alpha$.
(2) Now let us estimate the convergence rate. Assume that
$$
e_k \approx A e_{k-1}^p
$$

and we will try to estimate $p$. Now
$$
e_{k-1} \approx \left( \frac{e_k}{A} \right)^{1/p}
$$

Using the error relation for the secant method,
$$
e_{k+1} \approx M e_k \cdot e_{k-1} \approx M e_k \cdot \frac{1}{A^{1/p}} e_k^{1/p}
$$

Hence
$$
A e_k^p \approx M A^{-1/p} e_k^{1 + 1/p}
$$

This implies that

$$
p = 1 + \frac{1}{p} \qquad \textrm{and} \qquad A \approx M A^{-1/p}
$$

The first equation, $p^2 - p - 1 = 0$, has the roots
$$
p = \frac{1 \pm \sqrt{5}}{2}
$$

We choose the positive root since we need $p > 1$; hence the convergence rate is
$$
p = \frac{1 + \sqrt{5}}{2} \approx 1.618
$$

This is less than quadratic convergence, but still better than linear convergence. From the second relation we get
$$
A \approx M^{p/(1+p)} \approx \left| \frac{f''(\alpha)}{2 f'(\alpha)} \right|^{\frac{\sqrt{5} - 1}{2}}
$$
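We can observe this rate numerically. The following sketch (the test function $f(x) = x^2 - 2$ and starting values are illustrative, not from the notes) prints the ratios $\log e_{k+1} / \log e_k$, which approach $p \approx 1.618$ as the iterates converge, since $\log e_{k+1} = \log A + p \log e_k$.

```python
# Observe the secant convergence rate: for e_{k+1} ~ A e_k^p,
# the ratio log(e_{k+1})/log(e_k) tends to p as e_k -> 0.
from math import sqrt, log

f = lambda x: x**2 - 2.0
alpha = sqrt(2.0)                  # exact root
x0, x1 = 1.0, 2.0
errors = [abs(alpha - x0), abs(alpha - x1)]
for k in range(5):
    x2 = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0))
    errors.append(abs(alpha - x2))
    x0, x1 = x1, x2
for k in range(1, len(errors) - 1):
    print(log(errors[k+1]) / log(errors[k]))
```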
## 4 Error estimate

We can use the function value to decide on convergence, but this may not be a good criterion if the function is very flat around the root. We can also estimate the error in the root itself. From the mean value theorem, denoting the root by $r$,
$$
f(x_k) = f(x_k) - f(r) = f'(\xi_k) (x_k - r)
$$

where $\xi_k$ is between $x_k$ and $r$.
$$
r - x_k = -\frac{f(x_k)}{f'(\xi_k)}
$$

If $f'$ is not changing rapidly between $x_k$ and $r$, then
$$
f'(\xi_k) \approx \frac{f(x_{k+1}) - f(x_k)}{x_{k+1} - x_k}
$$

We get the error estimate
$$
r - x_k \approx -\frac{f(x_k)}{\frac{f(x_{k+1}) - f(x_k)}{x_{k+1} - x_k}} = x_{k+1} - x_k
$$

and the relative error estimate
$$
\frac{r - x_k}{r} \approx \frac{x_{k+1} - x_k}{x_{k+1}}
$$

Hence we can use the test
$$
|x_{k+1} - x_k| < \epsilon |x_{k+1}|, \qquad 0 < \epsilon \ll 1
$$

to decide on convergence. This is used in the code examples to stop the iterations.
**Example**: Compute the derivative of $f(x) = \sin(x)$ at $x = 2\pi$ using the finite difference formula

$$
\frac{f(x+h) - f(x)}{h}
$$

for $h = 10^{-1}, 10^{-2}, \ldots, 10^{-14}$.
```python
from numpy import sin, arange, zeros, pi
from matplotlib.pyplot import loglog, xlabel, ylabel

f = lambda x: sin(x)
h = 10.0**arange(-1,-15,-1)        # h = 1e-1, 1e-2, ..., 1e-14
df = zeros(len(h))
x = 2.0*pi
f0 = f(x)
for i in range(len(h)):
    f1 = f(x+h[i])
    df[i] = (f1 - f0)/h[i]         # forward difference approximation
loglog(h, abs(df-1.0), 'o-')       # exact derivative is cos(2*pi) = 1
xlabel('h')
ylabel('Error in derivative');
```
Initially, as $h$ decreases, the error in the FD approximation decreases, but beyond some value, the error starts to increase with further decrease in $h$. Clearly, there seems to be an optimal value of $h$ for which the error is least.
## 5 Accuracy of FD approximation

The secant method can be written as
$$
x_{k+1} = x_k - \frac{f(x_k)}{g_k}, \qquad g_k = \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}
$$

As $x_k \to \alpha$, the quantities $x_k - x_{k-1}$ and $f(x_k) - f(x_{k-1})$ can suffer from round-off error, and $g_k$ can then be inaccurate. Let us analyze this using a Taylor expansion for the finite difference approximation,
$$
\frac{f(x+h) - f(x)}{h} = f'(x) + \half h f''(\xi), \qquad \textrm{$\xi$ between $x$ and $x+h$}
$$

We expect that as $h \to 0$, we get a more accurate approximation of the derivative. However, this assumes that we can evaluate $f(x)$ exactly and that the arithmetic is exact, neither of which is true in a computer. What we compute on the computer is actually (there is also round-off error in the subtraction, which we ignore)
$$
\frac{\hat{f}(x+h) - \hat{f}(x)}{h}
$$

where, assuming $f(x) = O(1)$,

$$
\hat{f}(x+h) = f(x+h) + \epsilon_2, \qquad \hat{f}(x) = f(x) + \epsilon_1
$$

and

$$
\epsilon_1, \epsilon_2 = O(\uround), \qquad \textrm{the unit round of the computer}
$$

Hence
$$
\begin{aligned}
\frac{\hat{f}(x+h) - \hat{f}(x)}{h} &= \frac{f(x+h) - f(x)}{h} + \frac{C \uround}{h} \\
&= f'(x) + \half h f''(\xi) + \frac{C \uround}{h}
\end{aligned}
$$

The error consists of the discretization error and the round-off error. As $h \to 0$, the discretization error decreases but the round-off error increases. There is an optimum value of $h$ for which the total error is minimized, given by
$$
h_{opt} = \left( \frac{2 C \uround}{|f''|} \right)^\half
$$

For the above finite difference approximation, the optimum step size is $h = O(\sqrt{\uround})$. Based on this analysis, we can modify the secant method as follows. If

$$
|x_k - x_{k-1}| > \left| \frac{x_{k-1} + x_k}{2} \right| \sqrt{\uround}
$$

use $g_k$ as given above. Otherwise, use

$$
g_k = \frac{f(x_k + h) - f(x_k)}{h}, \qquad h = |x_k| \sqrt{\uround}
$$
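A minimal sketch of this safeguarded slope computation is given below (the function name `slope` is illustrative, not from the notes; `sys.float_info.epsilon` plays the role of the unit round $\uround$).

```python
import sys
from math import sqrt

def slope(f, xk, xkm1):
    """Safeguarded secant slope g_k; assumes xk != 0 in the fallback."""
    u = sys.float_info.epsilon                 # unit round of the computer
    if abs(xk - xkm1) > abs(0.5*(xk + xkm1))*sqrt(u):
        return (f(xk) - f(xkm1))/(xk - xkm1)   # usual secant slope
    else:
        h = abs(xk)*sqrt(u)                    # near-optimal FD step size
        return (f(xk + h) - f(xk))/h
```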
## 6 Complex step method

We can take a complex step instead of a real step; this requires that $f$ can be evaluated for complex arguments, i.e., that $f$ is analytic. Taylor expansion with a small complex step yields

$$
f(x + \ii h) = f(x) + \ii h f'(x) - \half h^2 f''(x) + O(h^3)
$$

Hence
$$
\imag f(x + \ii h) = h f'(x) + O(h^3)
$$

and
$$
\frac{\imag f(x + \ii h)}{h} = f'(x) + O(h^2)
$$

is a second order accurate approximation to the derivative. This formula does not involve subtraction of nearly equal quantities, and hence it does not suffer from round-off error even if we take $h$ to be very small. In fact, it is perfectly fine to take $h$ smaller than machine precision, e.g., $h = 10^{-20}$.
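As a quick check (a sketch reusing the earlier example $f(x) = \sin(x)$ at $x = 2\pi$), the complex step formula recovers the derivative to full accuracy even with $h = 10^{-20}$, far below machine precision.

```python
import cmath

f = lambda x: cmath.sin(x)
x, h = 2.0*cmath.pi, 1.0e-20
df = f(x + 1j*h).imag / h        # complex step derivative
print(df, abs(df - 1.0))         # exact derivative is cos(2*pi) = 1
```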