In this module, based on Chapters 3 and 6 of NW, we will look at Newton methods and related variants, such as quasi-Newton methods. These powerful methods are applicable to many kinds of problems, including nonlinear ones, and (when they work) offer a faster rate of convergence than gradient descent or conjugate gradient methods.

Newton Method

The Newton method comes under the family of "line search" methods, with an update of the form: \(x_{k+1} = x_k + \alpha_k p_k\). We also recall the following:

  1. \(\alpha_k\) is computed via some inexact line search algorithm.

  2. The descent direction is given by the expression: \(p_k = - (\nabla^2 f_k)^{-1}\nabla f_k\).

  3. For this direction to be a legitimate descent direction, the Hessian needs to be positive definite (PD). This is easy to see when we consider the angle between the negative gradient and the Newton direction: \(\cos \theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\|\|p_k\|}\). Substituting \(p_k\) in the numerator simplifies it to: \(\nabla f_k^T (\nabla^2 f_k)^{-1}\nabla f_k\). Thus, in order that \(\cos\theta_k > 0\), i.e. \(-\pi/2<\theta_k<\pi/2\), we must have \(\nabla^2 f_k\) be PD (which in turn implies that its inverse is also PD). A small numerical illustration follows this list.
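As a small numerical illustration of point 3 (the matrices and gradient below are arbitrary choices of ours, not from NW), the snippet computes \(\cos\theta_k\) for the Newton direction under a PD Hessian and under an indefinite one; in the latter case the direction can point uphill:

```python
import numpy as np

def newton_cos_theta(H, g):
    """cos(theta) between the negative gradient g and the Newton direction p = -H^{-1} g."""
    p = -np.linalg.solve(H, g)          # Newton direction; solve rather than invert H
    return float(-g @ p / (np.linalg.norm(g) * np.linalg.norm(p)))

g = np.array([1.0, 0.0])                     # an arbitrary gradient
H_pd  = np.array([[3.0, 1.0], [1.0, 2.0]])   # PD Hessian (eigenvalues > 0)
H_ind = np.array([[1.0, 3.0], [3.0, 1.0]])   # indefinite Hessian (eigenvalues 4 and -2)

print(newton_cos_theta(H_pd, g))    # positive: a descent direction
print(newton_cos_theta(H_ind, g))   # negative here: not a descent direction
```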

To easily switch between Newton and quasi-Newton methods, we denote the descent direction as \(p_k = - B_k^{-1} \nabla f_k\), where:

  • If \(B_k = \nabla^2 f_k\), the true Hessian, we call it the Newton method,

  • If \(B_k\) is not the Hessian but an approximation to it, we call it a quasi-Newton method.
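A minimal sketch of a single iteration under these conventions (the function names are illustrative, and a plain backtracking/Armijo search stands in for a proper inexact Wolfe line search):

```python
import numpy as np

def newton_like_step(x, f, grad, B, alpha0=1.0, rho=0.5, c1=1e-4):
    """One iteration x_{k+1} = x_k + alpha_k p_k with p_k = -B_k^{-1} grad f_k.

    B(x) is either the true Hessian (Newton) or a PD approximation (quasi-Newton);
    the backtracking loop assumes p is a descent direction, i.e. B(x) is PD.
    """
    g = grad(x)
    p = -np.linalg.solve(B(x), g)        # solve B p = -g rather than forming B^{-1}
    alpha, fx = alpha0, f(x)
    while f(x + alpha * p) > fx + c1 * alpha * g @ p:   # sufficient-decrease test
        alpha *= rho
    return x + alpha * p

# usage: one step on the quadratic f(x) = x^T A x / 2 with the exact Hessian
A = np.array([[4.0, 1.0], [1.0, 3.0]])
x1 = newton_like_step(np.array([1.0, 1.0]),
                      f=lambda x: 0.5 * x @ A @ x,
                      grad=lambda x: A @ x,
                      B=lambda x: A)      # lands on the minimizer (0, 0) in one step
```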

Convergence

The global convergence properties of the Newton method follow in much the same way as those of the steepest descent method (refer here), namely via Zoutendijk’s condition. In particular, recall that Zoutendijk’s condition gives \(\sum_k \cos^2\theta_k \|\nabla f_k\|^2 < \infty\), so that \(\cos^2\theta_k \|\nabla f_k\|^2 \to 0\), as long as the conditions of the theorem are satisfied.

If we make the further assumption that \(B_k\) has a bounded condition number (apart from being PD), i.e. \(\kappa(B_k) = \|B_k\|\|B_k^{-1}\| \leq M\) for all \(k\) and some constant \(M\), then \(\cos\theta_k \geq 1/M\), which in turn implies that \(\|\nabla f_k\| \to 0\) as \(k \to \infty\), thus giving global convergence.
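Spelling out the last implication: combining \(\cos\theta_k \geq 1/M\) with Zoutendijk’s condition gives

\begin{equation} \|\nabla f_k\|^2 \leq M^2 \cos^2\theta_k \|\nabla f_k\|^2 \to 0, \end{equation}

so that \(\|\nabla f_k\| \to 0\).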

Proof

Returning to the expression for \(\cos\theta_k\) above, we see that (dropping the subscript \(k\)): \(\cos\theta = \frac{-\nabla f^T p}{\|\nabla f\|\|p\|}\). We will use a few properties:

  1. For a symmetric PD matrix \(B\): it is invertible, \(B^{1/2}\) and \(B^{-1/2}\) exist, and \(\|B^{1/2}\|=\|B\|^{1/2}\).

  2. The submultiplicative property of the Euclidean (operator) norm gives us \(\|AB\|\leq \|A\| \|B\|\).

Next, replace \(\nabla f = -Bp\). This gives:

\(\cos\theta = \frac{ p^T Bp}{\|Bp\|\|p\|} \geq \frac{p^T Bp}{\|B\|\|p\|^2}=\frac{\|B^{1/2}p\|^2}{\|B\|\|p\|^2} \geq \frac{\|B^{1/2}p\|^2}{\|B\|(\|B^{-1/2}\|\|B^{1/2}p\|)^2} = \frac{1}{\|B\| \|B^{-1}\|} \geq \frac{1}{M}\),

where the second inequality uses \(\|p\| = \|B^{-1/2}B^{1/2}p\| \leq \|B^{-1/2}\|\|B^{1/2}p\|\), the last equality uses \(\|B^{-1/2}\|^2 = \|B^{-1}\|\), and the final inequality is the bounded condition number assumption.
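As a quick numerical sanity check of this bound (entirely our own construction), we can draw a symmetric PD matrix at random and verify that \(\cos\theta \geq 1/\kappa(B)\):

```python
import numpy as np

rng = np.random.default_rng(0)
M_ = rng.standard_normal((5, 5))
B = M_ @ M_.T + 5 * np.eye(5)            # a random symmetric PD matrix
p = rng.standard_normal(5)               # an arbitrary direction
g = -B @ p                               # gradient consistent with p = -B^{-1} g

cos_theta = -g @ p / (np.linalg.norm(g) * np.linalg.norm(p))
kappa = np.linalg.cond(B)                # 2-norm condition number ||B|| ||B^{-1}||
print(cos_theta, 1 / kappa, cos_theta >= 1 / kappa)
```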

Rate of convergence

Since it is not guaranteed that the Hessian is always positive definite, it is not possible to ascertain the rate of convergence without first dealing with strategies for making the Hessian PD. However, in the proximity of a solution \(x^*\) the Hessian is known to be PD, and in such a vicinity the Newton method demonstrates quadratic convergence.

It can be shown that if \(f\) is twice differentiable and the Hessian is Lipschitz continuous in a neighborhood of \(x^*\) (where second order sufficiency conditions on \(x^*\) are satisfied), then the sequence of iterates converges quadratically to \(x^*\), provided the starting point \(x_0\) is sufficiently close to \(x^*\).

The proof can be seen in Th. 3.5 of NW, and we state the final result here:

\begin{equation} \|x_{k+1} - x^*\| \leq L \|\nabla^2 f(x^*)^{-1} \| \|x_k-x^*\|^2, \end{equation}

where \(L\) is the Lipschitz constant for \(\nabla^2 f(x)\) near \(x^*\).
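To see this rate in practice, the sketch below applies pure Newton steps (\(\alpha_k = 1\), which the method accepts near the solution) to the Rosenbrock function \(f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2\), whose minimizer \(x^* = (1,1)\) has a PD Hessian; the function and the starting point are our own choices for illustration. The printed errors shrink at an eventually quadratic rate:

```python
import numpy as np

def grad(x):
    """Gradient of f(x) = 100 (x2 - x1^2)^2 + (1 - x1)^2."""
    x1, x2 = x
    return np.array([-400 * x1 * (x2 - x1**2) - 2 * (1 - x1),
                     200 * (x2 - x1**2)])

def hess(x):
    """Hessian of the same function."""
    x1, x2 = x
    return np.array([[-400 * (x2 - x1**2) + 800 * x1**2 + 2, -400 * x1],
                     [-400 * x1, 200.0]])

x_star = np.array([1.0, 1.0])
x = np.array([1.05, 1.1])                        # start sufficiently close to x*
for k in range(6):
    x = x + np.linalg.solve(hess(x), -grad(x))   # full Newton step, alpha_k = 1
    print(k, np.linalg.norm(x - x_star))         # error shrinks, eventually quadratically
```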

Hessian modification

A very important question is: what does one do when the Hessian is not PD? This is a vast area in itself, covering ideas ranging from eigenvalue modification to perturbing the diagonal elements of the Hessian, and will not be covered in detail in this course. Interested readers are referred to Sec 3.4 of NW. Instead, we give one simple example of Hessian modification that enables a more general use of Newton’s method.

First, recall a result from linear algebra:

Cholesky Decomposition: If \(A\) is a real symmetric PD matrix, then it has a unique decomposition of the form: \(A = LL^T\), where \(L\) is a lower triangular matrix with positive diagonal entries. The converse also holds, i.e. if a matrix can be expressed as \(LL^T\) for some invertible \(L\), then it is symmetric PD.

Now, consider a problem where the Hessian (call it \(A=\nabla^2 f_k\)) is symmetric, but not positive definite. We can ask for a small perturbation \(\Delta A\) to \(A\) such that the resulting \(A+\Delta A\) is symmetric PD. The smallness of the perturbation needs to be specified; e.g. if we want \(\Delta A\) to have minimum 2-norm such that \(\lambda_{\min}(A + \Delta A) \geq \delta\) (for some user-specified small constant \(\delta\)), then a safe bet is:

\(\Delta A = \tau I\), where \(\tau = \max(0,\delta - \lambda_{\min}(A))\).

Proof

The 2-norm of a matrix is \(\|A\|_2 = \sigma_1\), i.e. the largest singular value (when a matrix is symmetric PD, the eigenvalues and singular values coincide). Since \(A\) is symmetric, it has an eigenvalue decomposition \(A = Q\Lambda Q^T\). We seek a \(\Delta A\) such that \(A + \Delta A\) becomes symmetric PD (with \(\lambda_{\min}(A+\Delta A) \geq \delta\)), for the most economical \(\|\Delta A\|_2\). Clearly, for \(A + \Delta A\) to become PD, its lowest eigenvalue needs to become positive. This is accomplished by choosing \(\Delta A = \tau I\) with \(\tau = \max(0,\delta - \lambda_{\min}(A))\), since \(A + \Delta A = Q(\Lambda + \tau I)Q^T\) (recall that \(I=QQ^T\) due to orthogonality of \(Q\)) has eigenvalues \(\lambda_i + \tau \geq \delta\). Obviously, \(\|\Delta A\|_2 = \tau\).
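A tiny worked example of this shift (the matrix and \(\delta\) are our own choices):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])               # symmetric, eigenvalues 3 and -1: indefinite
delta = 1e-3
tau = max(0.0, delta - np.linalg.eigvalsh(A).min())   # tau = 1.001
print(np.linalg.eigvalsh(A + tau * np.eye(2)))        # [0.001, 4.001]: PD, lambda_min >= delta
```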

While this suggests a deceptively simple choice for \(\tau\), in practice computing the eigenvalues of \(A\) is to be avoided.

So instead, we can think of an iterative procedure, where we start with a small value of \(\tau\) and increase it until \(A + \tau I\) becomes symmetric PD. We know that we have reached the goal when the Cholesky decomposition succeeds! See Algorithm 3.3 of NW for a specific implementation; a sketch is given below. The price to pay is a trial Cholesky decomposition at each iteration (the cost is about half that of an LU decomposition). Ideas from the Gershgorin circle theorem can be used to seed the initial value (this theorem gives a very inexpensive estimate of the range within which the eigenvalues must lie).
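A sketch along these lines (the constant \(\beta = 10^{-3}\), the initial shift and the doubling rule follow Algorithm 3.3 of NW; the function name is ours):

```python
import numpy as np

def cholesky_with_added_identity(A, beta=1e-3, max_tries=100):
    """Find tau >= 0 and L such that A + tau*I = L L^T (i.e. A + tau*I is PD)."""
    diag_min = np.diag(A).min()
    tau = 0.0 if diag_min > 0 else -diag_min + beta    # initial shift
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(A + tau * np.eye(A.shape[0]))
            return L, tau                               # success: Cholesky factor exists
        except np.linalg.LinAlgError:                   # not PD yet: increase the shift
            tau = max(2 * tau, beta)
    raise RuntimeError("no positive definite modification found")
```

Because \(\tau\) is doubled on every failed attempt, this typically overshoots the minimal shift computed via the eigenvalues; that is the price of avoiding an eigenvalue computation.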

Quasi-Newton methods

In many situations it is expensive to compute the Hessian. Instead, quasi-Newton methods use only gradient information to construct an approximation of the Hessian. Doing so improves the convergence rate from linear (for steepest descent) to superlinear.

The intuition behind quasi-Newton methods is as follows. At the current iterate \(x_k\), we construct a quadratic model of the objective function: \( m_k(p) = f_k + \nabla f_k^Tp + \frac{1}{2} p^T B_k p \), where \(B_k\) is a symmetric PD matrix that is updated at each iteration. Of course, \(m_k\) agrees with the function value and gradient of \(f\) at \(p=0\). As in the Newton method, the minimizer of this quadratic model is \(p_k = - B_k^{-1} \nabla f_k\). Once this is found, we iterate as usual via \(x_{k+1} = x_k + \alpha_k p_k\), where \(\alpha_k\) is chosen as per the Wolfe conditions.

Now, computing \(B_k\) afresh at each iteration is expensive, so the innovation in quasi-Newton methods is to update it cheaply from one iteration to the next. To see this, consider the model at the next iteration: \(m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^Tp + \frac{1}{2} p^T B_{k+1} p \). We would like to construct \(B_{k+1}\) based on information from the previous iteration.

One possible way is to insist that \(m_{k+1}\) match the gradient of \(f\) at both \(x_k\) and \(x_{k+1}\). The latter is trivially true by construction of \(m_{k+1}\). To get the gradient at \(x_{k}\) from \(m_{k+1}(p)\), we need to substitute \(p = -\alpha_k p_k\). This gives: \(\nabla m_{k+1}(-\alpha_k p_k) = \nabla f_{k+1} - \alpha_k B_{k+1} p_k = \nabla f_k\). The first equality is from the definition of \(m_{k+1}\), the second is our demand.

Rearranging, we get: \(B_{k+1} (\alpha_k p_k) = \nabla f_{k+1} - \nabla f_k\). The notation is simplified by defining \(s_k = x_{k+1} - x_k\) and \(y_k = \nabla f_{k+1} - \nabla f_k\), to give \(B_{k+1} s_k = y_k\) (also called the secant equation). Since we want \(B_{k+1}\) to be symmetric PD, we can left-multiply by \(s_k^T\) to see that the following must hold: \(s_k^T B_{k+1} s_k = s_k^T y_k > 0 \). This condition is guaranteed to hold when \(\alpha_k\) is chosen as per the Wolfe conditions (see Ch 6.1 of NW).
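Indeed, the argument is short: the curvature (second Wolfe) condition \(\nabla f_{k+1}^T p_k \geq c_2 \nabla f_k^T p_k\) with \(0 < c_2 < 1\) gives

\begin{equation} y_k^T s_k = \alpha_k \left(\nabla f_{k+1} - \nabla f_k\right)^T p_k \geq \alpha_k (c_2 - 1) \nabla f_k^T p_k > 0, \end{equation}

since \(\alpha_k > 0\) and \(\nabla f_k^T p_k < 0\) for a descent direction (cf. Ch 6.1 of NW).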

What remains is to actually find a \(B\) that satisfies the above relation. Some approaches work with the inverse of the Hessian approximation instead, i.e. \(H = B^{-1}\), which transforms the secant relation into \(H_{k+1} y_k = s_k\).

It can be seen that while a symmetric \(B\) (or \(H\)) has \(n(n+1)/2\) free variables, the secant equation only imposes \(n\) conditions. Thus, with this wiggle room for fixing the extra degrees of freedom, different choices of \(B_{k+1}\) (or \(H_{k+1}\)) are possible. The most popular among them is the so-called BFGS update:

\( H_{k+1} = (I - \rho_k s_k y_k^T) H_k ( I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T \) with \(\rho_k = 1/(y_k^T s_k)\).
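A quick numerical check (our own code) that this update indeed satisfies the secant equation \(H_{k+1} y_k = s_k\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
H = np.eye(n)                             # any symmetric PD H_k will do
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if y @ s < 0:
    y = -y                                # enforce the curvature condition s^T y > 0
rho = 1.0 / (y @ s)

V_left  = np.eye(n) - rho * np.outer(s, y)
V_right = np.eye(n) - rho * np.outer(y, s)
H_next = V_left @ H @ V_right + rho * np.outer(s, s)
print(np.allclose(H_next @ y, s))         # True: the secant equation holds
```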

Thus, the BFGS method proceeds as a line search method: starting from some initial approximation \(H_0\), it computes the direction \(p_k = - H_k \nabla f_k\), performs a line search for an \(\alpha_k\) satisfying the Wolfe conditions, and then updates \(H_k\) via the formula above. This completes the intuition behind the BFGS method, but there are several crucial implementation details to keep in mind; the reader is encouraged to go through Ch 6.1 of NW for these.
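To make the procedure concrete, here is a compact BFGS sketch (our own illustrative code; for brevity it uses a plain backtracking line search, whereas a practical implementation should enforce the full Wolfe conditions as discussed above):

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """Minimal BFGS sketch: H_k approximates the inverse Hessian, starting from H_0 = I."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                              # quasi-Newton direction
        alpha, fx = 1.0, f(x)                   # backtracking (Armijo) line search
        while f(x + alpha * p) > fx + 1e-4 * alpha * g @ p:
            alpha *= 0.5
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:                       # skip the update if curvature is not positive
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# usage: minimize the Rosenbrock function (minimizer at (1, 1))
rosen = lambda x: 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
rosen_grad = lambda x: np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                                 200 * (x[1] - x[0]**2)])
print(bfgs(rosen, rosen_grad, np.array([-1.2, 1.0])))
```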
