In this chapter we will discuss a special class of optimization problems that occurs frequently in engineering, with the general form \(f(x)=\frac{1}{2}\sum_j r_j^2(x)\), where each "residual" term \(r_j\) is a smooth function. This coverage is based on Ch 10 of NW.

Setup

An example

Let us take an example to motivate least-squares problems and show their ubiquity. Say that we are measuring radioactive decay in the vicinity of a nuclear reactor with a Geiger counter. Let us further say that physics has given us a model of this decay of the form \(y(x,t) = x_1 + x_2 \exp(-x_3 t)\), where \(x_1,x_2,x_3\) are the unknowns of the problem that we want to determine (perhaps doing so helps us identify the radioactive element and the associated health hazards). You can imagine making several measurements \(\{y_j\}_{j=1}^m\) at various time instances, say \(\{t_j\}_{j=1}^m\), to determine these 3 unknowns. Thus, from the point of view of an optimization problem, what we really want are the model parameters that best explain the data. There is no unique way to define this "best" explanation (we will explore this more later), but a common measure is to minimize the squared error, i.e.

\begin{equation} \hat{x} = \underset{x}{\text{argmin}} \, \sum_{j=1}^m \,(y_j - y(x,t_j))^2,\quad x=\{x_1,x_2,x_3\}. \end{equation}

This problem formulation thus motivates the term "least squares" in the title. It also allows us to explain the difference between linear and nonlinear problems: if the term \(y(x,t_j)\) is an affine function of \(x\), we call the problem linear; otherwise it is nonlinear. Note that in this particular example the functional dependence on time \(t\) is unimportant, so a model of the form \(y(x,t) = x_1 + x_2 t +x_3 t^2\) still gives a linear least-squares problem in the parameters \(x\). The radioactive decay problem chosen above is clearly nonlinear. No points for guessing which problems are easier to solve!
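To make this concrete, here is a minimal sketch (our own illustration, not from NW) of solving the decay-fitting problem with SciPy on synthetic data; the names y_model and residuals, and the chosen parameter values, are assumptions made purely for the illustration.

```python
import numpy as np
from scipy.optimize import least_squares

# Decay model y(x, t) = x1 + x2 * exp(-x3 * t)
def y_model(x, t):
    return x[0] + x[1] * np.exp(-x[2] * t)

# Residuals r_j(x) = y(x, t_j) - y_j, stacked as a vector
def residuals(x, t, y):
    return y_model(x, t) - y

# Synthetic "measurements": assumed true parameters plus noise
rng = np.random.default_rng(0)
x_true = np.array([2.0, 5.0, 1.5])
t = np.linspace(0.0, 4.0, 50)
y = y_model(x_true, t) + 0.05 * rng.standard_normal(t.size)

# Nonlinear least squares: minimize 0.5 * sum_j r_j(x)^2
result = least_squares(residuals, x0=np.array([1.0, 1.0, 1.0]), args=(t, y))
print("estimated parameters:", result.x)
```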

Formalism

Let us make our system a bit more formal; doing so will allow us to frame a large number of problems, linear and nonlinear, into the same formalism. We revisit our general statement of the objective function of a least squares problem as:

\(f(x) = \frac{1}{2}\sum_{j=1}^m r_j^2(x)\), where \(x\in\mathbb{R}^n\).

As we saw above, the number of residuals, \(m\), typically corresponds to the number of data points available in an experiment. Of course, \(n\) refers to the number of variables in the problem. In the language of the example above, each residual corresponds to one measurement, \(r_j(x) = y(x,t_j) - y_j\) (the sign convention is immaterial once squared).

Usually, the scenario encountered is that \(m \geq n\), so that there is enough data available to determine the unknown \(x\). Cases where insufficient data is available, i.e. \(m<n\), are harder to solve, though they can often be solved provided some a priori information about the solution is known. A prime example of this is the area of compressive sensing.

Calculus

We need to work out convenient expressions for gradients and Hessians if we are to do some optimization on our objective function. First, define \(r(x) = [r_1(x), r_2(x),\ldots,r_m(x)]^T \), i.e. a column vector of residuals. Immediately we see that \(f(x) = \frac{1}{2}\|r(x)\|^2\). Now, see that \(r(x):\mathbb{R}^n \rightarrow \mathbb{R}^m\), thus we can compute the \(m\times n\) Jacobian \(J\) of this transformation as:

\begin{equation} J(x) = \left[\frac{\partial r_j}{\partial x_i}\right]_{j\in[1,m],\, i\in[1,n]} = \begin{bmatrix} \nabla r_1(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix} \end{equation}

Now we can build up the gradient of \(f\), starting with \(\frac{\partial f}{\partial x_i} = \sum_j r_j(x) \frac{\partial r_j(x)}{\partial x_i}\). Collecting these partial derivatives into a row vector gives \( [\frac{\partial f}{\partial x_1},\ldots,\frac{\partial f}{\partial x_n}] = r(x)^T J(x)\), leading to:

\begin{equation} \nabla f(x) = J(x)^T r(x) \end{equation}

Similarly, we can compute the Hessian by noting: \(H_{ik} = \frac{\partial^2 f}{\partial x_i \partial x_k} = \sum_j \frac{\partial r_j(x)}{\partial x_k} \frac{\partial r_j(x)}{\partial x_i} + \sum_j r_j(x) \frac{\partial^2 r_j(x)}{\partial x_i \partial x_k}\). A little more algebra will give us:

\begin{equation} \nabla^2 f(x) = J(x)^T J(x) + \sum_j r_j(x) \nabla^2 r_j(x) \end{equation}

This last expression reveals a special feature of least-squares problems: part of the Hessian is available from first-order information alone. In many cases either the residuals or their Hessians are small in magnitude, so the second term may be ignored, and many nonlinear least-squares algorithms exploit this in their solution strategies. Obviously, in linear least-squares problems the second term is identically zero.
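As a quick numerical illustration (our own, not from NW), the sketch below assembles \(\nabla f = J^T r\) and the first-order Hessian term \(J^T J\) for the decay model of the example, and compares \(J^T J\) against a finite-difference Hessian of \(f\): near a perfect fit the residuals vanish and the two agree, while away from the fit the neglected second term shows up. All helper names are ours.

```python
import numpy as np

def y_model(x, t):
    return x[0] + x[1] * np.exp(-x[2] * t)

def residuals(x, t, y):
    return y_model(x, t) - y

def jacobian(x, t):
    # Analytic Jacobian J_{ji} = d r_j / d x_i for the decay model
    e = np.exp(-x[2] * t)
    return np.column_stack([np.ones_like(t), e, -x[1] * t * e])

def grad_f(x, t, y):
    return jacobian(x, t).T @ residuals(x, t, y)       # grad f = J^T r

def fd_hessian(x, t, y, h=1e-5):
    # Finite-difference Hessian of f (central differences of the gradient), illustrative only
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        xp = x.copy(); xp[i] += h
        xm = x.copy(); xm[i] -= h
        H[:, i] = (grad_f(xp, t, y) - grad_f(xm, t, y)) / (2 * h)
    return H

x_true = np.array([2.0, 5.0, 1.5])
t = np.linspace(0.0, 4.0, 50)
y = y_model(x_true, t)                                  # noise-free data, so r(x_true) = 0

J = jacobian(x_true, t)
print("at the fit:     ", np.linalg.norm(J.T @ J - fd_hessian(x_true, t, y)))
x_far = np.array([1.0, 1.0, 1.0])
J_far = jacobian(x_far, t)
print("away from the fit:", np.linalg.norm(J_far.T @ J_far - fd_hessian(x_far, t, y)))
```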

Why sum of squares?

In passing, we mention that one motivation for choosing a sum-of-squares objective comes from statistical reasoning. In particular, if the discrepancies between the observations and the model predictions are assumed to be independent and identically distributed with a normal distribution, then minimizing the sum of squares leads to a maximum likelihood estimate.

Other metrics are also possible (and often used), such as maximum absolute value, or a sum of absolute values. These in turn are related to the \(\|r\|_{\infty}\) and \(\|r\|_1\) norms.

Linear least squares

In linear problems, the residuals \(r_j(x)\) are affine functions of \(x\) as mentioned earlier, i.e. \(r(x) = Ax - y\) where \(A\in\mathbb{R}^{m\times n}, y\in\mathbb{R}^m\). As before, the objective function is \(f(x) = \frac{1}{2} \|Ax-y\|^2\). Computing the gradient of \(f\) gives

\(\nabla f(x) = A^T(Ax-y) = A^T r(x)\).

Comparing this with the general expression for the gradient suggests that \(A = J\), the Jacobian of \(r\) (a constant matrix). Thus, we can write \(\nabla f(x) = J^T(Jx-y)\) and \(\nabla^2 f(x) = J^T J\).

Clearly the \(f\) we have here is a convex function, and so any \(x^*\) such that \(\nabla f(x^*) = 0\) is a global minimizer of \(f\). Thus \(x^*\) must satisfy:

\( J^T r(x^*) = 0 \implies J^TJx^* = J^Ty\), the so-called normal equations.

This looks familiar, and we now discuss a few ways of solving this system of equations.

Using SVD

Every matrix \(J\in\mathbb{R}^{m\times n}\) has a singular value decomposition \(J = U S V^T\), where \(U,V\) are orthogonal matrices and \(S\in\mathbb{R}^{m\times n}\) is a (rectangular) diagonal matrix with non-negative singular values on the diagonal. As is the usual convention, we assume the values are arranged such that \(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_k\), where \(k = \min(m,n)\).

Now, examine \(J^TJ\); it can be expressed as \(V S^T S V^T\). The term \(S^TS\) is an \(n\times n\) diagonal matrix with entries \(\sigma_i^2\) for \(i\in[1,k]\). Thus, in the normal equations, \( VS^TSV^T x^* = VS^TU^Ty\), we left-multiply by \(V^T\) to see that a unique solution exists if \(S^TS\) is invertible. For this to be true:

  1. Size must be s.t. \(m \geq n\),

  2. \(\sigma_n \neq 0\).

Under these conditions (also equivalent to saying that \(J\) must have full column rank), it can be seen that the solution comes out as:

\(x^* = VD^{-1} U_1^T y\), where for convenience we have partitioned \(S = \begin{bmatrix} D \\ 0 \end{bmatrix}\), with \(D\) a square diagonal matrix of size \(n\times n\), and \(U_1\) is the matrix formed by the first \(n\) columns of \(U\). A more illuminating version is:

\(x^* = \sum_{i=1}^n \frac{1}{\sigma_i} (u_i^Ty) v_i\).

This version of the solution exposes the sensitivity of \(x^*\) to perturbations in the data \(y\). In particular, when some singular values are very small (i.e. \(J\) is nearly rank deficient), any error in \(y\) gets greatly amplified in the solution \(x^*\). In such situations it might be better to truncate the summation, effectively using a truncated SVD instead of the full SVD. Thus, while we have paid the cost of computing the SVD, we have gained important sensitivity-related information.
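As an illustration (our own, not from NW), the sketch below implements the SVD-based solution, with an assumed threshold tol for the truncated variant:

```python
import numpy as np

def lstsq_svd(J, y, tol=0.0):
    # Thin SVD: J = U1 @ diag(s) @ Vt, with U1 (m x n), s (n,), Vt (n x n)
    U1, s, Vt = np.linalg.svd(J, full_matrices=False)
    # Keep only singular values above tol (truncated SVD if tol > 0)
    keep = s > tol
    # x* = sum_i (1/sigma_i) (u_i^T y) v_i over the retained terms
    return Vt[keep].T @ ((U1[:, keep].T @ y) / s[keep])

rng = np.random.default_rng(1)
J = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
x_svd = lstsq_svd(J, y)
print(np.allclose(x_svd, np.linalg.lstsq(J, y, rcond=None)[0]))  # agrees with the built-in solver
```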

Cholesky based

We can use a Cholesky decomposition based solution to solve this problem. It follows in two steps:

  1. Provided \(J\) has full column rank, \(J^TJ\) is symmetric positive definite and admits a Cholesky decomposition of the form \(J^TJ = W^TW\), where \(W\) is an \(n\times n\) upper triangular matrix with positive diagonal elements.

  2. Then, we write \(W^TW x^* = J^T y\) as \(W^T z = J^Ty\) with \(z = Wx^*\). This is easy to solve for \(z\) by triangular substitution, and once \(z\) is obtained we solve \(Wx^* = z\) in the same way.

This procedure is preferable to an LU decomposition due to its slightly lower cost. However, since we are working with \(J^TJ\), the condition number of the system is the square of that of \(J\), i.e. worse.
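A sketch of this two-step procedure (our own illustration) using SciPy's Cholesky factorization and triangular solvers:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def lstsq_cholesky(J, y):
    # Normal equations: (J^T J) x = J^T y, with J^T J = W^T W (W upper triangular)
    W = cholesky(J.T @ J, lower=False)
    z = solve_triangular(W.T, J.T @ y, lower=True)   # forward substitution: W^T z = J^T y
    return solve_triangular(W, z, lower=False)       # back substitution:    W x = z

rng = np.random.default_rng(2)
J = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
print(np.allclose(lstsq_cholesky(J, y), np.linalg.lstsq(J, y, rcond=None)[0]))
```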

Other approaches

A popular approach to avoid the squared condition number is to use the QR decomposition (refer to Ch 10.2 of NW). If the problem size is very large, then using a conjugate gradient method on the normal equations is preferred.
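For completeness, here is a thin-QR sketch (our own illustration, not the specific algorithm of NW 10.2): with \(J = QR\), where \(Q\) has orthonormal columns, the normal equations reduce to \(Rx^* = Q^Ty\), so \(J^TJ\) is never formed and the conditioning is that of \(J\) itself.

```python
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(J, y):
    # Thin QR: J = Q R with Q (m x n, orthonormal columns), R (n x n, upper triangular)
    Q, R = np.linalg.qr(J, mode='reduced')
    # J^T J x = J^T y  reduces to  R x = Q^T y
    return solve_triangular(R, Q.T @ y, lower=False)

rng = np.random.default_rng(3)
J = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
print(np.allclose(lstsq_qr(J, y), np.linalg.lstsq(J, y, rcond=None)[0]))
```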

Nonlinear least squares

Gauss-Newton method

In the case of nonlinear least-squares problems, a modification of Newton's method, the Gauss-Newton (GN) method, can be used to solve the problem.

Let us quickly recap the Newton method:

  1. Assuming the Hessian is PD, find the search direction \(p\) at the \(k\)th iteration by solving: \(\nabla^2 f(x_k) p = -\nabla f(x_k)\).

  2. Find the step length by a backtracking line search subject to the Wolfe conditions.

As we noted earlier, if we use the approximate form of the Hessian (ignoring the second term involving \(\nabla^2 r_j(x)\)), we can write \(\nabla^2 f_k \approx J_k^T J_k\). This approximation also takes care of the positive definiteness required in step (1) above, provided \(J_k\) has full column rank. Thus, the GN method proceeds by solving for the direction \(p_k\) as:

\begin{equation} J_k^TJ_k \, p_k = -J_k^T r_k. \end{equation}

Once we find \(p_k\), we select \(\alpha_k\) as per the Wolfe conditions. It is not surprising that under these conditions, along with the usual requirements that the \(r_j\)s be Lipschitz continuously differentiable in a neighbourhood of the solution and that the Jacobians have full column rank, we get global convergence, i.e. \(\nabla f(x_k) \to 0\) as \(k\to\infty\).

We now note a few important properties of the GN method:

  1. Since the only cost incurred in setting up these equations is that of computing the Jacobian, the method is quite efficient.

  2. Whenever \(J_k\) has full rank and \(\nabla f_k\) is non-zero, the obtained direction is a legitimate descent direction. For this to happen, we require \(p_k^T \nabla f_k < 0\). Substituting the expression for \(\nabla f_k\) and using the GN equation, we get \(p_k^T J_k^T r_k = p_k^T (-J_k^T J_k p_k) = - \|J_k p_k\|^2 \leq 0\). If equality is achieved here, then \(J_k p_k =0\), which in turn implies that \(J_k^T r_k = 0\), i.e. \(\nabla f_k =0\), which means \(x_k\) is a stationary point.

  3. Finally, we can make a connection with the linear least-squares problem. Note that the equation for the direction \(p_k\) can be rewritten as \(J_k^T(J_kp_k+r_k) = 0\); in other words, this is the same as \(\underset{p}{\min} \frac{1}{2}\|J_k p + r_k\|^2\). So, we solve a linear least-squares problem to find \(p_k\), then follow it up with a line search to find the step length to move towards the solution of the nonlinear least-squares problem! A sketch of such a Gauss-Newton loop is given below.
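Here is a minimal Gauss-Newton sketch (our own illustration), which uses a simple Armijo backtracking line search in place of a full Wolfe line search and solves each subproblem with a linear least-squares call; it is applied to the decay model from the start of the chapter, and all helper names are ours.

```python
import numpy as np

def gauss_newton(res, jac, x0, tol=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r, J = res(x), jac(x)
        g = J.T @ r                                    # gradient: J^T r
        if np.linalg.norm(g) < tol:
            break
        # GN direction: solve the linear least-squares subproblem min_p ||J p + r||^2
        p = np.linalg.lstsq(J, -r, rcond=None)[0]
        # Backtracking (Armijo) line search, a simplification of the Wolfe conditions
        alpha, f0 = 1.0, 0.5 * (r @ r)
        while 0.5 * np.sum(res(x + alpha * p) ** 2) > f0 + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
    return x

# Decay model example (synthetic data, as before)
def y_model(x, t):
    return x[0] + x[1] * np.exp(-x[2] * t)

rng = np.random.default_rng(4)
x_true = np.array([2.0, 5.0, 1.5])
t = np.linspace(0.0, 4.0, 50)
y = y_model(x_true, t) + 0.05 * rng.standard_normal(t.size)

res = lambda x: y_model(x, t) - y
jac = lambda x: np.column_stack([np.ones_like(t),
                                 np.exp(-x[2] * t),
                                 -x[1] * t * np.exp(-x[2] * t)])
print(gauss_newton(res, jac, x0=[1.0, 1.0, 1.0]))
```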

A note on practical implementations: if one ever needs to compute the product \(J^T Jp\), the efficient way to evaluate it is as \(J^T(Jp)\), i.e. first compute \(y=Jp\), then \(J^T y\). Thus only matrix-vector products with \(J\) (and \(J^T\)) are required, and we never compute or store the matrix-matrix product \(J^TJ\).
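This is exactly what a matrix-free conjugate gradient solve of the normal equations (mentioned above for large problems) exploits; a sketch (our own illustration) using SciPy's LinearOperator:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(5)
J = rng.standard_normal((2000, 50))
y = rng.standard_normal(2000)

# Matrix-free operator for J^T J: only matvecs with J and J^T; J^T J is never formed
n = J.shape[1]
JtJ = LinearOperator((n, n), matvec=lambda p: J.T @ (J @ p))

x, info = cg(JtJ, J.T @ y)                  # conjugate gradients on the normal equations
print("converged:", info == 0,
      " residual norm:", np.linalg.norm(J.T @ (J @ x) - J.T @ y))
```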
