In this module, we will review some of the basic aspects of the mathematical background required in this course. Having taken a formal course in Linear Algebra (at the level of Gilbert Strang’s book) is necessary. Most of the content here has been taken from the Appendices of the reference books (NW and BV).

Linear Algebra

Basics

  • A vector \(x\) in \(n\)-dimensional space, \(\mathbb{R}^n\), is represented as a column with \(n\) entries: \(x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\). The transpose, \(x^T\), is simply a row vector with the same entries, whereas the Hermitian (or conjugate transpose), \(x^H\), is a row vector with the complex-conjugated entries.

  • Similarly, the components of a matrix \(A \in \mathbb{R}^{m\times n}\) are denoted as \(A_{ij}\). An often encountered matrix is a positive definite matrix, defined as follows: \(x^TAx > 0\) for all non-zero \(x \in \mathbb{R}^n\).

  • Inner and outer product representations of a matrix product: Say that we have three matrices such that \(C=AB\). In the usual inner product representation, we can write the element \(C_{ij} = A_i \, b_j\), where \(A_i\) is row-\(i\) of \(A\), and \(b_j\) is column-\(j\) of \(B\). Another equivalent way of expressing \(C\) is to write it in outer product form: \(C = \sum_{i=1}^p a_i B_i\), where \(a_i\) is column-\(i\) of \(A\), \(B_i\) is row-\(i\) of \(B\), and \(p\) is the number of columns of \(A\) (equivalently, the number of rows of \(B\)). This outer product form plays an important role in the SVD, as we shall see later; a short numeric illustration follows.
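
For concreteness, here is a minimal sketch in Python/NumPy (the matrix sizes are arbitrary choices) contrasting the inner product and outer product views of \(C=AB\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # 4 x 3
B = rng.standard_normal((3, 5))   # 3 x 5

# Inner-product view: C_ij is the product of row i of A with column j of B.
C_inner = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                    for i in range(A.shape[0])])

# Outer-product view: C is a sum of p rank-one matrices a_k b_k^T,
# where a_k is column k of A and b_k^T is row k of B.
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

assert np.allclose(C_inner, A @ B) and np.allclose(C_outer, A @ B)
```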

The Gram matrix for a set of vectors \(\{v_i\}\) is defined as \(G_{ij}=\langle v_i,v_j\rangle\). Prove that this matrix is positive semi-definite.

By arranging the vectors \(\{v_i\}\) as the columns of a matrix \(A\), the Gram matrix can be expressed as \(G = A^TA\). Now consider the expression \(x^T G x\). Substituting the definition of \(G\), we get \(x^T G x = x^T A^T A x = \|Ax\|_2^2 = \|\sum_i x_i v_i\|_2^2 \geq 0\). When the vectors \(\{v_i\}\) are linearly independent, \(Ax \neq 0\) for any \(x\neq 0\), so the inequality is strict (i.e. \(G\) is positive definite).
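
A quick numerical sanity check of this argument in NumPy (the vectors here are random and purely illustrative): the eigenvalues of \(G\) are non-negative, and \(x^TGx\) matches \(\|Ax\|_2^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))     # columns of A are the vectors v_1, v_2, v_3
G = A.T @ A                         # Gram matrix, G_ij = <v_i, v_j>

print(np.linalg.eigvalsh(G))        # eigenvalues of the symmetric G: all >= 0 (up to round-off)

x = rng.standard_normal(3)
assert np.isclose(x @ G @ x, np.linalg.norm(A @ x) ** 2)   # x^T G x = ||Ax||_2^2
```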

Norms

On vector norms:

  • The p-norm of a vector \(x \in \mathbb{R}^n\) is defined as: \(\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}\) for \(1\leq p < \infty\). The most commonly encountered norm is for \(p=2\) (also called the Euclidean norm), whereas the \(\infty\)-norm is defined as \(\|x\|_{\infty} = \underset{i\in[1,n]}{\text{max}} \, |x_i|\).

  • Important properties of norms are:

    1. \(\|x\|\geq 0\), with \(\|x\|=0 \iff x=0\),

    2. \(\|\alpha x\| = |\alpha|\|x\|\), \(\alpha\in\mathbb{R}\), and

    3. \(\|x+z\| \leq \|x\| + \|z\|, \,\, \forall x,z\in \mathbb{R}^n\) (triangle inequality).

Why is it not a norm for \(p<1\)?

A simple counterexample shows that the triangle inequality fails for \(p<1\). For example, consider \(p=0.5\) and \(x=(1,0), z=(0,1)\). Clearly \(\|x\|=\|z\|=1\), whereas \(\|x+z\|=4 > \|x\| + \|z\|\). To see why \(p\geq 1\) is necessary, refer to a couple of proofs here.

  • Associated with any norm is its dual norm, defined as: \(\|x\|_D = \underset{\|y\|=1}{\text{max}} \, x^Ty\).
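
The counterexample above and the common vector norms are easy to check numerically; a small Python sketch (the values are chosen to match the example):

```python
import numpy as np

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

def pnorm(v, p):
    # (sum_i |v_i|^p)^(1/p); this is a valid norm only for p >= 1
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

# p = 0.5: ||x|| = ||z|| = 1 but ||x + z|| = 4, so the triangle inequality fails.
print(pnorm(x, 0.5), pnorm(z, 0.5), pnorm(x + z, 0.5))

# The usual 2- and infinity-norms via NumPy.
print(np.linalg.norm(x + z, 2), np.linalg.norm(x + z, np.inf))   # sqrt(2), 1.0
```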

On matrix norms:

  • Matrix norms that are consistent with the above defined vector norms (also called induced or operator norms) are defined as: \(\|A\| = \underset{x \neq 0}{\text{sup}} \, \frac{\|Ax\|}{\|x\|}\).

Prove that \(\|A\|_1\) is the maximum absolute column sum of the matrix \(A\)

Consider the numerator: \(\|Ax\|_1 = \sum_i | \sum_j A_{ij} x_j |\). Applying the triangle inequality to the inner summation gives \(| \sum_j A_{ij} x_j | \leq \sum_j |A_{ij}| |x_j |\). Substituting back and changing the order of summation gives \( \|Ax\|_1 \leq \sum_j |x_j| \sum_i |A_{ij}| = \sum_j |x_j| S_j\), where \(S_j = \sum_i |A_{ij}|\) denotes the sum of absolute values in column \(j\). Letting \(m\) denote the index of the column with the maximum absolute sum, we have, for all \(x\neq 0\), \( \|Ax\|_1 \leq S_m \sum_j|x_j| = S_m \|x\|_1\), which leads to \(\frac{\|Ax\|_1}{\|x\|_1} \leq S_m\). By considering an \(x\) that is non-zero only in entry \(m\), the inequality becomes an equality, i.e. \(S_m\) is attained and is therefore the least upper bound. Thus, \(S_m\) is indeed the supremum in the definition, and we get \(\|A\|_1 = S_m\).

  • The 2-norm is often quite useful and equals the largest singular value of \(A\), i.e. the largest eigenvalue of \((A^TA)^{1/2}\). Another useful norm (though not an induced norm) is the Frobenius norm, defined as: \(\|A\|_F = \left(\sum_i \sum_j |A_{ij}|^2 \right)^{1/2}\).

  • The condition number is a very useful concept in numerical applications and for a non-singular matrix \(A\) is defined as: \(\kappa(A) = \|A\| \|A^{-1}\|,\,\,\text{for any matrix norm } \|.\|\).
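
These induced norms and the condition number are all available in NumPy. A short sketch (with an arbitrary random matrix) verifying the \(\|A\|_1\) result proved above, the \(\|A\|_2\) characterization, and \(\kappa_2 = \sigma_1/\sigma_n\):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

max_col_sum = np.abs(A).sum(axis=0).max()
assert np.isclose(np.linalg.norm(A, 1), max_col_sum)          # ||A||_1 = max absolute column sum

sigma = np.linalg.svd(A, compute_uv=False)                    # singular values, in decreasing order
assert np.isclose(np.linalg.norm(A, 2), sigma[0])             # ||A||_2 = sigma_1

kappa = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
assert np.isclose(kappa, sigma[0] / sigma[-1])                # kappa_2 = sigma_1 / sigma_n
assert np.isclose(np.linalg.cond(A, 2), kappa)
```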

Subspaces

  • The subset \(S \subset \mathbb{R}^n\) is a subspace of \(\mathbb{R}^n\) if, given any two elements in it, any linear combination of them also lies in it.

  • A set of vectors \(\{s_1,s_2,\ldots,s_m\} \in \mathbb{R}^n\) is called:

    1. linearly independent if none of the vectors can be expressed as a linear combination of the others (there are other equivalent definitions).

    2. a spanning set for a set \(S\) if any vector in \(S\) can be expressed as a linear combination of the given vectors.

  • Combining the above two concepts, we call the vectors \(\{s_1,s_2,\ldots,s_m\} \in \mathbb{R}^n\) a basis of \(S\) if the vectors are linearly independent and a spanning set for \(S\). In such a case, the dimension of \(S\) is the number of elements in the basis, \(m\).

  • There are four fundamental subspaces associated with a matrix \(A \in \mathbb{R}^{m\times n}\), best visualized like so:

    | | Parent space \(\mathbb{R}^n\) | Parent space \(\mathbb{R}^m\) |
    |---|---|---|
    | Name | Row space (dim \(r\)), also Range(\(A^T\)) | Column space (dim \(r\)), also Range(\(A\)) |
    | Definition | \(\{x \mid x = A^Ty \text{ for some } y \in \mathbb{R}^m \}\) | \(\{y \mid y = Ax \text{ for some } x \in \mathbb{R}^n \}\) |
    | Name | Null space (dim \(n-r\)), also Null(\(A\)) | Left null space (dim \(m-r\)), also Null(\(A^T\)) |
    | Definition | \(\{x \mid Ax=0\}\) | \(\{y \mid A^T y=0\}\) |

    Now, the fundamental theorem of linear algebra states that the Row and Null spaces are orthogonal to each other and their direct sum is \(\mathbb{R}^n\). Similarly, the Column and Left null spaces are orthogonal to each other and their direct sum is \(\mathbb{R}^m\).
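
A small SciPy/NumPy illustration of these facts on a hypothetical rank-1 matrix: the dimension of Null(\(A\)) is \(n-r\), and every row of \(A\) (i.e. the row space) is orthogonal to the null space.

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])          # m = 2, n = 3, rank r = 1

r = np.linalg.matrix_rank(A)
N = null_space(A)                        # orthonormal basis for Null(A), shape (n, n - r)
print(r, N.shape)                        # 1, (3, 2)  ->  dim Null(A) = n - r

# Row space is orthogonal to the null space: every row of A annihilates N.
print(np.allclose(A @ N, 0))             # True
```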

Eigen and SVD analysis

  • Given a square matrix \(A\), we say that a non-zero vector \(x\) is its eigenvector with an eigenvalue \(\lambda\) if the following holds: \(Ax=\lambda x\).

  • If a matrix is symmetric, its eigenvalues are all real, and with the associated eigenvectors \(q_i\) admits the following spectral decomposition: \(A = Q\Lambda Q^T = \sum_{i=1}^n \lambda_i q_i q_i^T\), where \(Q\) is an orthogonal matrix with \(q_i\)'s as column vectors. If further, the matrix is symmetric positive definite, the eigenvalues are all positive real numbers.

Proof of the last statement

If \(A\) is symmetric, the spectral decomposition gives \(A =\sum_{i=1}^n \lambda_i q_i q_i^T\). If, further, \(A\) is positive definite, then \(x^T A x > 0\) for all \(x \neq 0\), i.e. \( \sum_i \lambda_i (q_i^T x)^2 > 0\). Choosing \(x = q_j\) gives \(x^TAx = \lambda_j > 0\), so every eigenvalue must be positive.

  • All matrices have a singular value decomposition as \(A = U\Sigma V^T\), where \(U,V\) are square orthogonal matrices and \(\Sigma\) takes the form:

    1. \(\begin{bmatrix} S \\ 0\end{bmatrix}\) if \(A\) is tall,

    2. \(\begin{bmatrix} S & 0\end{bmatrix}\) if \(A\) is fat,

    3. \(\begin{bmatrix} S \end{bmatrix}\) if \(A\) is square.

      In the above, \(S\) is a diagonal matrix with the singular values \(\sigma_i\) (each \(\geq 0\), conventionally sorted in decreasing order) on the diagonal. In outer product notation, we can also write \(A = \sum_i \sigma_i u_i v_i^T\).

  • It is easy to express the pseudoinverse of a matrix, \(A^{+}\), using the SVD: \(A^{+} = V \Sigma^{+} U^T\), where \(\Sigma^{+}\) is a diagonal matrix of the same size as \(\Sigma^T\) with entries \(\Sigma^{+}_{ii} = 1/\sigma_i\) for \(\sigma_i \neq 0\) (and \(0\) otherwise). A short numeric sketch follows at the end of this list.

  • For any matrix \(A\), we have \(\|A\|_2 = \sigma_1(A)\), the largest singular value, and for a non-singular \(A\), \(\|A^{-1}\|_2 = 1/\sigma_n(A)\), where \(\sigma_n\) is the smallest singular value. This implies that the 2-norm condition number of a matrix can be expressed in terms of the extreme singular values: \(\kappa_2 = \sigma_1/\sigma_n\).

  • For a square matrix \(A\), the following important properties hold:

    1. trace \((A)\) = \(\sum_i A_{ii} = \sum_i \lambda_i\),

    2. det \((A)\) = \(\Pi_i \lambda_i\).
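
The sketch below (random matrices, purely illustrative) checks the spectral decomposition of a symmetric matrix, the outer-product form of the SVD, and the pseudoinverse formula; it assumes all singular values are non-zero when inverting them.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B + B.T                                       # symmetric

lam, Q = np.linalg.eigh(A)                        # real eigenvalues, orthogonal Q
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)     # A = Q Lambda Q^T

M = rng.standard_normal((5, 3))                   # a tall matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)  # thin SVD
assert np.allclose(M, sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s))))

M_pinv = Vt.T @ np.diag(1.0 / s) @ U.T            # assumes all sigma_i > 0
assert np.allclose(M_pinv, np.linalg.pinv(M))
```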

Matrix factorizations

Apart from the SVD mentioned above, there are a few other commonly used matrix factorizations. Many of them first apply a permutation matrix \(P\) in order to rearrange the rows or columns of the matrix under consideration. This is usually done to improve the numerical properties of the factorization.

  • The LU decomposition of a square matrix, \(A\), is given as \(PA = L U\) where \(L\) is lower triangular and \(U\) is upper triangular. When solving a matrix equation of the form \(Ax = b\), performing a one-time factorization of \(A\) allows for multiple solves for different right hand sides \(b\).

  • When \(A\) is symmetric positive definite, the Cholesky factorization leads to \(A=LL^T\), where \(L\) is lower triangular with real and positive diagonal entries.

  • For rectangular matrices, the QR decomposition gives \(AP = QR\), where \(Q\) is orthogonal and \(R\) is upper triangular.
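
A brief SciPy sketch of the three factorizations on random matrices; note in particular how the LU factorization is computed once and reused for two right hand sides.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cholesky, qr

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
b1, b2 = rng.standard_normal(4), rng.standard_normal(4)

# One-time LU factorization (with pivoting), reused for two right-hand sides.
lu, piv = lu_factor(A)
x1, x2 = lu_solve((lu, piv), b1), lu_solve((lu, piv), b2)
assert np.allclose(A @ x1, b1) and np.allclose(A @ x2, b2)

# Cholesky factorization of a symmetric positive definite matrix.
S = A @ A.T + 4 * np.eye(4)
L = cholesky(S, lower=True)
assert np.allclose(L @ L.T, S)

# Column-pivoted QR of a rectangular matrix: A[:, P] = Q R.
M = rng.standard_normal((5, 3))
Q, R, P = qr(M, pivoting=True)
assert np.allclose(Q @ R, M[:, P])
```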

Conditioning

A problem is called well conditioned if its solution is not affected severely by small perturbations in the problem definition; otherwise, the problem is called ill-conditioned. For example, consider the matrix system given by \(Ax=b\) with solution \(x\), and a perturbed version of the system given by \(\hat{A}\hat{x}=\hat{b}\) with solution \(\hat{x}\). The condition number controls the change in the new solution like so:

\[\frac{\|x-\hat{x}\|}{\|x\|} \leq \kappa(A)\left[ \frac{\|A-\hat{A}\|}{\|A\|} + \frac{\|b-\hat{b}\|}{\|b\|} \right].\]
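
As a concrete (purely illustrative) demonstration, the Hilbert matrix is famously ill-conditioned: a tiny perturbation of \(b\) produces a much larger relative change in the solution, consistent with the bound above.

```python
import numpy as np
from scipy.linalg import hilbert

n = 10
A = hilbert(n)                                # notoriously ill-conditioned
x_true = np.ones(n)
b = A @ x_true
print(np.linalg.cond(A))                      # very large (~1e13)

# Perturb b slightly and re-solve.
b_hat = b + 1e-10 * np.random.default_rng(1).standard_normal(n)
x_hat = np.linalg.solve(A, b_hat)
print(np.linalg.norm(x_true - x_hat) / np.linalg.norm(x_true))   # relative error >> 1e-10
```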

Analysis

Sequences

Let \(\{x_k\}\) be a sequence of points in \(\mathbb{R}^n\). We make the following points:

  • The sequence is said to converge to some point \(x\) (expressed as \(\lim_{k\rightarrow\infty} x_k = x\)) if for any \(\epsilon>0\), there is an index \(K\) such that \(\|x_k - x\|\leq \epsilon\) for all \(k\geq K\).

  • Given an index set \(\mathcal{S}\subset\{1,2,3,\ldots\}\), a subsequence of \(\{x_k\}\) is denoted by \(\{x_k\}_{k\in\mathcal{S}}\). A point \(\hat{x}\) is called an accumulation point (or limit point) of \(\{x_k\}\) if there is an infinite set of indices \(k_1,k_2,\ldots\) such that the subsequence \(\{x_{k_i}\}_{i=1,2,\ldots}\) converges to \(\hat{x}\). A bounded sequence converges iff it has exactly one limit point.

An example

Consider the sequence given by: \(1,\frac{1}{3},2,1,\frac{1}{9},2,1,\frac{1}{81},2,\ldots\). From this, we can cull out three limit points, \(\hat{x} = 1, \hat{x} = 0, \hat{x} = 2\), corresponding to three different subsequences. On the other hand, for the sequence given by \(x_k = \cos(k)\), every point in the interval \([-1,1]\) is a limit point.

  • A sequence is said to be a Cauchy sequence if for any \(\epsilon>0\) there exists an integer \(K\) such that \(\|x_k-x_l\|\leq \epsilon\) for all indices \(k\geq K\) and \(l \geq K\). A sequence converges iff it is a Cauchy sequence.

Moving on to the case of scalar sequences \(\{t_k\} \subset \mathbb{R}\), we establish some important terminology.

  • A sequence is bounded above if there is a scalar \(u\) s.t. \(t_k \leq u \, \forall k\) (and an analogous definition for being bounded below follows).

  • A nondecreasing sequence is one where \(t_{k+1} \geq t_k \, \forall k\). If a sequence is nondecreasing and bounded above, it converges, i.e. \(\lim_{k\rightarrow \infty} t_k = t\) for some \(t\). Analogously, a nonincreasing sequence that is bounded below converges.

  • The supremum of a sequence \(\{t_k\}\) is the smallest number \(u\) such that \(t_k \leq u\) for all \(k\), denoted as sup \(\{t_k\}\). Similarly, the infimum of the sequence is the largest number \(v\) such that \(t_k \geq v\) for all \(k\), denoted as inf \(\{t_k\}\).

Convergence rates

We note three popular ways of characterizing the convergence of a sequence \(\{x_k\}\) that converges to \(x^*\) (below, "Q" stands for "quotient", although it is often dropped in common parlance). Convergence is:

  1. Q-linear if \(\frac{\|x_{k+1}-x^*\|}{\|x_k-x^*\|} \leq r\) for some constant \(r\in(0,1)\) and all \(k\) sufficiently large,

  2. Q-superlinear if \(\lim_{k\rightarrow\infty} \frac{\|x_{k+1}-x^*\|}{\|x_k-x^*\|} =0\),

  3. Q-quadratic if \(\frac{\|x_{k+1}-x^*\|}{\|x_k-x^*\|^2} \leq M\) for all \(k\) sufficiently large, where \(M\) is a positive constant (not necessarily less than 1).

Example: What is the convergence rate of the sequence \(x_k = 1/k!\)?

It is clear that this sequence converges to 0, i.e. \(x^* = 0\). Consider the ratio \(x_{k+1}/x_k = 1/(k+1)\). As \(k \to \infty\), \(x_{k+1}/x_k \to 0\), so the convergence is Q-superlinear. But is it also Q-quadratic? Consider the ratio \(x_{k+1}/x_k^2 = k!/(k+1)\). As \(k \to \infty\), this ratio \(\to \infty\), and thus the convergence is not Q-quadratic.
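
The two ratios used in this argument are easy to tabulate numerically; a short Python sketch:

```python
import math

x = [1.0 / math.factorial(k) for k in range(1, 12)]   # x_k = 1/k!, converging to x* = 0

for k in range(len(x) - 1):
    lin_ratio = x[k + 1] / x[k]         # -> 0, so convergence is Q-superlinear
    quad_ratio = x[k + 1] / x[k] ** 2   # = k!/(k+1) -> infinity, so not Q-quadratic
    print(f"k={k + 1:2d}  linear ratio={lin_ratio:.3e}  quadratic ratio={quad_ratio:.3e}")
```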

Topology

Here are different nomenclatures regarding a set \(\mathcal{F}\):

  • It is bounded if for some real number \(M\), we have \(\|x\|\leq M\) for all \(x\in\mathcal{F}\).

  • It is open if for every \(x\in\mathcal{F}\), we can find an \(\epsilon > 0\) such that a ball around \(x\) lies within \(\mathcal{F}\), i.e. \(\{y \mid \|y-x\| \leq \epsilon\} \subset \mathcal{F}\).

  • It is closed if for all possible sequences \(\{x_k\} \in \mathcal{F}\), all limit points are elements of \(\mathcal{F}\).

  • The interior, denoted by int \(\mathcal{F}\), is the largest open set contained in \(\mathcal{F}\). Similarly, the closure, denoted by cl \(\mathcal{F}\), is the smallest closed set containing \(\mathcal{F}\). Obviously, if \(\mathcal{F}\) is open, then int \(\mathcal{F}\) = \(\mathcal{F}\), whereas if it is closed, then cl \(\mathcal{F}\) = \(\mathcal{F}\). Additionally, the union of finitely many closed sets is closed, while any intersection of closed sets is closed. The intersection of finitely many open sets is open, while any union of open sets is open.

  • It is compact if every sequence \(\{x_k\}\) in \(\mathcal{F}\) has at least one limit point, and all such limit points are in \(\mathcal{F}\). An important result in topology states that if \(\mathcal{F}\subset\mathbb{R}^n\) is closed and bounded, then \(\mathcal{F}\) is compact.

  • The neighbourhood of a point \(x\) is an open set containing it. For example, the open ball of radius \(\epsilon\) around \(x\) is given by \(\mathbb{B}(x,\epsilon) = \{y \mid \|y-x\|< \epsilon\}\).

Convex sets

We start with some definitions regarding affine sets:

  • Given two points \(x_1,x_2\in\mathbb{R}^n\) and \(\theta \in \mathbb{R}\), points of the form \(y = \theta x_1 + (1-\theta) x_2\) form a line passing through \(x_1,x_2\).

  • A set \(\mathcal{S}\subseteq \mathbb{R}^n\) is affine if the line through any two distinct points in \(\mathcal{S}\) lies in \(\mathcal{S}\). Generalizing to more than 2 points, we say that the point given by \(\sum_i \theta_i x_i\), where \(\sum_i \theta_i =1\), is an affine combination of the points \(\{x_i\}\).

Prove that we get an affine set by adding an offset to a vector space.

Among the various properties of a vector space is one which states that the origin must belong to the space. Say that we have a vector space \(V\) and a vector \(b \notin V\). Then it is clear that the set given by \(S = \{x+b \mid x\in V\}\) is not a vector space, because the origin does not belong to it. However, is it an affine set? Consider two points \(y_1 = x_1 + b\) and \(y_2 = x_2 + b\) in \(S\), with \(x_1,x_2 \in V\). Now, consider the combination: \(\theta y_1 + (1-\theta) y_2 = \theta x_1 + (1-\theta) x_2 + b\). Since \(x_1,x_2 \in V\), so also is their combination, \(x'=\theta x_1 + (1-\theta) x_2\). Thus, the overall combination \(x'+b\) belongs to \(S\), and the set is affine.

  • The set of all affine combinations of points in some set \(\mathcal{S} \subseteq \mathbb{R}^n\) is called the affine hull of \(\mathcal{S}\), denoted as: aff \(\mathcal{S} = \{\sum_i \theta_i x_i \, | \, \{x_i\} \in \mathcal{S}, \, \sum_i \theta_i = 1\}\). It is the smallest affine set that contains \(\mathcal{S}\).

What is the affine hull of the set given as \(S=\{x\in\mathbb{R}^3|x_3 \geq \sqrt{x_1^2+x_2^2}\}\)?

aff \(\mathcal{S} = \mathbb{R}^3\)

Moving on to convex sets:

  • A set \(\mathcal{S}\) is convex if the line segment between any two points in \(\mathcal{S}\) lies in \(\mathcal{S}\), i.e. for \(x_1,x_2 \in \mathcal{S}\) and any \(0\leq \theta\leq 1\), we have \(\theta x_1 + (1-\theta) x_2 \in \mathcal{S}\). See Fig 2.2 of BV.

Show that the hyperbolic set given by \(S=\{x\in\mathbb{R}_{+}^2|x_1x_2 \geq 1 \}\) is convex.

  • We say that the point given by \(\sum_i \theta_i x_i\) where \(\sum_i \theta_i =1\) and \(\theta_i \geq 0\) is a convex combination of the points \(\{x_i\}\). This leads to the definition of the convex hull of a set \(\mathcal{S}\) as the set of all convex combinations of points in the set, denoted as conv \(\mathcal{S}\).

Now we look at cones:

  • A set \(\mathcal{S}\) is called a cone if for every \(x\in \mathcal{S}\) and \(\theta \geq 0\), we have that \(\theta x \in \mathcal{S}\).

  • A set \(\mathcal{S}\) is called a convex cone if it is convex and a cone, i.e. for every \(x_1, x_2 \in \mathcal{S}\) and \(\theta_1, \theta_2 \geq 0\), we have that \(\theta_1 x_1 + \theta_2 x_2 \in \mathcal{S}\). See Fig. 2.4 of BV.

Convex functions

A function \(f\) is a convex function if its domain is a convex set and given any two points \(x,y\) in the domain, the following property holds:

\(f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha) f(y)\), for all \(\alpha \in [0,1]\).

Graphically, this means that given two points, the graph of the function never lies above the line segment connecting the two points. Simple examples of convex functions are linear (affine) functions, \(f(x) = a^Tx+b\), for a constant vector \(a\) and scalar \(b\), or the quadratic function \(f(x) = x^THx\), where \(H\) is a symmetric positive semi-definite matrix.

Show the convexity of the expression given by \(f(x) = x^TAx + b^Tx +c\) where \(A\) is a symmetric positive semi-definite matrix.

We can easily prove this using the definition of convexity. Consider any two vectors \(x,y\) and their midpoint \((x+y)/2\). Then, \( (f(x) + f(y))/2 - f((x+y)/2) = \frac{1}{4}(x-y)^TA(x-y) \geq 0 \) by the positive semi-definiteness of \(A\) (the linear and constant terms cancel). The same calculation for a general convex combination \(\theta x + (1-\theta)y\) gives a gap of \(\theta(1-\theta)(x-y)^TA(x-y) \geq 0\), which establishes convexity.
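
A quick numerical check of this computation in NumPy (random PSD \(A\), random points, random \(\theta\), all purely illustrative); the convexity gap matches \(\theta(1-\theta)(x-y)^TA(x-y)\).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
B = rng.standard_normal((n, n))
A = B.T @ B                      # symmetric positive semi-definite
b, c = rng.standard_normal(n), 1.0

f = lambda x: x @ A @ x + b @ x + c

x, y = rng.standard_normal(n), rng.standard_normal(n)
theta = rng.uniform()
lhs = f(theta * x + (1 - theta) * y)
rhs = theta * f(x) + (1 - theta) * f(y)
print(lhs <= rhs + 1e-12)        # True: the convexity inequality holds
assert np.isclose(rhs - lhs, theta * (1 - theta) * (x - y) @ A @ (x - y))
```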

Continuity

The notation for a function is as follows: \(f : A \rightarrow B\), which means \(f\) is a function on the set \(\mathcal{D} = \) dom \(f \subseteq A\) into the set \(B\).

  • A function \(f\) is continuous at \(x\in\) dom \(f\) if for all \(\epsilon>0\) there exists a \(\delta\) s.t.:

    \(y\in\) dom \(f\), \(\|y-x\|_2\leq \delta \implies \|f(y)-f(x)\|_2 \leq \epsilon\).

    Further, a function is continuous if it is continuous at every point in its domain.

  • In the above definition, the constant \(\delta\) depends on both \(\epsilon\) and \(x\) in general. In case we can find a \(\delta\) that depends only on \(\epsilon\) (and works for every \(x\) in the domain), then we term the function uniformly continuous.

  • The function \(f\) is said to be Lipschitz continuous on some set \(\mathcal{N} \subset \mathcal{D}\) if there is a constant \(L\) such that

    \(\|f(x)-f(y)\| \leq L\|x-y\|\), for all \(x,y \in \mathcal{N}\).

See this clarifying note with examples on these notions of continuity.

Comment on the continuity properties of the function \(f(x)=\sqrt{1-x^2}\) for \(x\in[-1,1]\)

Consider two points \(x,y \in [-1,1]\) with \(|x-y|\leq \delta\). Note that \(|\sqrt{1-x^2} -\sqrt{1-y^2}|^2 \leq |\sqrt{1-x^2} -\sqrt{1-y^2}| \, |\sqrt{1-x^2} + \sqrt{1-y^2}| = | x^2 - y^2 |\), since when \(a,b\geq 0\), then \(|a-b| \leq |a+b|\). Thus, \(|f(x) - f(y)| \leq \sqrt{| x^2 - y^2 |} = \sqrt{|x-y|\,|x+y|} \leq \sqrt{2|x-y|} \leq \sqrt{2\delta}\). Hence, given any \(\epsilon>0\), choosing \(\delta = \epsilon^2/2\) (which depends only on \(\epsilon\)) guarantees \(|f(x)-f(y)| \leq \epsilon\). Thus the function is not only continuous, but also uniformly continuous.

However, is it Lipschitz continuous? We need to find a constant \(L\) such that \(|f(x)-f(y)| \leq L |x-y|\) for all \(x,y\in[-1,1]\). Choose \(y=1\); then we need \(|f(x)|=\sqrt{1-x^2} \leq L |x-1|\) for all \(x\in[-1,1]\). Now consider what happens as \(x\to 1^{-}\): \(\frac{\sqrt{1-x^2}}{1-x} = \frac{\sqrt{1+x}}{\sqrt{1-x}} \to +\infty\). As a result, no value of \(L\) can satisfy the Lipschitz condition, and the function is not Lipschitz continuous on \([-1,1]\); a small numeric check follows.
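
As a quick illustration of the blow-up (a tiny Python check):

```python
import numpy as np

f = lambda x: np.sqrt(1.0 - x ** 2)

# Ratio |f(x) - f(1)| / |x - 1| as x -> 1 from the left: it grows without bound,
# so no single Lipschitz constant L works on [-1, 1].
for x in [0.9, 0.99, 0.999, 0.9999]:
    print(x, abs(f(x) - f(1.0)) / abs(x - 1.0))
```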

Derivatives

Consider a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\).

  • The function \(f\) is called (Frechet) differentiable at \(x\) if there exists a vector \(g\in \mathbb{R}^n\) such that

    \(\lim_{y\to 0} \frac{f(x+y)-f(x) - g^Ty}{\|y\|} = 0 \), for any vector norm \(\|.\|\). When \(g\) exists, it is called the gradient of \(f\) at \(x\), denoted as

    \(\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\, \ldots \, \frac{\partial f}{\partial x_n} \end{bmatrix}^T \). The gradient w.r.t. a subset of variables is denoted with a subscript to the \(\nabla\) symbol, e.g. \(\nabla_z\) to denote the gradient only w.r.t. the \(z\) variable.

  • A function is differentiable on a domain if \(\nabla f(x)\) exists for all \(x\) in the domain. It is further called continuously differentiable if \(\nabla f(x)\) is a continuous function of \(x\).

  • When \(f\) is vector valued, for e.g. \(f:\mathbb{R}^n\to \mathbb{R}^m\), the following definitions hold:

    We define \(\nabla f(x)\) to be a \(m \times n\) matrix whose \(i^{th}\) row is \(\nabla f_i(x)^T\). This matrix is called the Jacobian \(J(x)\) and \(J_{ij} = \partial f_i(x)/\partial x_j\). (Be careful as some references define the Jacobian as the transpose of the above definition).

  • The usual chain rule of differentiation is applicable as follows. Let \(x\) depend on \(t\), i.e. \(x = x(t)\). In this case, with \(h(t) = f(x(t))\), we get: \(\nabla h(t) = \sum_i \frac{\partial f}{\partial x_i} \frac{\partial x_i}{\partial t} = \nabla x(t) \nabla(f(x(t)))\). In the final expression, interpret \(\nabla x(t)\) as a row vector.

An example of a multidimensional chain rule

We are given a function \(f: \mathbb{R}^3 \to \mathbb{R}, f(u,v,w) = uv + vw - uw\), and another function \(g: \mathbb{R}^2 \to \mathbb{R}^3, g(x,y) = (x+y,x+y^2,x^2+y)\). Compute the gradient of the composition \((f\circ g)(x,y)\).

The composition looks like so: \(f(u(x,y),v(x,y),w(x,y))\). Since this is overall a function of \(x,y\), we require partial derivatives w.r.t. each of them. Using the chain rule for each of them; \(\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x} + \frac{\partial f}{\partial w}\frac{\partial w}{\partial x}\).

Similarly: \(\frac{\partial f}{\partial y} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial y} + \frac{\partial f}{\partial w}\frac{\partial w}{\partial y}\).

A more elegant way to write this is:

\( \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} \frac{\partial f}{\partial u} & \frac{\partial f}{\partial v} & \frac{\partial f}{\partial w}\end{bmatrix} \begin{bmatrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} \\ \frac{\partial w}{\partial x} & \frac{\partial w}{\partial y} \end{bmatrix} \), i.e. a \(1\times 3\) row of partials of \(f\) times the \(3\times 2\) Jacobian of \(g\).

We must substitute everything in terms of \(x,y\), e.g. for terms such as \( \frac{\partial f}{\partial u} = v - w = (x+y^2) - (x^2 + y)\), and so on.
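
A small Python sketch of this example: the gradient from the chain rule (row of partials of \(f\) times the Jacobian of \(g\)) agrees with a finite-difference approximation at an arbitrary point.

```python
import numpy as np

f = lambda u, v, w: u * v + v * w - u * w
g = lambda x, y: np.array([x + y, x + y ** 2, x ** 2 + y])

def grad_fg(x, y):
    u, v, w = g(x, y)
    grad_f = np.array([v - w, u + w, v - u])          # [df/du, df/dv, df/dw]
    J_g = np.array([[1.0, 1.0],
                    [1.0, 2.0 * y],
                    [2.0 * x, 1.0]])                   # Jacobian of g w.r.t. (x, y)
    return grad_f @ J_g                                # [d(f o g)/dx, d(f o g)/dy]

# Compare against a central finite-difference approximation.
x0, y0 = 0.7, -1.3
h = 1e-6
fd = np.array([
    (f(*g(x0 + h, y0)) - f(*g(x0 - h, y0))) / (2 * h),
    (f(*g(x0, y0 + h)) - f(*g(x0, y0 - h))) / (2 * h),
])
print(grad_fg(x0, y0), fd)       # the two should agree to high accuracy
```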

A quadratic form \(\phi(x) : \mathbb{R}^n \to \mathbb{R}\) is defined as \(\phi(x) = x^T A x + b^T x + c\). Its gradient is given by \(\nabla \phi(x) = (A^T +A)x + b\).

Prove the above statement

Let’s start with the linear term, \(b^T x = \sum_i b_i x_i\) (which is a scalar). By using the definition of gradient (take derivatives w.r.t. \(x_k\) one at a time), it is easy to see that \(\nabla(b^T x) = b\).

So far we have computed the gradient of a scalar. We now investigate the gradient of a vector valued function, for e.g. \(f(x) = Ax\) where \(A\in\mathbb{R}^{n\times n}\). By using the earlier stated definition of the Jacobian, and expanding each element of \(Ax\) in terms of \(A_{ij}, x_{j}\), we get that \(\nabla(Ax) = A\). Trivially, we also have \(\nabla(x) = I\), the identity matrix.

For the quadratic term, we use the "product rule" of calculus. Say we have a function \(f(g(x),h(x)) = g^T(x)h(x)\). Then, \(\nabla f = (\nabla g)^T h + (\nabla h)^T g\), where \(\nabla g, \nabla h\) denote the Jacobians defined earlier. With \(g(x) = x\) and \(h(x) = Ax\), we get: \(\nabla(x^T Ax) = (I)^T(Ax) + (A)^T (x) = (A + A^T)x\).

Putting it all together, we get \(\nabla \phi(x) = (A^T +A)x + b\). To see an even more explicit derivation, look here.
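
As an independent check of the formula \(\nabla \phi(x) = (A^T+A)x + b\), a short NumPy sketch compares it against central finite differences at a random point (all values arbitrary).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
A = rng.standard_normal((n, n))          # not necessarily symmetric
b = rng.standard_normal(n)
c = 0.5

phi = lambda x: x @ A @ x + b @ x + c
grad_phi = lambda x: (A + A.T) @ x + b

# Central finite differences along each coordinate direction.
x0 = rng.standard_normal(n)
h = 1e-6
fd = np.array([(phi(x0 + h * e) - phi(x0 - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(fd, grad_phi(x0), atol=1e-5))   # True
```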

  • The directional derivative of a function in a direction \(p\) is given by the definition: \(D(f(x),p) = \lim_{\epsilon\to 0} \frac{f(x+\epsilon p) - f(x)}{\epsilon}\).

    When \(f\) is continuously differentiable in a neighbourhood of \(x\), we get \(D(f(x),p) = \nabla f(x)^T p\). However, it is useful to note that the directional derivative may be well defined even when \(f\) is not continuously differentiable.

Find the derivative of the function \(f(x,y) = [1+x^2+2xy,\, y^2]^T\) at \(P=(4,3)\) in the direction towards the origin.

For the sake of understanding, let us first consider the case of a scalar valued function by considering the first component of \(f\), i.e. \(f_1 = 1 + x^2 + 2xy\). Here, \(\nabla(f_1) = [2x + 2y,\, 2x]^T = [14, \, 8]^T\). The unit vector in the direction being considered is \(p = [-4/5, \,-3/5]^T\). Thus, the directional derivative is simply \(-16\). It is easy to extend this idea now if we consider both components of \(f\) (left as an exercise). Note that this time, the directional derivative will be a vector.

  • The Hessian is a matrix of second partial derivatives of \(f\) and is defined as:

    \(\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \\ \end{bmatrix}\).

  • Mean value theorem: In the univariate case, given a continuously differentiable function \(f : \mathbb{R} \to \mathbb{R}\), and two real numbers \(x,y\) satisfying \(x<y\), then \(f(y) = f(x) + f'(z)(y-x)\), for some \(z\in(x,y) \).

    The theorem extends to the multivariate case \(f: \mathbb{R}^n \to \mathbb{R}\): for any vector \(p\), we have \(f(x+p) = f(x) + \nabla f(x + \alpha p)^Tp\), for some \(\alpha \in (0,1)\).

  • The above theorem also has an integral counterpart. From the fundamental theorem of calculus, given a continuous function \(f : \mathbb{R} \to \mathbb{R}\) defined over \([a,b]\), and \(F\) defined as \(F(x) = \int_a^x f(t)dt\), then \(F\) is uniformly continuous on \([a,b]\), differentiable on \((a,b)\), and \(F'(x) = f(x)\). Equivalently, for a continuously differentiable \(f\), we can write \(f(y) - f(x) = \int_x^y f'(t) dt\). Performing a change of variables, \( t \rightarrow x+p(y-x)\), leads to \(f(y) - f(x) = (y-x) \int_0^1 f'(x+p(y-x)) dp\). This is referred to as the integral version of the MVT (compare with the original \( f(y) - f(x) = (y-x) f'(z)\) for some \(z\in(x,y)\)).

    In the multivariate case, where \(f: \mathbb{R}^n \to \mathbb{R}\), for vectors \(x,p\), we get: \(f(x+p) = f(x) + \int_0^1 \nabla f(x + tp)^Tp \,dt\).

Programming

Familiarity with programming languages such as MATLAB and Python is expected. My own preference is towards MATLAB because it allows very easy visualization and code profiling, and gives you access to several other MATLAB toolboxes, such as the Antenna, 5G, or DSP toolboxes, in case you want to formulate optimization problems in those domains.

Moreover, as students of IITM you have full access to it, whether via a desktop installation or usage via a browser (be sure to sign up using your smail ID). A quick introduction to MATLAB can be seen here https://youtu.be/83SUbkEe5II.
