Lecture 22: Dimension reduction

Author

Jamie Haddock

Rayleigh quotient

For a matrix \(\mathbf{A}\) and compatible vector \(\mathbf{x}\), \(\mathbf{x}^* \mathbf{A} \mathbf{x}\) is scalar-valued. This type of function is called a quadratic form.

Definition: Rayleigh quotient

Given symmetric (Hermitian) \(\mathbf{A}\) and nonzero vector \(\mathbf{x}\), the Rayleigh quotient is the function

\[R_{\mathbf{A}}(\mathbf{x}) = \frac{ \mathbf{x}^* \mathbf{A} \mathbf{x}}{\mathbf{x}^* \mathbf{x}}.\]

If \(\mathbf{v}\) is an eigenvector such that \(\mathbf{A} \mathbf{v}=\lambda \mathbf{v}\), then one easily calculates that \(R_{\mathbf{A}}(\mathbf{v})=\lambda.\) That is, the Rayleigh quotient maps an eigenvector into its associated eigenvalue.
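Indeed, substituting \(\mathbf{A}\mathbf{v}=\lambda\mathbf{v}\) into the definition gives

\[R_{\mathbf{A}}(\mathbf{v}) = \frac{\mathbf{v}^* \mathbf{A} \mathbf{v}}{\mathbf{v}^* \mathbf{v}} = \frac{\mathbf{v}^* (\lambda\mathbf{v})}{\mathbf{v}^* \mathbf{v}} = \lambda\,\frac{\mathbf{v}^* \mathbf{v}}{\mathbf{v}^* \mathbf{v}} = \lambda.\]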

If \(\mathbf{A}^*=\mathbf{A}\), then the Rayleigh quotient has another interesting property: \(\nabla R_{\mathbf{A}}(\mathbf{v})=\boldsymbol{0}\) if \(\mathbf{v}\) is an eigenvector. By a multidimensional Taylor series, then,

\[R_{\mathbf{A}}(\mathbf{v}+\epsilon\mathbf{z}) = R_{\mathbf{A}}(\mathbf{v}) + 0 + O( \epsilon^2) = \lambda + O( \epsilon^2),\]

as \(\epsilon\to 0\).
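To see why the gradient vanishes, restrict to real \(\mathbf{x}\) and use \(\mathbf{A}^\top=\mathbf{A}\); differentiating the quotient directly gives

\[\nabla R_{\mathbf{A}}(\mathbf{x}) = \frac{2}{\mathbf{x}^\top \mathbf{x}} \left( \mathbf{A}\mathbf{x} - R_{\mathbf{A}}(\mathbf{x})\,\mathbf{x} \right),\]

which is zero precisely when \(\mathbf{A}\mathbf{x} = R_{\mathbf{A}}(\mathbf{x})\,\mathbf{x}\), that is, when \(\mathbf{x}\) is an eigenvector.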

A good estimate of an eigenvector becomes an even better estimate of an eigenvalue!


Code
using LinearAlgebra
n = 20;
λ = 1:n
D = diagm(λ)
V = Matrix(qr(randn(n,n)).Q)   # get a random orthogonal V (as an explicit matrix)
A = V*D*V';
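As a quick sanity check (an addition to the original code), \(\mathbf{A}\) should be symmetric up to roundoff and have eigenvalues \(1,\dots,n\):

Code
A ≈ A'                        # symmetric up to roundoff
eigvals(Symmetric(A)) ≈ 1:n   # eigenvalues are (numerically) 1,…,n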
Code
R = x -> (x'*A*x)/(x'*x);  # define RQ
Code
R(V[:,7])   # evaluating the RQ at an evector gives corresponding evalue
7.000000000000001
Code
# RQ is within O(δ²) of the evalue, given a size-δ perturbation to the evector
using PrettyTables
δ = @. 1 ./ 10^(1:5)
eval_diff = zeros(size(δ))
for (k,delta) in enumerate(δ)
    e = randn(n);  e = delta*e/norm(e);
    x = V[:,7] + e
    eval_diff[k] = R(x) - 7
end
labels = ["perturbation δ","δ²","R(x) - λ"]
pretty_table([δ δ.^2 eval_diff], header=labels)
┌────────────────┬─────────┬─────────────┐
│ perturbation δ │      δ² │    R(x) - λ │
├────────────────┼─────────┼─────────────┤
│            0.1 │    0.01 │   0.0275755 │
│           0.01 │  0.0001 │  0.00036161 │
│          0.001 │  1.0e-6 │  6.99687e-6 │
│         0.0001 │  1.0e-8 │  3.39643e-8 │
│         1.0e-5 │ 1.0e-10 │ 2.85269e-10 │
└────────────────┴─────────┴─────────────┘
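As a rough check of the order (an addition, reusing δ and eval_diff from above), a least-squares fit of \(\log |R(\mathbf{x})-\lambda|\) against \(\log \delta\) should produce a slope close to 2:

Code
# fit a line on log-log axes; a slope near 2 indicates O(δ²) behavior
slope = ([ones(length(δ)) log.(δ)] \ log.(abs.(eval_diff)))[2]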

Definiteness

In the real case, we called a symmetric matrix \(\mathbf{A}\) symmetric positive definite (SPD) if \(\mathbf{x}^\top \mathbf{A}\mathbf{x} > 0\) for all nonzero vectors \(\mathbf{x}\).

Theorem:

If \(\mathbf{A}^\top=\mathbf{A}\), then the following statements are equivalent.

  1. \(\mathbf{A}\) is SPD.
  2. The eigenvalues of \(\mathbf{A}\) are positive numbers.
Proof:

Suppose item 1 is true. If \(\mathbf{A}\mathbf{x} = \lambda \mathbf{x}\) is an eigenpair with \(\mathbf{x}\neq\boldsymbol{0}\), then \(\mathbf{x}^\top \mathbf{A}\mathbf{x} = \lambda\, \mathbf{x}^\top \mathbf{x} = \lambda \|\mathbf{x}\|^2\). The left-hand side is positive by item 1 and \(\|\mathbf{x}\|^2>0\), so \(\lambda > 0\). Hence item 2 is true.

Conversely, suppose item 2 is known. Then we can write the EVD as \(\mathbf{A}=\mathbf{V}\mathbf{S}^2\mathbf{V}^*\), where the \(S_{ii}\) are positive square roots of the eigenvalues. Hence

\[ \mathbf{x}^*\mathbf{A}\mathbf{x} = \mathbf{x}^*\mathbf{V}\mathbf{S}^2\mathbf{V}^*\mathbf{x} = \|\mathbf{S}\mathbf{V}^*\mathbf{x}\|^2 > 0, \]

as both \(\mathbf{S}\) and \(\mathbf{V}\) are invertible. Thus, item 1 is true.
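Here is a quick numerical illustration of this equivalence (an addition; isposdef tests positive definiteness via an attempted Cholesky factorization):

Code
B = randn(6,6)
S = Symmetric(B'*B + I)    # B'*B is positive semidefinite; adding I makes it positive definite
isposdef(S), all(eigvals(S) .> 0)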

For an SPD matrix, the EVD \(\mathbf{A}=\mathbf{V}\mathbf{D}\mathbf{V}^*\) meets all the requirements of the SVD, provided the ordering of eigenvalues is chosen appropriately. They coincide!
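For instance, the \(20\times 20\) matrix \(\mathbf{A}\) built above is SPD, since its eigenvalues are \(1,\dots,20\); its singular values are the same numbers in decreasing order (a check added here, not in the original):

Code
svdvals(A) ≈ sort(eigvals(Symmetric(A)), rev=true)   # both give 20,19,…,1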

A symmetric matrix with all negative eigenvalues is called negative definite, and one with eigenvalues of both signs is indefinite. Finally, if one or more eigenvalues are zero and the rest share a single sign, the matrix is positive or negative semidefinite, according to that sign.
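These categories can be read off from the signs of the eigenvalues. The helper below is purely illustrative (the name definiteness and the tolerance are ours, not standard):

Code
# classify a symmetric matrix by the signs of its eigenvalues (illustrative helper)
function definiteness(A; tol=1e-12)
    λ = eigvals(Symmetric(A))
    if all(λ .> tol)
        "positive definite"
    elseif all(λ .< -tol)
        "negative definite"
    elseif all(λ .> -tol)
        "positive semidefinite"
    elseif all(λ .< tol)
        "negative semidefinite"
    else
        "indefinite"
    end
end
definiteness(A), definiteness(-A), definiteness(A - 10.5I)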

Dimension reduction

The SVD provides an optimal approximation of a given matrix by one of lower rank, which proves very useful in a variety of applications.

SVD Optimality

Let \(\mathbf{A}\) be a real \(m\times n\) matrix with SVD \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^T\) and (momentarily) \(m\ge n\). Another way of writing the thin form of the SVD is

\[\begin{split} \mathbf{A} = \hat{\mathbf{U}}\hat{\mathbf{S}}\mathbf{V}^T &= \begin{bmatrix} \rule[-0.3em]{0pt}{1em} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_n \end{bmatrix} \: \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix} \: \begin{bmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{bmatrix}\ \\ &= \begin{bmatrix} \rule[-0.3em]{0pt}{1em} \sigma_1\mathbf{u}_1 & \cdots & \sigma_n\mathbf{u}_n \end{bmatrix}\: \begin{bmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_n^T \end{bmatrix} \\ &= \sigma_1 \mathbf{u}_{1}\mathbf{v}_{1}^T + \cdots + \sigma_r \mathbf{u}_{r}\mathbf{v}_{r}^T = \sum_{i=1}^r \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T, \end{split}\]

where \(r\) is the rank of \(\mathbf{A}\); the terms with \(\sigma_i=0\) contribute nothing and drop out of the sum. The final formula also holds for the case \(m<n\).


Each outer product \(\mathbf{u}_{i}\mathbf{v}_{i}^T\) is a rank-1 matrix of unit 2-norm. Thanks to the ordering of singular values, this expresses \(\mathbf{A}\) as a sum of decreasingly important contributions.
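These facts are easy to confirm numerically. The sketch below is an addition, using a small random \(5\times 3\) matrix:

Code
M = randn(5,3)
U,σ,V = svd(M)
r = rank(M)
opnorm(U[:,1]*V[:,1]'), sum(σ[i]*U[:,i]*V[:,i]' for i in 1:r) ≈ M   # unit 2-norm; the sum reconstructs M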

This motivates the definition, for \(1\le k \le r\),

\[\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T = \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T,\]

where \(\mathbf{U}_k\) and \(\mathbf{V}_k\) are the first \(k\) columns of \(\mathbf{U}\) and \(\mathbf{V}\), respectively, and \(\mathbf{S}_k\) is the upper-left \(k\times k\) submatrix of \(\mathbf{S}\).

The rank of a sum of matrices is always less than or equal to the sum of the ranks, so \(\mathbf{A}_k\) has rank at most \(k\) (in fact exactly \(k\), since \(\sigma_1,\ldots,\sigma_k\) are positive). It turns out that \(\mathbf{A}_k\) is the best approximation to \(\mathbf{A}\) among all matrices of rank at most \(k\), as measured in the matrix 2-norm.


Theorem:

Suppose \(\mathbf{A}\) has rank \(r\) and let \(\mathbf{A}=\mathbf{U}\mathbf{S}\mathbf{V}^T\) be an SVD. Let \[\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_{i}\mathbf{v}_{i}^T = \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T\] for \(1\le k < r\). Then

  1. \(\| \mathbf{A} - \mathbf{A}_k \|_2 = \sigma_{k+1}, \quad k=1,\ldots,r-1\), and
  2. If the rank of \(\mathbf{B}\) is \(k\) or less, then \(\| \mathbf{A}-\mathbf{B} \|_2\ge \sigma_{k+1}\).
Proof of part 1:

Write \(\mathbf{A}_k=\mathbf{U}\hat{\mathbf{S}}\mathbf{V}^T\), where \(\hat{\mathbf{S}}\) is obtained from \(\mathbf{S}\) by setting \(\sigma_{k+1},\ldots,\sigma_r\) to zero; that is, the definition of \(\mathbf{A}_k\) is identical to the SVD of \(\mathbf{A}\) with those singular values zeroed out. It follows that

\[\mathbf{A} - \mathbf{A}_k = \mathbf{U}(\mathbf{S}-\hat{\mathbf{S}})\mathbf{V}^T.\]

After reordering the columns so that the nonzero diagonal entries come first, this is an SVD of \(\mathbf{A} - \mathbf{A}_k\) with singular values \(\sigma_{k+1},\ldots,\sigma_r\) (together with zeros), the largest of which is \(\sigma_{k+1}\). Since the 2-norm equals the largest singular value, this proves the first claim.
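The first claim is also easy to verify numerically; the sketch below (an addition) compares \(\|\mathbf{A}-\mathbf{A}_k\|_2\) with \(\sigma_{k+1}\) for a small random matrix:

Code
C = randn(8,5)
U,σ,V = svd(C)
k = 2
Ck = U[:,1:k]*diagm(σ[1:k])*V[:,1:k]'
opnorm(C - Ck) ≈ σ[k+1]   # 2-norm error equals the next singular value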

Compression

If the singular values of \(\mathbf{A}\) decrease sufficiently rapidly, then \(\mathbf{A}_{k}\) may capture the most significant behavior of the matrix for a reasonably small value of \(k\).

Code
# make an image from text and reload it as a matrix
using Plots, Images
plot(annotations=(0.5,0.5,text("Hello world",44,:center,:center)),
    grid=:none,frame=:none,size=(400,150))
savefig("hello.png")
img = load("hello.png")
A = @. Float64(Gray(img)) 
plot(Gray.(A),frame=:none)

Code
# singular values decrease until they reach numerical zero at k = 45
U,σ,V = svd(A)
scatter(σ,xaxis=("i"), yaxis=(:log10,"sigma_i"),
    title="Singular values",legend=false)

Rapid decrease suggests that low-rank approximations can be fairly accurate!


Code
plt = plot(layout=(2,2),frame=:none,aspect_ratio=1,titlefontsize=10)
for i in 1:4
    k = 3i
    Ak = U[:,1:k]*diagm(σ[1:k])*V[:,1:k]'
    plot!(Gray.(Ak),subplot=i,title="rank = $k")
end
plt
Code
m,n = size(A)  # compression ratio for rank-9 approximation
compression = m*n / (9*(m+n+1))
12.099213551119178
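Where does the denominator come from? Storing the rank-\(k\) approximation requires the \(mk\) entries of \(\mathbf{U}_k\), the \(nk\) entries of \(\mathbf{V}_k\), and the \(k\) diagonal entries of \(\mathbf{S}_k\), compared to \(mn\) entries for the full matrix. The compression ratio is therefore

\[\frac{mn}{k(m+n+1)},\]

which for this image is just over 12 at \(k=9\).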