- June 1, 2020

Lecture Notes for MTH3320 Computational Linear Algebra
Tiangang Cui & Hans De Sterck
May 13, 2019

Contents

Preface

I Linear Systems of Equations

1 Introduction and Model Problems
1.1 A Simple 1D Example from Structural Mechanics
1.1.1 Discretising the ODE
1.1.2 Formulation as a Linear System
1.1.3 Solving the Linear System
1.2 A 2D Example: Poisson’s Equation for Heat Conduction
1.2.1 Discretising the PDE
1.2.2 Formulation as a Linear System
1.2.3 Solving the Linear System
1.3 An Example from Data Analytics: Netflix Movie Recommendation
1.3.1 Movie Recommendation using Linear Algebra and Optimisation
1.3.2 An Alternating Least Squares Approach to Solving the Optimisation Problem

2 LU Decomposition for Linear Systems
2.1 Gaussian Elimination and LU Decomposition
2.1.1 Gaussian Elimination
2.1.2 LU Decomposition
2.1.3 Implementation of LU Decomposition and Computational Cost
2.2 Banded LU Decomposition
2.3 Matrix Norms
2.3.1 Definition of Matrix Norms
2.3.2 Matrix Norm Formulas
2.3.3 Spectral Radius
2.4 Floating Point Number System
2.4.1 Floating Point Numbers
2.4.2 Rounding and Unit Roundoff
2.4.3 IEEE Double Precision Numbers
2.4.4 Rounding and Basic Arithmetic Operations
2.5 Conditioning of a Mathematical Problem
2.5.1 Conditioning of a Mathematical Problem
2.5.2 Conditioning of Elementary Operations
2.5.3 Conditioning of Solving a Linear System
2.6 Stability of a Numerical Algorithm
2.6.1 A Simple Example of a Stable and an Unstable Algorithm
2.6.2 Stability of LU Decomposition

3 Least-Squares Problems and QR Factorisation
3.1 Gram-Schmidt Orthogonalisation and QR Factorisation
3.1.1 Gram-Schmidt Orthogonalisation
3.1.2 QR Factorisation
3.1.3 Modified Gram-Schmidt Orthogonalisation
3.2 QR Factorisation using Householder Transformations
3.2.1 Householder Reflections
3.2.2 Using Householder Reflections to Compute the QR Factorisation
3.2.3 Computing Q
3.2.4 Computational Work
3.3 Overdetermined Systems and Least-Squares Problems
3.3.1 The Normal Equations – A Geometric View
3.3.2 The Normal Equations
3.3.3 Computational Work for Forming and Solving the Normal Equations
3.3.4 Numerical Stability of Using the Normal Equations
3.4 Solving Least-Squares Problems using QR Factorisation
3.4.1 Geometric Interpretation in Terms of Projection Matrices
3.5 Alternating Least-Squares Algorithm for Movie Recommendation
3.5.1 Least-Squares Subproblems for Movie Recommendation

4 The Conjugate Gradient Method for Sparse SPD Systems
4.1 An Optimisation Problem Equivalent to SPD Linear Systems
4.2 The Steepest Descent Method
4.3 The Conjugate Gradient Method
4.4 Properties of the Conjugate Gradient Method
4.4.1 Orthogonality Properties of Residuals and Step Directions
4.4.2 Optimal Error Reduction in the A-Norm
4.4.3 Convergence Speed
4.5 Preconditioning for the Conjugate Gradient Method
4.5.1 Preconditioning for Solving Linear Systems
4.5.2 Left Preconditioning for CG
4.5.3 Preconditioned CG (PCG) Algorithm
4.5.4 Preconditioners for PCG
4.5.5 Using Preconditioners as Stand-Alone Iterative Methods

5 The GMRES Method for Sparse Nonsymmetric Systems
5.1 Minimising the Residual
5.2 Arnoldi Orthogonalisation Procedure
5.3 GMRES Algorithm
5.4 Convergence Properties of GMRES
5.5 Preconditioned GMRES
5.6 Lanczos Orthogonalisation Procedure for Symmetric Matrices

II Eigenvalues and Singular Values

6 Basic Algorithms for Eigenvalues
6.1 Example: Page Rank and Stochastic Matrix
6.2 Fundamentals of Eigenvalue Problems
6.2.1 Notations
6.2.2 Eigenvalue and Eigenvector
6.2.3 Similarity Transformation
6.2.4 Eigendecomposition, Diagonalisation, and Schur Factorisation
6.2.5 Extending Orthogonal Vectors to a Unitary Matrix
6.3 Power Iteration and Inverse Iteration
6.3.1 Power Iteration
6.3.2 Convergence of Power Iteration
6.3.3 Shifted Power Method
6.3.4 Inverse Iteration
6.3.5 Convergence of Inverse Iteration
6.4 Symmetric Matrices and Rayleigh Quotient Iteration
6.4.1 Rate of Convergence
6.4.2 Power Iteration and Inverse Iteration for Symmetric Matrices
6.4.3 Rayleigh Quotient Iteration
6.4.4 Summary of Power, Inverse, and Rayleigh Quotient Iterations

7 QR Algorithm for Eigenvalues
7.1 Two Phases of Eigenvalue Computation
7.2 Hessenberg Form and Tridiagonal Form
7.2.1 Householder Reduction to Hessenberg Form
7.2.2 Implementation and Computational Cost
7.2.3 The Symmetric Case: Reduction to Tridiagonal Form
7.2.4 QR Factorisation of Hessenberg Matrices
7.3 QR algorithm without shifts
7.3.1 Connection with Simultaneous Iteration
7.3.2 Convergence to Schur Form
7.3.3 The Role of Hessenberg Form
7.4 Shifted QR algorithm
7.4.1 Connection with Inverse Iteration
7.4.2 Connection with Shifted Inverse Iteration
7.4.3 Connection with Rayleigh Quotient Iteration
7.4.4 Wilkinson Shift
7.4.5 Deflation

8 Singular Value Decomposition
8.1 Singular Value Decomposition
8.1.1 Understanding SVD
8.1.2 Full SVD and Reduced SVD
8.1.3 Properties of SVD
8.1.4 Compare SVD to Eigendecomposition
8.2 Computing SVD
8.2.1 Connection with Eigenvalue Solvers
8.2.2 A Different Connection with Eigenvalue Solvers
8.2.3 Bidiagonalisation
8.3 Low Rank Matrix Approximation using SVD
8.4 Pseudo Inverse and Least Square Problems using SVD
8.5 X-Ray Imaging using SVD
8.5.1 Mathematical Model
8.5.2 Computational Model
8.5.3 Image Reconstruction

9 Krylov Subspace Methods for Eigenvalues
9.1 The Arnoldi Method for Eigenvalue Problems
9.2 Lanczos Method for Eigenvalue Problems
9.3 How Arnoldi/Lanczos Locates Eigenvalues

10 Other Eigenvalue Solvers
10.1 Jacobi Method
10.2 Divide-and-Conquer

A Appendices
A.1 Notation
A.1.1 Vectors and Matrices
A.1.2 Inner Products
A.1.3 Block Matrices
A.2 Vector Norms
A.2.1 Vector Norms
A.2.2 A-Norm
A.3 Orthogonality
A.4 Matrix Rank and Fundamental Subspaces
A.5 Matrix Determinants
A.6 Eigenvalues
A.6.1 Eigenvalues and Eigenvectors
A.6.2 Similarity Transformations
A.6.3 Diagonalisation
A.6.4 Singular Values of a Square Matrix
A.7 Symmetric Matrices
A.8 Matrices with Special Structure or Properties
A.8.1 Diagonal Matrices
A.8.2 Triangular Matrices
A.8.3 Permutation Matrices
A.8.4 Projectors
A.9 Big O Notation
A.9.1 Big O as $h \to 0$
A.9.2 Big O as $n \to \infty$
A.10 Sparse Matrix Formats
A.10.1 Simple List Storage
A.10.2 Compressed Sparse Column Format

Bibliography

Preface

This document contains lecture notes for MTH3320 – Computational Linear Algebra. Since MTH3320 is offered for the first time in 2017 S2, the notes will be built up as the term progresses.

• Part I of the unit covers numerical methods for solving linear systems $A\vec{x} = \vec{b}$ (weeks 1-6).
• Part II of the unit covers numerical methods for computing eigenvalues and singular values (weeks 7-12).
• The Appendix of the notes covers a brief and condensed review of background material in linear algebra (which may be reviewed in the lectures with some more detail, as needed).

These notes are intended to be used in conjunction with the lectures.
In their first incarnation, these notes will be quite dense, and, depending on the topic, more details and explanations may be provided in the lectures.

Useful reference books on numerical linear algebra include [Saad, 2003], [Trefethen and Bau III, 1997], [Björck, 2015], [Linge and Langtangen, 2016], [Gander et al., 2014], [Demmel, 1997], [Saad, 2011], [Quarteroni et al., 2010], [Ascher and Greif, 2011].

Synopsis of MTH3320

The overall aim of this unit is to study the numerical methods for matrix computations that lie at the core of a wide variety of large-scale computations and innovations in the sciences, engineering, technology and data science. Students will receive an introduction to the mathematical theory of numerical methods for linear algebra (with derivations of the methods and some proofs). This will broadly include methods for solving linear systems of equations, least-squares problems, eigenvalue problems, and other matrix decompositions. Special attention will be paid to conditioning and stability, dense versus sparse problems, and direct versus iterative solution techniques. Students will learn to implement the computational methods efficiently, and will learn how to thoroughly test their implementations for accuracy and performance. Students will work on realistic matrix models for applications in a variety of fields. Applications may include, for example: computation of electrostatic potentials and heat conduction problems; eigenvalue problems for electronic structure calculation; ranking algorithms for webpages; algorithms for movie recommendation, classification of handwritten digits, and document clustering; and principal component analysis in data science.

Part I Linear Systems of Equations

Chapter 1 Introduction and Model Problems

Objectives of this chapter
1. Motivation: In Part I of the unit we will develop, analyse, and implement numerical methods to solve large linear systems $A\vec{x} = \vec{b}$.
2.
This introductory chapter gives some examples of linear systems in real-life applications.
3. These examples will be used as model problems throughout Part I of the unit.

1.1 A Simple 1D Example from Structural Mechanics

Consider a string of unit length under tension $T$, which is subjected to a transverse distributed load of magnitude $p(x)$ per unit length (see figure). Let $u(x)$ denote the vertical displacement at point $x$. We choose signs such that both $p(x)$ and $u(x)$ are positive in the upward direction. For small displacements, the vertical displacement $u(x)$ is governed by the ordinary differential equation (ODE)

\[ \frac{d^2 u(x)}{dx^2} = -\frac{p(x)}{T}. \]

Since the string is fixed on the left and right, we can use boundary conditions $u(0) = 0$ and $u(1) = 0$.

The problem of finding the displacement $u(x)$ is fully specified by the following ODE boundary value problem (BVP):

\[ \text{BVP} \quad \begin{cases} \dfrac{d^2 u(x)}{dx^2} = -\dfrac{p(x)}{T}, & x \in [0, 1], \\ u(0) = 0, \\ u(1) = 0. \end{cases} \]

We can approximate the solution to this problem numerically by discretising the ODE and solving the resulting linear system $A\vec{v} = \vec{b}$.

1.1.1 Discretising the ODE

We discretise the ODE by deriving a finite difference approximation for the second derivative in the ODE, using Taylor series expansions:

\[ u(x+h) = u(x) + u'(x)h + u''(x)\frac{h^2}{2} + u'''(x)\frac{h^3}{6} + O(h^4), \]
\[ u(x-h) = u(x) - u'(x)h + u''(x)\frac{h^2}{2} - u'''(x)\frac{h^3}{6} + O(h^4). \]

Summing these up gives

\[ u(x+h) + u(x-h) = 2u(x) + u''(x)h^2 + O(h^4), \]

from which we obtain

\[ u''(x) = \frac{u(x+h) - 2u(x) + u(x-h)}{h^2} + O(h^2). \tag{1.1} \]

We consider a grid that divides the problem domain $[0, 1]$ into $N + 1$ intervals of equal length

\[ \Delta x = h = \frac{1}{N+1}, \]

with $N + 2$ equally spaced grid points $x_i$ given by

\[ x_i = ih, \qquad i = 0, \dots, N+1 \]

(i.e., there are two boundary points $x_0$ and $x_{N+1}$ at $x = 0$ and $x = 1$, and there are $N$ interior points). We then approximate the unknown function $u(x)$ (the exact solution to the BVP) at the grid points by discrete approximations $v_i$:

\[ v_i \approx u(x_i), \]

using the finite difference formula.
That is, we solve the following discretised BVP for the unknown numerical approximation values $v_i$:

\[ \text{discretised BVP} \quad \begin{cases} \dfrac{v_{i+1} - 2v_i + v_{i-1}}{h^2} = -\dfrac{p(x_i)}{T} & (i = 1, \dots, N), \\ x_i = ih, \\ v_0 = 0, \\ v_{N+1} = 0. \end{cases} \]

1.1.2 Formulation as a Linear System

This discretised BVP can be written as a linear system $A\vec{v} = \vec{b}$ with $N$ equations for the $N$ unknowns $v_i$ ($i = 1, \dots, N$) at the interior points of the problem domain. We normally consider square matrices of size $n \times n$, so for this problem, the total number of unknowns $n$ equals the number of interior grid points, i.e., we have $n = N$.

We write the discretised BVP as $A\vec{v} = \vec{b}$ with the matrix $A \in \mathbb{R}^{n \times n}$ given by the so-called 1-dimensional (1D) Laplacian matrix:

Definition 1.1: 1D Laplacian Matrix (Model Problem 1)

\[ A = \begin{bmatrix} -2 & 1 & 0 & \cdots & & 0 \\ 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -2 & 1 \\ 0 & & \cdots & 0 & 1 & -2 \end{bmatrix} \tag{1.2} \]

The vectors $\vec{v}$ and $\vec{b}$ in $A\vec{v} = \vec{b}$ are given by

\[ \vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_{n-1} \\ v_n \end{bmatrix}, \qquad \vec{b} = -\frac{h^2}{T} \begin{bmatrix} p(x_1) \\ p(x_2) \\ \vdots \\ p(x_{n-1}) \\ p(x_n) \end{bmatrix}, \]

with $h = 1/(n + 1)$.

Note that the matrix $A$ is tridiagonal, and it is very sparse: it has very few nonzero elements (close to 3 per row, on average).

Definition 1.2: Sparse Matrix
Let $A \in \mathbb{R}^{m \times n}$.
1. $\mathrm{nnz}(A)$ is the number of nonzero elements of $A$.
2. $A$ is called a sparse matrix if $\mathrm{nnz}(A) \ll mn$. Otherwise, $A$ is called a dense matrix.

Efficient numerical methods for this problem should exploit this sparsity, and the study of efficient numerical methods for sparse matrix problems is an important focus of this unit.

1.1.3 Solving the Linear System

Suppose the transverse load in the above problem is given specifically by

\[ p(x) = -(3x + x^2)\exp(x), \]

and $T = 100$. The figure below shows the numerical approximation $\vec{v}$ obtained from solving $A\vec{v} = \vec{b}$, for $n = N = 2, 4, 8, 16$.
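As a minimal sketch (not part of the original notes), the 1D model problem can be set up and solved in a few lines of Python using only the standard library. The function names `solve_string_bvp` and `max_error` are illustrative assumptions. The elimination below is ordinary Gaussian elimination restricted to the three nonzero diagonals of $A$, and the closed-form solution $u(x) = x(x-1)\exp(x)/100$ used for comparison is the one quoted in this section.

```python
import math


def solve_string_bvp(n, tension=100.0):
    """Solve the discretised string BVP with n interior points.

    Returns (x, v): the interior grid points and the approximations v_i.
    """
    h = 1.0 / (n + 1)
    x = [(i + 1) * h for i in range(n)]            # interior points x_1..x_n

    def load(s):
        # Transverse load p(x) from Section 1.1.3
        return -(3.0 * s + s * s) * math.exp(s)

    # Right-hand side b = -(h^2 / T) * p(x_i)
    rhs = [-(h * h / tension) * load(xi) for xi in x]

    # Forward elimination on A = tridiag(1, -2, 1); only the diagonal changes.
    diag = [-2.0] * n
    for i in range(1, n):
        m = 1.0 / diag[i - 1]      # multiplier (sub-diagonal entry is 1)
        diag[i] -= m * 1.0         # super-diagonal entry of row i-1 is 1
        rhs[i] -= m * rhs[i - 1]

    # Back substitution
    v = [0.0] * n
    v[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        v[i] = (rhs[i] - v[i + 1]) / diag[i]
    return x, v


def max_error(n):
    """Largest grid-point error |u(x_i) - v_i| against the closed form."""
    x, v = solve_string_bvp(n)
    exact = [xi * (xi - 1.0) * math.exp(xi) / 100.0 for xi in x]
    return max(abs(e - vi) for e, vi in zip(exact, v))


# n = 7, 15, 31 gives h = 1/8, 1/16, 1/32, so h halves at each step.
errors = {n: max_error(n) for n in (7, 15, 31)}
for n, err in errors.items():
    print(n, err)
```

Each halving of $h$ should reduce the error by roughly a factor of 4, in line with the $O(h^2)$ accuracy of Eq. (1.1). The work is proportional to $n$ because the zeros outside the three diagonals are never touched.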
[Figure: four panels showing the numerical approximation for $n = 2, 4, 8, 16$; horizontal axis $x$ from 0 to 1, vertical axis from $-5 \times 10^{-3}$ to 0.]

As it happens, the exact solution to this problem can also be obtained in closed form:

\[ u(x) = x(x - 1)\exp(x)/100 \]

(it is shown in blue in the figure). This allows us to verify the accuracy of the numerical approximation, and it can be shown theoretically and verified numerically that the error $u(x_i) - v_i = O(h^2)$.

Problem 1.3: Vertical Displacement in a String (not examinable)
Can you show that the vertical displacement $u(x)$ is governed by the ODE $u_{xx} = -p(x)/T$? (Hint: Assume that displacements are small, so that the tension $T$ can be taken as constant over the whole string, and so that the angle $\theta$ can be considered small ($\theta$ is measured from the horizontal in counter-clockwise direction). Consider vertical force equilibrium.)

1.2 A 2D Example: Poisson’s Equation for Heat Conduction

We first consider models for heat flow in a metal plate. The flow of heat in a metal plate can be modelled by the heat equation, which is a partial differential equation (PDE) that describes the evolution of the temperature in the plate, $u(x, y, t)$, in space and time:

\[ \frac{\partial u}{\partial t} = \kappa \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) + g(x, y). \tag{1.3} \]

Here, $\kappa$ is the heat conduction coefficient, and $g(x, y)$ is a heat source or sink.

We consider the specific problem of determining the stationary temperature distribution in a square domain $\Omega$ of length 1 m, $(x, y) \in \Omega = [0, 1] \times [0, 1]$, with the temperature on the four boundaries fixed at $u = u_0$ where $u_0 = 600$ Kelvin, and with a heat source $g(x, y)$ with Gaussian profile centred at $(x, y) = (3/4, 3/4)$:

\[ g(x, y) = 10{,}000 \exp\left( -\frac{(x - 3/4)^2 + (y - 3/4)^2}{0.01} \right). \]

For simplicity we set the heat conduction coefficient to $\kappa = 1$. Since we seek a stationary solution, we can set the time derivative in Eq.
(1.3) equal to zero, and solve

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y), \tag{1.4} \]

with $f(x, y) = -g(x, y)/\kappa$.

The problem of finding the stationary temperature profile $u(x, y)$ is then fully specified by the following PDE boundary value problem (BVP):

\[ \text{BVP} \quad \begin{cases} \dfrac{\partial^2 u}{\partial x^2} + \dfrac{\partial^2 u}{\partial y^2} = -g(x, y), & (x, y) \in \Omega = [0, 1] \times [0, 1], \\ u(x, y) = u_0 & \text{on } \partial\Omega, \end{cases} \]

where $\partial\Omega$ denotes the boundary of the spatial domain $\Omega$. We can approximate the solution to this problem numerically by discretising the PDE and solving the resulting linear system $A\vec{v} = \vec{b}$.

Eq. (1.4) is called Poisson’s equation, and it arises in many areas of application, including Newtonian gravity, electrostatics, and elasticity. When $g(x, y) = 0$, the equation is called Laplace’s equation. The symbol $\Delta$ is often used as a shorthand notation for the differential operator in Eq. (1.4), and

\[ \Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \]

is called the Laplacian of $u$. Note that the 1D string problem described in the previous section features the 1D version of the Laplacian operator. The Laplacian operator can clearly also be extended to dimension 3 and higher.

1.2.1 Discretising the PDE

We discretise the PDE by using finite difference approximations for the second-order partial derivatives that are similar to Eq. (1.1):

\[ \frac{\partial^2 u(x, y)}{\partial x^2} = \frac{u(x+h, y) - 2u(x, y) + u(x-h, y)}{h^2} + O(h^2), \]
\[ \frac{\partial^2 u(x, y)}{\partial y^2} = \frac{u(x, y+h) - 2u(x, y) + u(x, y-h)}{h^2} + O(h^2). \]

We consider a regular Cartesian grid that partitions the problem domain into squares of equal size by dividing both the $x$-range and the $y$-range into $N + 1$ intervals of equal length

\[ \Delta x = \Delta y = h = \frac{1}{N+1}. \]

The grid points $x_i$ and $y_j$ are given by

\[ x_i = ih, \quad i = 0, \dots, N+1, \qquad y_j = jh, \quad j = 0, \dots, N+1 \]

(i.e., there are layers of boundary points at $x_0$, $x_{N+1}$, $y_0$, and $y_{N+1}$, and there are $N^2$ interior points). We then approximate the unknown function $u(x, y)$ (the exact solution to the BVP) at the grid points by discrete approximations $w_{i,j}$:

\[ w_{i,j} \approx u(x_i, y_j), \]

using the finite difference formula.
That is, we solve the following discretised BVP for the unknown numerical approximation values $w_{i,j}$:

\[ \text{discretised BVP} \quad \begin{cases} \dfrac{w_{i+1,j} + w_{i,j+1} - 4w_{i,j} + w_{i-1,j} + w_{i,j-1}}{h^2} = -g(x_i, y_j) & (i, j = 1, \dots, N), \\ x_i = ih, \quad y_j = jh, \\ w_{0,j} = w_{N+1,j} = w_{i,0} = w_{i,N+1} = u_0. \end{cases} \tag{1.5} \]

1.2.2 Formulation as a Linear System

Similar to the 1D model problem, the 2D discretised BVP can be written as a linear system $A\vec{v} = \vec{b}$, with now $N^2$ equations for the $N^2$ unknowns $w_{i,j}$ ($i, j = 1, \dots, N$) at the interior points of the problem domain. Here, $A \in \mathbb{R}^{n \times n}$ with total number of unknowns $n = N^2$.

We first have to assemble the $N^2$ unknowns $w_{i,j}$ ($i, j = 1, \dots, N$) into a single vector $\vec{v}$. We can do this using lexicographic ordering by rows, in which we assemble rows of $w_{i,j}$ in the spatial domain into $\vec{v}$, from top to bottom starting from row $j = 1$, and from left to right within each row. For example, when $N = 3$, the vector $\vec{v}$ is given by

\[ \vec{v} = \begin{bmatrix} w_{1,1} & w_{2,1} & w_{3,1} & w_{1,2} & w_{2,2} & w_{3,2} & w_{1,3} & w_{2,3} & w_{3,3} \end{bmatrix}^T. \]

Next, if we want to write the BVP as a linear system $A\vec{v} = \vec{b}$ of the $N^2$ interior unknowns, the values of $w_{i,j}$ at boundary points of the domain need to be moved to the right-hand side (RHS) of the discretised PDE in Eq. (1.5). If we do this, the system matrix in $A\vec{v} = \vec{b}$ is given by the so-called 2-dimensional (2D) Laplacian matrix:

Definition 1.4: 2D Laplacian Matrix (Model Problem 2)

\[ A = \begin{bmatrix} T & I & 0 & \cdots & & 0 \\ I & T & I & 0 & \cdots & 0 \\ 0 & I & T & I & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & I & T & I \\ 0 & & \cdots & 0 & I & T \end{bmatrix} \in \mathbb{R}^{n \times n}, \tag{1.6} \]

where $n = N^2$ and $T$ and $I$ are block matrices in $\mathbb{R}^{N \times N}$:

\[ T = \begin{bmatrix} -4 & 1 & 0 & \cdots & & 0 \\ 1 & -4 & 1 & 0 & \cdots & 0 \\ 0 & 1 & -4 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 & -4 & 1 \\ 0 & & \cdots & 0 & 1 & -4 \end{bmatrix} \in \mathbb{R}^{N \times N}, \tag{1.7} \]

and $I$ is the $N \times N$ identity matrix.

The vector $\vec{b}$ in $A\vec{v} = \vec{b}$ is given by $-h^2 g(x, y)$ evaluated at $x_i$ and $y_j$, plus a contribution of $-u_0$ for every neighbour of $w_{i,j}$ that lies on the boundary.
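The assembly just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the notes' own implementation: the names `assemble_poisson` and `source` are hypothetical, and the matrix is stored as a dense list of lists purely so that its structure can be inspected for small $N$. A practical code would store only the nonzero entries.

```python
import math


def assemble_poisson(N, g, u0):
    """Assemble (A, b) for the N*N interior unknowns of Eq. (1.5).

    Unknown w_{i,j} is stored at index k = (j - 1) * N + (i - 1),
    i.e. lexicographic ordering by rows, starting from row j = 1.
    """
    h = 1.0 / (N + 1)
    n = N * N
    A = [[0.0] * n for _ in range(n)]   # dense storage, for illustration only
    b = [0.0] * n
    for j in range(1, N + 1):           # grid row (y index)
        for i in range(1, N + 1):       # position within the row (x index)
            k = (j - 1) * N + (i - 1)
            A[k][k] = -4.0
            b[k] = -h * h * g(i * h, j * h)
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 1 <= ii <= N and 1 <= jj <= N:
                    A[k][(jj - 1) * N + (ii - 1)] = 1.0   # interior neighbour
                else:
                    b[k] -= u0          # boundary neighbour moves to the RHS
    return A, b


def source(x, y):
    # Gaussian heat source g(x, y) from Section 1.2
    return 10000.0 * math.exp(-((x - 0.75) ** 2 + (y - 0.75) ** 2) / 0.01)


A, b = assemble_poisson(3, source, 600.0)
nnz = sum(1 for row in A for entry in row if entry != 0.0)
print("nonzeros:", nnz, "of", len(A) ** 2)
```

For $N = 3$ the centre unknown is the only one without a boundary neighbour, so its row has five nonzeros and its right-hand side receives no $u_0$ contribution.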
For the simple example with $N = 3$ (where only the midpoint of the grid does not have neighbour points on the boundary), $\vec{b}$ is given by

\[ \vec{b} = \begin{bmatrix} -h^2 g_{1,1} - 2u_0 \\ -h^2 g_{2,1} - u_0 \\ -h^2 g_{3,1} - 2u_0 \\ -h^2 g_{1,2} - u_0 \\ -h^2 g_{2,2} \\ -h^2 g_{3,2} - u_0 \\ -h^2 g_{1,3} - 2u_0 \\ -h^2 g_{2,3} - u_0 \\ -h^2 g_{3,3} - 2u_0 \end{bmatrix}, \]

where $g_{i,j} = g(x_i, y_j)$ and $h = 1/(N + 1)$.

Note that the matrix $A$ is block tridiagonal, and it is very sparse: it has very few nonzero elements (close to 5 per row, on average). Again, it is essential that efficient numerical methods for this problem exploit this sparsity, and the study of efficient numerical methods for sparse matrices like the 2D Laplacian matrix is an important focus of this unit. For example, a 2D resolution of $1000 \times 1000$ grid points is quite modest for scientific applications on current-day computers. In this case, $A \in \mathbb{R}^{n \times n}$ with $n = N^2 = 10^6$. Using Gaussian elimination (or, equivalently, LU decomposition) in a naive fashion (without taking advantage of the zeros in the sparse matrix), the number of floating point operations required, $W$, would scale like $W = O(n^3) = O(10^{18})$, which would take a very large amount of time. In this unit we will pursue methods for sparse matrices with work complexity approaching $W = O(n)$. Such methods power many of today’s advances in science, engineering and technology.

1.2.3 Solving the Linear System

When solving the linear system in Matlab using $N = 64$, we obtain the following plots for the source term and for the approximation of the temperature profile (surface and contour plots, using Matlab’s mesh and contour):

[Figure: surface plot of the source term over the unit square, with values from 0 to 10000.]
[Figure: contour plot of the source term over the unit square.]

[Figure: surface plot and contour plot of the approximate temperature profile over the unit square, with the temperature axis running from 600 to 680.]

1.3 An Example from Data Analytics: Netflix Movie Recommendation

In 2006, the online DVD-rental and video streaming company Netflix launched a competition for the best collaborative filtering algorithm to predict user ratings for films, based on a training data set of previous ratings.

Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies, with ratings from 1 to 5 (integral) stars. Let the number of users be given by $m = 480{,}189$, and the number of movies by $n = 17{,}770$. Each rating consists of a triplet $(i, j, v)$, where $i$ is the user ID, $j$ is the movie ID, and $v$ is the rating value in the range 1–5.

The training ratings can be stored in a sparse ratings matrix $R \in \mathbb{R}^{m \times n}$. The set of matrix indices with known values is indicated by index set $\mathcal{R} = \{(i, j)\}$.

For example, a simple ratings matrix $R$ with $m = 7$ users and $n = 4$ movies could be given by

\[ R = \begin{bmatrix} & 2 & & \\ & 3 & & \\ 5 & & & \\ & 1 & & \\ & & 1 & 5 \\ 1 & & & 5 \\ & & 2 & \end{bmatrix}, \]

with index set

\[ \mathcal{R} = \{(1, 2), (2, 2), (3, 1), (4, 2), (5, 3), (5, 4), (6, 1), (6, 4), (7, 3)\}. \]

(To be precise, the ratings matrix is actually not a usual sparse matrix, in which values that are not stored are assumed to be zero, but rather an incomplete matrix, with values that are not stored considered unknown.)

The goal of a collaborative filtering algorithm is to predict the unknown ratings in $R$ based on the training data in $R$. These predicted ratings can then be used to recommend movies to users. In linear algebra, this type of problem is known as a matrix completion problem.

The recommendation problem (for movies, music, books, ...
) can be seen as a problem in the field of machine learning, which studies algorithms that can learn from and make predictions on data. In the sub-category of supervised learning, the computer is presented with example inputs and their desired outputs (the training data set), and the goal is to learn a general rule that maps inputs to outputs.

1.3.1 Movie Recommendation using Linear Algebra and Optimisation

A powerful approach to attack the matrix completion problem is to seek matrices $U \in \mathbb{R}^{f \times m}$ and $M \in \mathbb{R}^{f \times n}$, with $f$ a small integer, $f \ll m, n$, such that $U^T M$ approximates the ratings matrix $R$ on the set of known ratings, $\mathcal{R}$. Pictorially, we seek $U$ and $M$ such that

\[ R \approx U^T M. \tag{1.8} \]

In practice, we will seek $U$ and $M$ that are dense, and we will allow their elements to assume any real value. Each row in these matrices represents a latent feature or factor of the data. The $U^T M$ decomposition of $R$ effectively seeks to provide a model that, with a small number of features $f$ (typically chosen $\le 50$), is able to provide good predictions for the unknown values in $R$.

The user and movie matrices $U$ and $M$ have shape

\[ U = \begin{bmatrix} \vec{u}_1 & \cdots & \vec{u}_i & \cdots & \vec{u}_m \end{bmatrix}, \tag{1.9} \]
\[ M = \begin{bmatrix} \vec{m}_1 & \cdots & \vec{m}_j & \cdots & \vec{m}_n \end{bmatrix}. \tag{1.10} \]

The column vectors of $U$, $\vec{u}_i \in \mathbb{R}^f$, are called the user feature vectors, and the column vectors of $M$, $\vec{m}_j \in \mathbb{R}^f$, are called the movie feature vectors.

With $f \ll m, n$, the interpretation of the approximation $U^T M$ is that, for each user $i$ and movie $j$, their affinity for each of the $f$ latent ‘feature categories’ is encoded in the vectors $\vec{u}_i$ and $\vec{m}_j$. (For instance, if feature $k$ were to represent the ‘comedy’ category, $u_{k,i}$ would express to which degree user $i$ is into comedies, and $m_{k,j}$ would express to which degree movie $j$ is a comedy.)

The approximation $U^T M$ of $R$ with small $f$ is called a low-rank approximation of $R$, since

\[ U^T M = \sum_{k=1}^{f} (U^T)_{*k} (M)_{k*}, \tag{1.11} \]

where $(U^T)_{*k}$ is the $k$th column of $U^T$ and $(M)_{k*}$ is the $k$th row of $M$, and the terms $(U^T)_{*k} (M)_{k*}$ are $m \times n$ matrices of matrix rank 1.
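Eq. (1.11) can be checked numerically on a tiny example. The sketch below is illustrative only: the helper names and the entries of `U` and `M` are made up, with $f = 2$, $m = 3$ and $n = 2$. It forms $U^T M$ once entry by entry and once as a sum of $f$ rank-1 outer products.

```python
def ut_times_m(U, M):
    """Entry (i, j) of U^T M is the inner product of column i of U and column j of M."""
    f, m, n = len(U), len(U[0]), len(M[0])
    return [[sum(U[k][i] * M[k][j] for k in range(f)) for j in range(n)]
            for i in range(m)]


def outer(col, row):
    """Rank-1 matrix col * row^T."""
    return [[c * r for r in row] for c in col]


def rank_one_sum(U, M):
    """Sum over k of (k-th column of U^T) times (k-th row of M), as in Eq. (1.11)."""
    f, m, n = len(U), len(U[0]), len(M[0])
    total = [[0] * n for _ in range(m)]
    for k in range(f):
        # The k-th column of U^T is the k-th row of U.
        term = outer(U[k], M[k])
        total = [[t + s for t, s in zip(trow, srow)]
                 for trow, srow in zip(total, term)]
    return total


# Hypothetical feature matrices: f = 2 features, m = 3 users, n = 2 movies.
U = [[1, 0, 2],
     [0, 1, 1]]
M = [[1, 2],
     [3, 4]]

print(ut_times_m(U, M))
print(rank_one_sum(U, M))
```

Both calls produce the same $3 \times 2$ matrix, which has rank at most $f = 2$ regardless of how many users and movies there are.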
We can seek user and movie matrices $U$ and $M$ that optimally approximate the rating matrix $R$, if we choose a specific sense in which $U^T M$ should approximate $R$. We define the Frobenius norm of a matrix as follows.

Definition 1.5: Frobenius Norm of a Matrix
Let $A \in \mathbb{R}^{m \times n}$. Then the Frobenius norm of $A$ is given by

\[ \|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}. \]

It is natural, then, to seek $U$ and $M$ such that the following measure of the difference between $U^T M$ and $R$ is minimised:

\[ g(U, M) = \|R - U^T M\|_{F,\mathcal{R}}^2, \tag{1.12} \]

where the $\|\cdot\|_{F,\mathcal{R}}$ norm is a partial Frobenius norm, summed only over the known entries of $R$, as given by the index set $\mathcal{R}$.

In practice, it is necessary to add a regularisation term to $g(U, M)$, to ensure the optimisation problem is well-posed and gives useful results. So the final optimisation problem we seek to solve for the recommendation task is

\[ \min_{U, M} g(U, M) = \sum_{(i,j) \in \mathcal{R}} \left( r_{ij} - \vec{u}_i^T \vec{m}_j \right)^2 + \lambda \left( \sum_{i=1}^{m} \mathrm{nnz}((R)_{i*}) \, \|\vec{u}_i\|_2^2 + \sum_{j=1}^{n} \mathrm{nnz}((R)_{*j}) \, \|\vec{m}_j\|_2^2 \right), \tag{1.13} \]

where $\mathrm{nnz}((R)_{i*})$ is the number of movies rated by user $i$, and $\mathrm{nnz}((R)_{*j})$ is the number of users that rated movie $j$. The regularisation parameter $\lambda$ is a fixed number that can be chosen by trial-and-error or by techniques such as cross-validation.

1.3.2 An Alternating Least Squares Approach to Solving the Optimisation Problem

We seek $U \in \mathbb{R}^{f \times m}$ and $M \in \mathbb{R}^{f \times n}$ that minimise $g(U, M)$. A popular way of solving the optimisation problem is to determine $U$ and $M$ in an alternating fashion: starting from an initial guess for $M$, determine the optimal $U$ with $M$ fixed, then determine the optimal $M$ with $U$ fixed, and so forth. As it turns out, each subproblem of determining $U$ with fixed $M$ (and vice versa) in this alternating algorithm boils down to a (regularised) linear least-squares problem, and the resulting procedure is called Alternating Least Squares (ALS). Also, with fixed $M$, each column of $U$ can be determined independently of the other columns (and vice versa for $M$, with fixed $U$).
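To make the objective in Eq. (1.13) concrete, here is a small sketch that evaluates $g(U, M)$ directly from a list of rating triplets, using the $7 \times 4$ example ratings from Section 1.3. The function name `objective` and the storage layout are assumptions made for illustration: the feature vectors are kept as lists, `U[i-1]` for user $i$ and `M[j-1]` for movie $j$, that is, as the columns of $U$ and $M$. Only known ratings enter the sums, so unknown entries are never treated as zeros.

```python
# Known ratings (i, j) -> r_ij from the 7 x 4 example, with 1-based indices.
ratings = {(1, 2): 2, (2, 2): 3, (3, 1): 5, (4, 2): 1, (5, 3): 1,
           (5, 4): 5, (6, 1): 1, (6, 4): 5, (7, 3): 2}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective(U, M, ratings, lam):
    """Evaluate g(U, M) of Eq. (1.13).

    U[i-1] is the feature vector of user i, M[j-1] that of movie j.
    """
    # Partial Frobenius norm: squared misfit over the known ratings only.
    misfit = sum((r - dot(U[i - 1], M[j - 1])) ** 2
                 for (i, j), r in ratings.items())

    # nnz((R)_{i*}) and nnz((R)_{*j}): ratings per user and per movie.
    per_user = [0] * len(U)
    per_movie = [0] * len(M)
    for (i, j) in ratings:
        per_user[i - 1] += 1
        per_movie[j - 1] += 1

    penalty = (sum(c * dot(u, u) for c, u in zip(per_user, U))
               + sum(c * dot(mv, mv) for c, mv in zip(per_movie, M)))
    return misfit + lam * penalty


# With all features zero, g is just the sum of squared ratings.
f = 2
U0 = [[0.0] * f for _ in range(7)]
M0 = [[0.0] * f for _ in range(4)]
print(objective(U0, M0, ratings, lam=0.1))
```

An ALS iteration would decrease this value by alternately updating the entries of `U` and `M`; the update formulas themselves are not shown here.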
This means that ALS can be executed efficiently in parallel, which makes it suitable for big data sets.

The figure below (from [Winlaw et al., 2015]) shows the performance of ALS on a small ratings matrix of size $400 \times 80$. Typically ALS requires quite a few iterations to reach high accuracy, and it is possible to improve its convergence behaviour, for example, as shown for the ALS-NCG method.

One of the focus areas of Part I of this unit is to solve least-squares problems in accurate and efficient ways. We will return to the movie recommendation problem in that context. In particular, we will learn how to derive the formulas for determining $U$ with fixed $M$, and vice versa, and will use them to solve movie recommendation problems.

PS: On September 21, 2009, the grand prize of US$1,000,000 for the Netflix prize competition was given to the BellKor’s Pragmatic Chaos team which bested Netflix’s own algorithm for predicting ratings by 10.06% (using a blend of approaches, including multiple variations of the matrix factorization approach).

PPS: Although the Netflix prize data sets were constructed to preserve customer privacy, in 2007, two researchers showed it was possible to identify individual users by matching the data sets with film ratings on the Internet Movie Database. On December 17, 2009, four Netflix users filed a class action lawsuit against Netflix, alleging that Netflix had violated U.S. fair trade laws and the Video Privacy Protection Act by releasing the data sets. The sequel to the Netflix prize was canceled. We are living in a crazy world.

Chapter 2 LU Decomposition for Linear Systems

2.1 Gaussian Elimination and LU Decomposition

We consider nonsingular linear systems $A\vec{x} = \vec{b}$ where $A \in \mathbb{R}^{n \times n}$. We recall the following theorem about solvability of linear systems.
Theorem 2.1
Let $A \in \mathbb{R}^{n \times n}$ be nonsingular (i.e., $\det(A) \neq 0$), and let $\vec{b} \in \mathbb{R}^n$. Then the linear system $A\vec{x} = \vec{b}$ has a unique solution, given by $\vec{x} = A^{-1}\vec{b}$. If $A$ is singular, $A\vec{x} = \vec{b}$ either has infinitely many solutions (if $\vec{b} \in \mathrm{range}(A)$), or no solution (if $\vec{b} \notin \mathrm{range}(A)$).

2.1.1 Gaussian Elimination

We first consider standard Gaussian elimination (GE) and assume that no zero pivot elements are encountered, so no pivoting (switching of rows) is required.

Example 2.2: One Step of Gaussian Elimination
Let
\[ A = \begin{bmatrix} 2 & 3 & 4 \\ 6 & 8 & 4 \\ 8 & 9 & 0 \end{bmatrix}. \]
In the first step of GE, 2 is the pivot element, and we add $-6/2$ times row 1 to row 2, and add $-8/2$ times row 1 to row 3, resulting in the matrix
\[ \begin{bmatrix} 2 & 3 & 4 \\ 0 & -1 & -8 \\ 0 & -3 & -16 \end{bmatrix}. \]

For the case of a general system $A\vec{x} = \vec{b}$, we can write the result of one step of GE for
\[ \begin{bmatrix} a_{11} & \vec{r}_1^T \\ \vec{c}_1 & A^{(2)} \end{bmatrix} \begin{bmatrix} x_1 \\ \vec{x}^{(2)} \end{bmatrix} = \begin{bmatrix} b_1 \\ \vec{b}^{(2)} \end{bmatrix} \]
as
\[ \begin{bmatrix} a_{11} & \vec{r}_1^T \\ 0 & A^{(2)} - \dfrac{\vec{c}_1}{a_{11}} \vec{r}_1^T \end{bmatrix} \begin{bmatrix} x_1 \\ \vec{x}^{(2)} \end{bmatrix} = \begin{bmatrix} b_1 \\ \vec{b}^{(2)} - \dfrac{\vec{c}_1}{a_{11}} b_1 \end{bmatrix}. \]

2.1.2 LU Decomposition

The following theorem and its proof show us that Gaussian elimination on $A \in \mathbb{R}^{n \times n}$ (when no zero pivots are encountered) is equivalent to decomposing $A$ as the product $LU$ of two triangular matrices, and tell us how to construct the $L$ and $U$ factors.

Theorem 2.3: LU Decomposition
Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Assume no zero pivots arise when applying standard Gaussian elimination (without pivoting) to $A$. Then $A$ can be decomposed as $A = LU$, where $L \in \mathbb{R}^{n \times n}$ is unit lower triangular, and $U \in \mathbb{R}^{n \times n}$ is upper triangular and nonsingular.

Proof. The proof proceeds by mathematical induction on $n$.

Base case: The statement holds for $n = 1$, since for any $a \in \mathbb{R}^{1 \times 1} = \mathbb{R}$ with $a \neq 0$ (i.e., $1/a$ exists, so $a$ is nonsingular), the LU decomposition $a = l\,u$ exists, with $l = 1$ and $u = a$ nonsingular.

Induction step: We show that, if the statement of the theorem holds for $n - 1$, then it holds for $n$.
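We perform one step of Gaussian elimination on the $n \times n$ matrix
\[ A = \begin{bmatrix} a_{11} & \vec{r}_1^T \\ \vec{c}_1 & A^{(2)} \end{bmatrix}, \]
which is assumed nonsingular and such that no zero pivots arise when applying GE to it. Since $a_{11} \neq 0$ we can define the Gauss transformation matrix
\[ M^{(1)} = \begin{bmatrix} 1 & 0 \\ \vec{m}_1 & I^{(2)} \end{bmatrix}, \]
with $\vec{m}_1 = -\vec{c}_1 / a_{11}$ and $I^{(2)}$ the identity matrix of size $(n-1) \times (n-1)$. Then the first step of Gaussian elimination can be written as
\[ M^{(1)} A = \begin{bmatrix} a_{11} & \vec{r}_1^T \\ 0 & A^{(2)} + \vec{m}_1 \vec{r}_1^T \end{bmatrix} = \begin{bmatrix} a_{11} & \vec{r}_1^T \\ 0 & \tilde{A} \end{bmatrix}, \]
where $\tilde{A}$ is an $(n-1) \times (n-1)$ matrix for which no zero pivots arise when applying GE to it (since the same holds for $A$); this also implies that $\tilde{A}$ is nonsingular. By the induction hypothesis, $\tilde{A}$ can be decomposed as $\tilde{L}\tilde{U}$, which leads to
\[ M^{(1)} A = \begin{bmatrix} a_{11} & \vec{r}_1^T \\ 0 & \tilde{L}\tilde{U} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & \tilde{L} \end{bmatrix} \begin{bmatrix} a_{11} & \vec{r}_1^T \\ 0 & \tilde{U} \end{bmatrix} \]
or
\[ A = \left( M^{(1)} \right)^{-1} \begin{bmatrix} 1 & 0 \\ 0 & \tilde{L} \end{bmatrix} \begin{bmatrix} a_{11} & \vec{r}_1^T \\ 0 & \tilde{U} \end{bmatrix}. \]
The inverse of $M^{(1)}$ is easily obtained by observing that
\[ \begin{bmatrix} 1 & 0 \\ -\vec{m}_1 & I^{(2)} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \vec{m}_1 & I^{(2)} \end{bmatrix} = I = \left( M^{(1)} \right)^{-1} M^{(1)}. \]
Then
\[ A = \begin{bmatrix} 1 & 0 \\ -\vec{m}_1 & I^{(2)} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \tilde{L} \end{bmatrix} \begin{bmatrix} a_{11} & \vec{r}_1^T \\ 0 & \tilde{U} \end{bmatrix} = LU, \]
with
\[ L = \begin{bmatrix} 1 & 0 \\ -\vec{m}_1 & \tilde{L} \end{bmatrix}. \tag{2.1} \]
Matrix $U$ is nonsingular: $\det(U) = a_{11} \det(\tilde{U}) \neq 0$, since $\tilde{U}$ is nonsingular and $a_{11} \neq 0$. This proves the induction step and completes the proof.

Eq. (2.1) shows that $L$ can be obtained by inserting the multiplier elements $-\vec{m}_1 = \vec{c}_1 / a_{11}$ in its columns, for every step of Gaussian elimination. Note also that, by the construction in the proof, the LU decomposition is unique (when no pivoting is performed).

If pivoting is employed during Gaussian elimination (e.g., when zero pivots arise), a similar theorem holds:

Theorem 2.4: LU Decomposition
For any $A \in \mathbb{R}^{n \times n}$, a decomposition $PA = LU$ exists where $P \in \mathbb{R}^{n \times n}$ is a permutation matrix, $L \in \mathbb{R}^{n \times n}$ is unit lower triangular, and $U \in \mathbb{R}^{n \times n}$ is upper triangular.

Here, $P$ encodes the row permutations of the pivoting operations. The $PA = LU$ decomposition is unique when $P$ is fixed, and this theorem also holds for singular $A$.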
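The construction in the proof of Theorem 2.3 can be tried out on the matrix of Example 2.2. The following is a minimal sketch in Python, not part of the original notes: it uses exact rational arithmetic (`fractions.Fraction`) so that no rounding interferes, and the function names `lu_no_pivoting` and `matmul` are chosen here purely for illustration. It assumes a square matrix for which no zero pivot is encountered.

```python
from fractions import Fraction

def lu_no_pivoting(A):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular.

    Illustrative helper: assumes A is square and no zero pivot arises.
    """
    n = len(A)
    U = [[Fraction(v) for v in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n - 1):                # pivot k
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot: pivoting would be required")
        for i in range(k + 1, n):         # row i
            m = U[i][k] / U[k][k]         # multiplier, an entry of c_1 / a_11
            L[i][k] = m                   # multipliers fill the columns of L, cf. Eq. (2.1)
            for j in range(k, n):         # column j
                U[i][j] -= m * U[k][j]
    return L, U

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[2, 3, 4], [6, 8, 4], [8, 9, 0]]     # matrix of Example 2.2
L, U = lu_no_pivoting(A)
print("L =", [[str(v) for v in row] for row in L])
print("U =", [[str(v) for v in row] for row in U])
print("L*U == A:", matmul(L, U) == A)
```

The first column of $L$ contains the multipliers $6/2 = 3$ and $8/2 = 4$ from Example 2.2, and the product $LU$ reproduces $A$ exactly.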
Remark 2.5: Solving a Linear System using LU Decomposition
We can solve $A\vec{x} = \vec{b}$ in three steps:
1. compute $L$ and $U$ in the decomposition $A = LU$, leading to $LU\vec{x} = \vec{b}$
2. solve $L\vec{y} = \vec{b}$ using forward substitution
3. solve $U\vec{x} = \vec{y}$ using backward substitution

2.1.3 Implementation of LU Decomposition and Computational Cost

Implementation of LU Decomposition

A basic implementation of LU decomposition in Matlab-like pseudo-code is given by

Algorithm 2.6: LU decomposition, kij version
Input: Matrix A
Output: L and U

    n = size(A,1);
    U = A; L = eye(n);
    for k = 1:n-1                        % pivot k
        for i = k+1:n                    % row i
            m = U(i,k)/U(k,k);           % multiplier
            U(i,k) = 0;
            for j = k+1:n                % column j
                U(i,j) = U(i,j) - m*U(k,j);
            end
            L(i,k) = m;
        end
    end

However, we can also implement LU decomposition in-place:

Algorithm 2.7: LU decomposition, kij version, in-place
Input: Matrix A
Output: L and U stored in A

    n = size(A,1);
    for k = 1:n-1                        % pivot k
        for i = k+1:n                    % row i
            A(i,k) = A(i,k)/A(k,k);      % multiplier, stored in the strictly lower part
            for j = k+1:n                % column j
                A(i,j) = A(i,j) - A(i,k)*A(k,j);
            end
        end
    end

Also, we can depart from the standard order of operations in Gaussian elimination and do all the operations for row $i$ of $A$ at once, in the so-called ikj version of the algorithm:

Algorithm 2.8: LU decomposition, ikj version, in-place
Input: Matrix A
Output: L and U stored in A

    n = size(A,1);
    for i = 2:n                          % row i
        for k = 1:i-1                    % pivot k
            A(i,k) = A(i,k)/A(k,k);
            for j = k+1:n                % column j
                A(i,j) = A(i,j) - A(i,k)*A(k,j);
            end
        end
    end

Computational Work for LU Decomposition

We now consider the amount of computational work that is spent by the LU decomposition algorithm, in terms of the number of floating point operations (flops) performed to decompose an $n \times n$ matrix $A$. We count the number of additions and subtractions (which we indicate by A) and the number of multiplications, divisions, and square roots (indicated by M).
We assume that these operations take the same amount of work, which is a reasonable assumption for modern computer processors.

The following summation identities are useful when determining computational work:
\[ \sum_{p=1}^{n-1} 1 = n - 1, \qquad \sum_{p=1}^{n-1} p = \frac{1}{2} n(n-1), \qquad \sum_{p=1}^{n-1} p^2 = \frac{1}{6} n(n-1)(2n-1). \]

We consider the kij version of the algorithm and sum over the three nested loops to determine the work $W$ of LU decomposition:
\begin{align*}
W &= \sum_{k=1}^{n-1} \sum_{i=k+1}^{n} \Big( 1\,\mathrm{M} + \sum_{j=k+1}^{n} (1\,\mathrm{M} + 1\,\mathrm{A}) \Big) \\
  &= \sum_{k=1}^{n-1} \sum_{i=k+1}^{n} \big( 1 + 2(n-k) \big) \\
  &= \sum_{k=1}^{n-1} \sum_{i=k+1}^{n} (1 + 2n - 2k) \\
  &= \sum_{k=1}^{n-1} (1 + 2n - 2k)(n - k) \\
  &= \sum_{k=1}^{n-1} \big( (n + 2n^2) - k(2n + 1 + 2n) + 2k^2 \big) \\
  &= (n-1)(n + 2n^2) - (4n+1)\frac{(n-1)n}{2} + \frac{2n(n-1)(2n-1)}{6} \\
  &= \Big( 2 - 2 + \frac{2}{3} \Big) n^3 + O(n^2) \\
  &= \frac{2}{3} n^3 + O(n^2) \ \text{flops}.
\end{align*}

As expected, the dominant term in the expression for the computational work is proportional to $n^3$, since LU decomposition entails three nested loops that are of (average) length proportional to $n$, roughly speaking. We say that the computational complexity of LU decomposition is cubic in the number of unknowns, $n$. For example, for the 2D model problem of Eq. (1.6), with $n = N^2$, we have $W = O(n^3) = O(N^6)$. For large problems, cubic complexity is often prohibitive, and we will seek to exploit structural properties like sparsity to obtain methods with lower computational complexity.

A similar computation shows that forward substitution $L\vec{y} = \vec{b}$ and backward substitution $U\vec{x} = \vec{y}$ each have computational work
\[ W = n^2 + O(n) \ \text{flops}. \]

LU Decomposition for Symmetric Positive Definite Matrices

Finally, we note that if the matrix $A$ is symmetric positive definite (SPD), pivoting is never required in the LU decomposition and the symmetry can be exploited to save about half the work.

Theorem 2.9
If $A \in \mathbb{R}^{n \times n}$ is SPD, the decomposition $A = LU$, where $L$ is unit lower triangular and $U$ is upper triangular, exists and is unique.

The above theorem implies that no zero pivot elements can occur in the LU decomposition algorithm for SPD matrices (in exact arithmetic).
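Returning to the operation count for the kij version derived earlier in this section, the result can be checked by simply counting the operations performed by the three nested loops. The sketch below is an illustration only (the function names are not from the notes). It counts one flop for each multiplier and two flops for each inner-loop update, and compares the total with the exact sum $\sum_{p=1}^{n-1} (p + 2p^2)$, where $p = n - k$, and with the leading term $\frac{2}{3}n^3$.

```python
def lu_flops_by_loops(n):
    """Count flops of the kij LU algorithm by running its loop structure."""
    count = 0
    for k in range(1, n):                  # pivot k = 1, ..., n-1
        for i in range(k + 1, n + 1):      # row i = k+1, ..., n
            count += 1                     # one division for the multiplier
            for j in range(k + 1, n + 1):  # column j = k+1, ..., n
                count += 2                 # one multiplication and one subtraction
    return count

def lu_flops_closed_form(n):
    """Exact value of sum_{p=1}^{n-1} (p + 2 p^2), using the summation identities."""
    return n * (n - 1) // 2 + n * (n - 1) * (2 * n - 1) // 3

for n in (3, 10, 100, 200):
    w = lu_flops_by_loops(n)
    print(n, w, lu_flops_closed_form(n), w / (2 * n**3 / 3))
```

The ratio in the last column approaches 1 as $n$ grows, in line with $W = \frac{2}{3}n^3 + O(n^2)$.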
Theorem 2.10: Cholesky decomposition
If $A \in \mathbb{R}^{n \times n}$ is SPD, the decomposition $A = \hat{L}\hat{L}^T$, where $\hat{L}$ is a lower triangular matrix with strictly positive diagonal elements, exists and is unique.

In fact, it can be shown that $\hat{L} = L\sqrt{D}$ and $\hat{L}^T = \sqrt{D}^{-1} U$, where $D$ is the diagonal matrix containing the diagonal elements of $U$, which are strictly positive for an SPD matrix, so that their square roots can be taken in the diagonal matrix $\sqrt{D}$. The work to compute the Cholesky decomposition is $W = \frac{1}{3} n^3 + O(n^2)$.

2.2 Banded LU Decomposition

In this section, we consider special versions of the LU algorithm that save work for sparse matrices that are zero outside a band around the diagonal.

Definition 2.11
A banded matrix $A \in \mathbb{R}^{n \times n}$ is a sparse matrix whose nonzero entries are confined to a band around the main diagonal. I.e., $\exists B < n$ s.t. $a_{ij} = 0 \ \forall i, j$ s.t. $|i - j| > B$. The smallest such $B$ is called the bandwidth of $A$.

For example, for a diagonal matrix we have $B = 0$. For a tridiagonal matrix, we have $B = 1$. For our 2D model problem, we have $B = N$.

It turns out that, if $A$ has bandwidth $B$, then we need to compute the $U$ and $L$ factors only within the band. This can be proved formally, but it can also be seen intuitively by considering, e.g., the kij version as in Algorithm 2.7. First, the statement a(i,k)=a(i,k)/a(k,k) cannot create new nonzeros. Second, the statement a(i,j)=a(i,j)-a(i,k)*a(k,j) can only create new nonzeros if both the multiplier a(i,k) and the element in the pivot row a(k,j) are nonzero. It turns out that a(i,k)=0 outside the band on the left for all rows i, and a(k,j)=0 outside the band on the right for all pivot rows k. So the banded structure maintains additional zero elements in $L$ and $U$ according to
\[ l_{ij} = 0 \ \text{if} \ i - j > B, \qquad u_{ij} = 0 \ \text{if} \ j - i > B, \]
so nonzeros in row $i$ of $L$ don't occur before column $j = i - B$, and nonzeros in row $i$ of $U$ don't occur after column $j = i + B$.
This means that, for banded matrices with bandwidth $B$, we can safely modify the ranges of the loops in the ikj version of the LU algorithm as follows:

Algorithm 2.12: Banded LU decomposition, ikj version, in-place
Input: Matrix A with bandwidth B
Output: L and U stored in A

    n = size(A,1);
    for i = 2:n                          % row i
        for k = max(1,i-B):i-1           % pivot k
            A(i,k) = A(i,k)/A(k,k);
            for j = k+1:min(i+B,n)       % column j
                A(i,j) = A(i,j) - A(i,k)*A(k,j);
            end
        end
    end

Computational Work for Banded LU Decomposition

The amount of computational work for banded LU can be estimated as follows. We sum over the three nested loops and obtain an upper bound for the work:
\begin{align*}
W &= \sum_{i=2}^{n} \sum_{k=\max(1,i-B)}^{i-1} \Big( 1 + \sum_{j=k+1}^{\min(i+B,n)} 2 \Big) \ \text{flops} \\
  &\le \sum_{i=2}^{n} \sum_{k=\max(1,i-B)}^{i-1} \Big( 1 + \sum_{j=k+1}^{i+B} 2 \Big) \\
  &= \sum_{i=2}^{n} \sum_{k=\max(1,i-B)}^{i-1} \big( 1 + 2(i + B - k) \big) \\
  &\le \sum_{i=2}^{n} \sum_{k=i-B}^{i-1} \big( 1 + 2(i+B) - 2k \big) \\
  &= \sum_{i=2}^{n} \Big[ B\big(1 + 2(i+B)\big) - 2 \sum_{k=i-B}^{i-1} k \Big] \\
  &= \sum_{i=2}^{n} \Big[ B\big(1 + 2(i+B)\big) - 2 \Big( \sum_{k=1}^{i-1} k - \sum_{k=1}^{i-B-1} k \Big) \Big] \\
  &= \sum_{i=2}^{n} \Big[ B\big(1 + 2(i+B)\big) - \big( i(i-1) - (i-B)(i-B-1) \big) \Big] \\
  &= \sum_{i=2}^{n} \Big[ B(1 + 2B + 2i) - (2Bi - B - B^2) \Big] \\
  &= \sum_{i=2}^{n} (3B^2 + 2B) \ \le \ n(3B^2 + 2B),
\end{align*}
so $W = O(B^2 n)$.

Notes:
• For the 1D model problem, $B = 1$, so we get $W \le n(3 + 2) = 5n$, i.e., $W = O(n)$. (This boils down to the so-called Thomas algorithm.)
• For the 2D model problem, with $n = N^2$, we have $B = N$, so $W = O(B^2 n) = O(N^2 n) = O(n^2) = O(N^4)$, which is much better than the $W = O(n^3) = O(N^6)$ cost of the regular LU algorithm. (E.g., compare for $N = 10^3$: you save a factor $10^6$ in work.)
• Further improvements in cost for the 2D model problem can be obtained using more advanced techniques, which reorder the variables and equations to minimise the bandwidth, or, more generally, to minimise the fill-in (i.e., the creation of new nonzeros) in the $L$ and $U$ factors. For example, the so-called nested dissection algorithm obtains $W = O(n^{3/2})$ for the 2D model problem.
Still, it is possible to do better (up to $W = O(n)$) using iterative methods. Direct methods like Gaussian elimination solve $A\vec{x} = \vec{b}$ exactly (in exact arithmetic) after $n$ steps. Iterative methods instead start from an initial guess $\vec{x}_0$ that is improved step by step until a desired accuracy is reached, typically in a number of steps that is much smaller than $n$. These iterative methods are the subject of the last three chapters in Part I of these notes.

2.3 Matrix Norms

In order to discuss accuracy and stability of algorithms for solving linear systems, we need to define ways to measure the size of a matrix. For this reason, we consider the following matrix norms.

2.3.1 Definition of Matrix Norms

Definition 2.13: Natural or Vector-Induced Matrix Norm
Let $\|\cdot\|_p$ be a vector $p$-norm. Then for $A \in \mathbb{R}^{n \times n}$, the matrix norm induced by the vector norm is given by
\[ \|A\|_p = \max_{\vec{x} \neq 0} \frac{\|A\vec{x}\|_p}{\|\vec{x}\|_p}. \]

Note: alternatively, we may also write
\[ \|A\|_p = \max_{\|\vec{x}\|_p = 1} \|A\vec{x}\|_p. \]

Theorem 2.14
Let $A \in \mathbb{R}^{n \times n}$. The vector-induced matrix norm function $\|A\|_p$ is a norm on the vector space of real $n \times n$ matrices over $\mathbb{R}$. That is, $\forall A, B \in \mathbb{R}^{n \times n}$ and $\forall a \in \mathbb{R}$, the following hold:
1. $\|A\| \ge 0$, and $\|A\| = 0$ iff $A = 0$
2. $\|aA\| = |a| \, \|A\|$
3. $\|A + B\| \le \|A\| + \|B\|$.

In addition, the following properties also hold:

Theorem 2.15
1. $\|A\vec{x}\|_p \le \|A\|_p \|\vec{x}\|_p$
2. $\|AB\|_p \le \|A\|_p \|B\|_p$.

Here we only prove part 1.

Proof. If $\vec{x} = 0$, the inequality holds. For any $\vec{x} \neq 0$, we have
\[ \frac{\|A\vec{x}\|_p}{\|\vec{x}\|_p} \le \max_{\vec{x} \neq 0} \frac{\|A\vec{x}\|_p}{\|\vec{x}\|_p} = \|A\|_p, \]
by the definition of the matrix norm. Hence, $\|A\vec{x}\|_p \le \|A\|_p \|\vec{x}\|_p$.

Note: the Frobenius norm introduced in Def. 1.5 is an example of a matrix norm that is not induced by a vector norm.

2.3.2 Matrix Norm Formulas

We can derive the following specific expressions for some commonly used matrix $p$-norms.

Theorem 2.16
Let $A \in \mathbb{R}^{n \times n}$.
1. $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$ ("maximum absolute row sum")
2.
$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|$ ("maximum absolute column sum")
3. $\|A\|_2 = \max_{1 \le i \le n} \sqrt{\lambda_i(A^T A)} = \max_{1 \le i \le n} \sqrt{\lambda_i(A A^T)} = \max_{1 \le i \le n} \sigma_i$,
where $\lambda_i(A^T A)$ are the eigenvalues of $A^T A$ and $\sigma_i$ are the singular values of $A$.

Here we only prove part 1.

Proof. We will derive the formula for the matrix infinity norm using the second variant of the definition,
\[ \|A\|_\infty = \max_{\|\vec{x}\|_\infty = 1} \|A\vec{x}\|_\infty. \]
Also, observe that $\|\vec{x}\|_\infty = 1$ iff $\max_{1 \le i \le n} |x_i| = 1$. Let
\[ r = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \quad \text{(maximum absolute row sum)}. \]
We first show that $\|A\|_\infty \le r$. This follows from $\|A\vec{x}\|_\infty \le r$ if $\|\vec{x}\|_\infty = 1$, since then
\[ |(A\vec{x})_i| = \Big| \sum_{j=1}^{n} a_{ij} x_j \Big| \le \sum_{j=1}^{n} |a_{ij}| \, |x_j| \le \sum_{j=1}^{n} |a_{ij}| \le r \]
for any $i$. Now, to show that $\|A\|_\infty = r$, it is sufficient to find a specific $\vec{y}$ s.t. $\|\vec{y}\|_\infty = 1$ and $\|A\vec{y}\|_\infty = r$. Let $\nu$ be the index of a row in $A$ with maximum absolute row sum, meaning that
\[ \sum_{j=1}^{n} |a_{\nu j}| = r. \]
Define $\vec{y}$ as follows:
\[ y_j := \mathrm{sign}(a_{\nu j}) = \begin{cases} 1 & \text{if } a_{\nu j} > 0, \\ 0 & \text{if } a_{\nu j} = 0, \\ -1 & \text{if } a_{\nu j} < 0. \end{cases} \]
This $\vec{y}$ converts each $a_{\nu j} y_j$ into $|a_{\nu j}|$ in the formula for the $\nu$th component of the product $A\vec{y}$, so we have
\[ |(A\vec{y})_\nu| = \Big| \sum_{j=1}^{n} a_{\nu j} y_j \Big| = \Big| \sum_{j=1}^{n} |a_{\nu j}| \Big| = \sum_{j=1}^{n} |a_{\nu j}| = r. \]
Therefore $\|A\vec{y}\|_\infty = r$ with $\|\vec{y}\|_\infty = 1$, and so $\|A\|_\infty = r$.

2.3.3 Spectral Radius

Definition 2.17
Let $A \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_i$, $i = 1, \ldots, n$. The spectral radius $\rho(A)$ of $A$ is given by
\[ \rho(A) = \max_{1 \le i \le n} |\lambda_i|. \]

Theorem 2.18
Let $A \in \mathbb{R}^{n \times n}$. For any matrix $p$-norm, it holds that $\rho(A) \le \|A\|_p$.

Remark 2.19
The matrix 2-norm formula simplifies as follows when $A$ is symmetric:
\[ \|A\|_2 = \max_{1 \le i \le n} \sqrt{\lambda_i(A^T A)} = \max_{1 \le i \le n} \sqrt{\lambda_i(A^2)} = \max_{1 \le i \le n} \sqrt{\lambda_i(A)^2} = \max_{1 \le i \le n} |\lambda_i(A)| = \rho(A). \]
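The row-sum and column-sum formulas of Theorem 2.16, and the maximising vector $\vec{y}$ used in the proof of part 1, are easy to check on a small example. The following sketch is an illustration only: the $2 \times 2$ matrix and the helper names are chosen here and do not come from the notes.

```python
def norm_inf(A):
    """Matrix infinity norm: maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def norm_1(A):
    """Matrix 1-norm: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def sign(a):
    """Return 1, 0 or -1 according to the sign of a."""
    return (a > 0) - (a < 0)

def maximising_vector(A):
    """Vector y from the proof: the signs of a row with maximum absolute row sum."""
    nu = max(range(len(A)), key=lambda i: sum(abs(a) for a in A[i]))
    return [sign(a) for a in A[nu]]

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

A = [[1, -2], [3, 4]]                     # hypothetical example matrix
y = maximising_vector(A)
Ay_inf = max(abs(v) for v in matvec(A, y))
print("inf-norm:", norm_inf(A), " 1-norm:", norm_1(A))
print("y =", y, " ||A y||_inf =", Ay_inf)
```

For this matrix the second row has the largest absolute row sum, so $\vec{y} = (1, 1)^T$, and $\|A\vec{y}\|_\infty$ equals the maximum absolute row sum, as the proof predicts.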
2.4 Floating Point Number System

2.4.1 Floating Point Numbers

Definition 2.20
The floating point number system $F(\beta, t, L, U)$ consists of the set of floating point numbers $x$ of format
\[ x = \pm d_1.d_2 d_3 \cdots d_t \ \beta^e = m \, \beta^e, \]
where $m = \pm d_1.d_2 d_3 \cdots d_t$ is called the mantissa, $\beta$ is called the base, $e$ is called the exponent, and $t$ is called the number of digits in the mantissa. The digits $d_i$ are specified by
\[ d_i \in \{0, 1, \ldots, \beta - 1\} \ (i = 2, \ldots, t), \qquad d_1 \in \{1, \ldots, \beta - 1\}, \]
and the exponent satisfies $L \le e \le U$.

Note: The mantissa is normalised, by requiring $d_1$ to be nonzero.

2.4.2 Rounding and Unit Roundoff

Definition 2.21
Let $x \in \mathbb{R}$. The rounded representation of $x$ in $F(\beta, t, L, U)$ is indicated by $\mathrm{fl}(x)$.

Most computer systems use the rounding rule round to nearest, tie to even, as in the following example. (The tie-to-even part serves to avoid bias up or down.)

Example 2.22
Consider the floating point number system $F(\beta = 10, t = 4, L = -10, U = 10)$. Some examples illustrating the round to nearest, tie to even rule:

    x = 123.749    fl(x) = 1.237 · 10^2
    x = 123.751    fl(x) = 1.238 · 10^2
    x = 123.750    fl(x) = 1.238 · 10^2    (tie!)
    x = 123.850    fl(x) = 1.238 · 10^2    (tie!)

Theorem 2.23
Consider a floating point number system $F(\beta, t, L, U)$ with a rounding-to-nearest rule. Let $\mathrm{fl}(x)$ be the rounded representation of $x \in \mathbb{R}$, $x \neq 0$. Then the relative error in the representation of $x$ is bounded by
\[ \frac{|x - \mathrm{fl}(x)|}{|x|} \le \mu = \frac{1}{2} \beta^{-t+1}. \]
Here, $\mu$ is called the unit roundoff (also, sometimes, machine precision or machine epsilon).

Proof. Let $x = m \, \beta^e$ and $\mathrm{fl}(x) = \bar{m} \, \beta^e$. Since
\[ m = \pm d_1.d_2 d_3 \cdots d_t d_{t+1} \ldots = \pm \left( d_1 + d_2 \beta^{-1} + d_3 \beta^{-2} + \cdots + d_t \beta^{-t+1} + d_{t+1} \beta^{-t} + \ldots \right), \]
and rounding to nearest with $t$ digits is used, we have
\[ |m - \bar{m}| \le \frac{1}{2} \beta^{-t+1}, \]
so
\[ |x - \mathrm{fl}(x)| \le \frac{1}{2} \beta^{-t+1} \beta^e, \]
or
\[ \frac{|x - \mathrm{fl}(x)|}{|x|} \le \frac{\frac{1}{2} \beta^{-t+1} \beta^e}{|m| \, \beta^e} \le \frac{1}{2} \beta^{-t+1} = \mu, \]
because $|m| \ge 1$.

Note: We can also write
\[ \mathrm{fl}(x) = x(1 + \nu) \quad \text{with } |\nu| \le \mu, \]
because $\nu = (\mathrm{fl}(x) - x)/x$, so $|\nu| \le \mu$.
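The rounding rule of Example 2.22 and the bound of Theorem 2.23 can be reproduced with Python's standard `decimal` module, which supports base-10 arithmetic with a chosen number of significant digits and round-half-even. This is a sketch under simplifying assumptions: the helper name `fl` is chosen here, and the exponent limits $L$ and $U$ of the number system are ignored.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def fl(x, t=4):
    """Round the decimal string x to t significant digits, ties to even.

    Models F(beta=10, t, L, U) while ignoring the exponent range [L, U].
    """
    ctx = Context(prec=t, rounding=ROUND_HALF_EVEN)
    return ctx.create_decimal(x)

mu = Decimal("0.0005")                    # unit roundoff 0.5 * 10**(-t+1) for t = 4

for s in ["123.749", "123.751", "123.750", "123.850"]:
    x = Decimal(s)
    rel_err = abs(x - fl(s)) / abs(x)     # relative representation error
    print(s, "->", fl(s), "  relative error <= mu:", rel_err <= mu)
```

Both ties, 123.750 and 123.850, are rounded to 123.8, since 8 is the even neighbouring digit in each case, and all four relative errors stay below $\mu$.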
2.4.3 IEEE Double Precision Numbers

The IEEE double precision standard is used on most computers for representing floating point numbers in hardware and carrying out computations with them. For instance, Matlab normally uses double precision numbers. Higher precision numbers can be represented in software, but are much slower to work with than the native hardware representations.

Example 2.24: IEEE Double Precision Numbers
The IEEE double precision floating point number system is based on $F(\beta = 2, t = 53, L = -1022, U = 1023)$. It is a binary system with 53 digits in the mantissa, and an exponent range from $-1022$ to $1023$. It represents numbers in the format
\[ x = 1.01001 \cdots 001 \ 2^e = m \, \beta^e. \]
Here, the first digit of the mantissa $m = \pm 1.f$ does not need to be stored because it is always 1 (due to the normalisation). The fraction $f$ has 52 digits. The sign of the mantissa is stored in a sign bit $s$. A shifted form of the exponent is stored: $E = e + 1023$, such that $E$ is an integer between 1 and 2046, which can be represented by 11 bits ($2^{11} = 2048$).

In total, storing an IEEE double precision number in computer memory requires 64 bits (i.e., 8 bytes):

    field    s        f          E
    width    1 bit    52 bits    11 bits

Numbers with $E$ in the range $1 \le E \le 2046$ represent the standard normalised numbers. The values $E = 0$ and $E = 2047$ are used to represent special numbers:

    1 ≤ E ≤ 2046 :  x = (−1)^s (1.f) 2^(E−1023)   (normalised numbers)
    E = 2047 :      f ≠ 0  ⟹  x = NaN             (not a number, e.g. 0/0)
                    f = 0  ⟹  x = (−1)^s Inf      (infinity, e.g. 1/0)
    E = 0 :         f = 0  ⟹  x = 0
                    f ≠ 0  ⟹  denormalised numbers (mantissa is not normalised),
                               e.g. x = 0.0001011010...0 · 2^(−1022)

With $\beta = 2$ and $t = 53$, the unit roundoff is
\[ \mu = \frac{1}{2} \beta^{-t+1} = \frac{1}{2} \, 2^{-53+1} = 2^{-53} \approx 1.1 \cdot 10^{-16}, \]
which is roughly equivalent to $\beta = 10$, $t = 16$ (then $\mu = 0.5 \cdot 10^{-16+1} = 5 \cdot 10^{-16}$). We say that double precision binary numbers have between 16 and 17 decimal digits of (relative) accuracy.
The smallest positive nonzero (normalised) number (realmin in Matlab) is
\[ 1.0 \ldots 0 \ 2^{-1022} \approx 2.2 \cdot 10^{-308}, \]
and the largest positive number (realmax in Matlab) is
\[ 1.1 \ldots 1 \ 2^{1023} = (2 - 2^{-52}) \, 2^{1023} \approx 2^{1024} \approx 1.8 \cdot 10^{308}. \]
Note: in Matlab, eps is the distance from 1 to the next larger floating point number. We have eps $= 2\mu$.

2.4.4 Rounding and Basic Arithmetic Operations

Basic arithmetic operations such as addition, subtraction, multiplication, division, and square root are implemented in computer hardware such that the rounded representation of the exact result is obtained. (This is achieved by using additional digits of precision when computing intermediate results.)

More precisely, assume $x$ and $y$ are floating point numbers stored in computer memory, after rounding (i.e., $x = \mathrm{fl}(x)$ and $y = \mathrm{fl}(y)$). Let $\overline{x + y}$ be the result computed and stored by the computer (after rounding). Then the IEEE standard requires that the $+$ operation be implemented in computer hardware such that
\[ \overline{x + y} = \mathrm{fl}(x + y), \]
i.e., the result of $x + y$ evaluated on the computer is the exact $x + y$, rounded to its floating point representation. This is a stringent requirement! It also implies
\[ \overline{x + y} = (x + y)(1 + \nu) \quad \text{with } |\nu| \le \mu. \]
Similarly, we have
\[ \overline{x - y} = \mathrm{fl}(x - y), \qquad \overline{x * y} = \mathrm{fl}(x * y), \qquad \overline{x / y} = \mathrm{fl}(x / y), \qquad \overline{\sqrt{x}} = \mathrm{fl}(\sqrt{x}). \]

Other standard functions like $\sin(x)$ and $\exp(x)$ are typically implemented in software, and don't have the same accuracy guarantees. When they are evaluated, we can normally assume that the computed values (again indicated by an overline) satisfy relative error bounds like
\[ \overline{\sin(x)} = \sin(x)(1 + c_1 \nu), \qquad \overline{\exp(x)} = \exp(x)(1 + c_2 \nu), \]
with $|\nu| \le \mu$ and $c_1$ and $c_2$ constants not much larger than 1; see, e.g., https://blogs.mathworks.com/cleve/2017/01/23/ulps-plots-reveal-math-function-accurary.
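The quantities of Example 2.24 can be inspected directly in Python, whose `float` type is an IEEE double on essentially all current platforms (this is assumed in the sketch below). The helper `fields` is a name introduced here for illustration: it extracts the sign bit $s$, the shifted exponent $E$ and the fraction $f$ from the 64-bit pattern using the standard `struct` module.

```python
import struct
import sys

def fields(x):
    """Return (s, E, f): sign bit, 11-bit shifted exponent, 52-bit fraction of x."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63
    E = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    return s, E, f

mu = 2.0 ** -53                            # unit roundoff of IEEE double precision

print("fields(1.0)  =", fields(1.0))       # e = 0, so E = 1023
print("fields(-2.0) =", fields(-2.0))      # e = 1, so E = 1024
print("fields(inf)  =", fields(float("inf")))
print("eps == 2*mu        :", sys.float_info.epsilon == 2 * mu)
print("realmin == 2^-1022 :", sys.float_info.min == 2.0 ** -1022)
print("realmax formula    :", sys.float_info.max == (2 - 2.0 ** -52) * 2.0 ** 1023)
print("1 + mu == 1        :", 1.0 + mu == 1.0)      # tie, rounded to even
print("1 + 2*mu > 1       :", 1.0 + 2 * mu > 1.0)
```

The last two lines illustrate the requirement $\overline{x + y} = \mathrm{fl}(x + y)$: the exact sum $1 + \mu$ lies halfway between two neighbouring doubles and is rounded back to 1, whereas $1 + 2\mu$ is itself representable.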
2.5 Conditioning of a Mathematical Problem

2.5.1 Conditioning of a Mathematical Problem

Consider the mathematical problem P of finding output $\vec{z}$ from input $\vec{x}$, with the relation between $\vec{z}$ and $\vec{x}$ given by the function $f$:

Problem 2.25: Mathematical Problem P
P: $\vec{z} = f(\vec{x})$

The concept of "conditioning" of problem P relates to the sensitivity of $\vec{z}$ to changes in $\vec{x}$. We perturb $\vec{x}$ by $\Delta\vec{x}$ and investigate the effect of this perturbation on $\vec{z}$:
\[ \vec{z} + \Delta\vec{z} = f(\vec{x} + \Delta\vec{x}). \]

Definition 2.26
Consider mathematical problem P: $\vec{z} = f(\vec{x})$ with perturbed input: $\vec{z} + \Delta\vec{z} = f(\vec{x} + \Delta\vec{x})$.
1. Problem P is called ill-conditioned with respect to absolute errors if the absolute condition number
\[ \kappa_A = \frac{\|\Delta\vec{z}\|}{\|\Delta\vec{x}\|} \quad (\Delta\vec{x} \neq 0) \]
satisfies $\kappa_A \gg 1$. P is called well-conditioned otherwise.
2. Problem P is called ill-conditioned with respect to relative errors if the relative condition number
\[ \kappa_R = \frac{\|\Delta\vec{z}\| / \|\vec{z}\|}{\|\Delta\vec{x}\| / \|\vec{x}\|} \quad (\Delta\vec{x} \neq 0, \ \vec{z} \neq 0, \ \vec{x} \neq 0) \]
satisfies $\kappa_R \gg 1$. P is called well-conditioned otherwise.

Note: Ill-conditioning is often considered relative to the precision of the computer and number system being used. For example, for double precision numbers, the unit roundoff is $\mu \approx 1.1 \cdot 10^{-16}$, indicating that number representation and elementary computations have a relative accuracy of 16 decimal digits. If the problem is ill-conditioned with $\kappa_R \approx 1/\mu \approx 10^{16}$, you cannot expect any correct digits in your computation. If $\kappa_R \approx \sqrt{1/\mu} \approx 10^{8}$, you can expect about half of the digits in the computed result to be correct (if you use an algorithm that is numerically stable, see the next section). If $\kappa_R \approx 1$, you can expect almost all digits to be correct when using a stable algorithm.

Note: We did not specify in which norm to evaluate the condition numbers. Depending on the problem, some norms may be easier to work with than others.
2.5.2 Conditioning of Elementary Operations

Example 2.27: Conditioning of the Sum Operation
We investigate the conditioning of mathematical problem P: $z = x + y$. We have
\[ z + \Delta z = x + \Delta x + y + \Delta y, \]
leading to
\[ \Delta z = \Delta x + \Delta y. \]
Using the 1-norm, we find for the absolute condition number
\[ \kappa_A = \frac{|\Delta z|}{\|(\Delta x, \Delta y)\|_1} = \frac{|\Delta z|}{|\Delta x| + |\Delta y|} = \frac{|\Delta x + \Delta y|}{|\Delta x| + |\Delta y|} \le \frac{|\Delta x| + |\Delta y|}{|\Delta x| + |\Delta y|} = 1, \]
so addition is well-conditioned w.r.t. the absolute error: the absolute error in $z$ is never much larger than the absolute errors in $x$ or $y$.

However, again using the 1-norm, we find for the relative condition number
\[ \kappa_R = \frac{|\Delta z| / |z|}{\|(\Delta x, \Delta y)\|_1 / \|(x, y)\|_1} = \frac{|\Delta x + \Delta y| / |x + y|}{\|(\Delta x, \Delta y)\|_1 / \|(x, y)\|_1} = \frac{|x| + |y|}{|x + y|} \, \frac{|\Delta x + \Delta y|}{|\Delta x| + |\Delta y|} \le \frac{|x| + |y|}{|x + y|}. \]
The upper bound for $\kappa_R$ shows that the problem is well-conditioned as long as $x + y \not\approx 0$. However, the relative condition number can be arbitrarily large when $x + y \approx 0$, i.e., when one subtracts two numbers of almost equal size, $x \approx -y$. In this case, the relative error in $z$ can be much greater than the relative error in $x$ and $y$. When $x \approx -y$, addition is ill-conditioned w.r.t. the relative error. This blow-up of the relative error, and the loss of relative accuracy that goes along with it, is referred to as catastrophic cancellation.

Example 2.28: An Example of Catastrophic Cancellation
Compute $z = x + y$ with

    x =  1.000002,   Δx =  10^−6,       x + Δx =  1.000003,   |Δx|/|x| ≈ 10^−6,
    y = −1.000013,   Δy = −2 · 10^−6,   y + Δy = −1.000015,   |Δy|/|y| ≈ 2 · 10^−6,

where $\Delta x$ and $\Delta y$ may be due, for example, to floating point rounding on a computer. We have

    z = x + y = −0.000011,
    Δz = Δx + Δy = −10^−6,
    z + Δz = −0.000012,

so
\[ \frac{|\Delta z|}{|z|} \approx 0.09, \]
i.e., we have a 9% relative error in $z$, whereas the relative error in $x$ and $y$ was only of the order of 0.0001%. This blow-up in relative error is due to catastrophic cancellation.
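The numbers in Example 2.28 can be verified in exact rational arithmetic, so that the only errors present are the prescribed perturbations $\Delta x$ and $\Delta y$. The sketch below is illustrative (variable names are chosen here); it also evaluates the relative condition number $\kappa_R$ in the 1-norm and its upper bound $(|x| + |y|)/|x + y|$ from Example 2.27.

```python
from fractions import Fraction as F

x, dx = F("1.000002"), F("0.000001")
y, dy = F("-1.000013"), F("-0.000002")

z = x + y                                  # exact result
dz = dx + dy                               # perturbation of the result

rel_x = abs(dx) / abs(x)
rel_y = abs(dy) / abs(y)
rel_z = abs(dz) / abs(z)

# Relative condition number in the 1-norm, and its bound from Example 2.27
kappa_R = rel_z / ((abs(dx) + abs(dy)) / (abs(x) + abs(y)))
bound = (abs(x) + abs(y)) / abs(x + y)

print("relative error in x:", float(rel_x))
print("relative error in y:", float(rel_y))
print("relative error in z:", float(rel_z))
print("kappa_R =", float(kappa_R), " bound =", float(bound))
```

The relative error in $z$ is exactly $1/11 \approx 0.09$, several orders of magnitude larger than the relative errors in the inputs, and $\kappa_R$ is large but stays below its bound.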
Example 2.29: A Second Example of Catastrophic Cancellation
In the context of perturbations due to rounding in a floating point system, we can consider the following example of catastrophic cancellation. Consider the floating point system $F(\beta = 10, t = 5, L = -10, U = 10)$, with $t = 5$ digits in the mantissa and unit roundoff
\[ \mu = \frac{1}{2} \beta^{-t+1} = 0.00005. \]
We compute $z = x - y$ for the following numbers $x \approx y$ with rounded floating point representations $\bar{x} = \mathrm{fl}(x)$ and $\bar{y} = \mathrm{fl}(y)$:

    x = 1.23456789,   x̄ = fl(x) = 1.2346,
    y = 1.23111111,   ȳ = fl(y) = 1.2311.

The absolute and relative errors in $x$ and $y$ due to rounding are
\[ \Delta x = \mathrm{fl}(x) - x \approx 3.2 \cdot 10^{-5}, \qquad \frac{|\Delta x|}{|x|} \approx 2.6 \cdot 10^{-5} \le \mu, \]
\[ \Delta y = \mathrm{fl}(y) - y \approx -1.1 \cdot 10^{-5}, \qquad \frac{|\Delta y|}{|y|} \approx 9.0 \cdot 10^{-6} \le \mu. \]
Computing the difference of $x$ and $y$, we have for the exact $z$ and the floating point result $\bar{z}$:

    z = x − y = 0.00345678,
    z̄ = fl(fl(x) − fl(y)) = fl(0.0035) = 0.0035,

so
\[ \Delta z = \bar{z} - z \approx 4.3 \cdot 10^{-5}, \qquad \frac{|\Delta z|}{|z|} \approx 0.013, \]
i.e., we obtain a result with a 1% relative error in $z$, whereas the relative error in $x$ and $y$ was only of the order of 0.005%. This blow-up of the relative error is due to catastrophic cancellation. Equivalently, we can see that we only have two correct digits in $\bar{z}$, while we had 5 correct digits in $\bar{x}$ and $\bar{y}$, and the computer used can represent 5 correct digits. So when computing $\bar{z}$, 3 of the 5 digits in relative accuracy were lost due to catastrophic cancellation.

Something to remember . . .
When devising numerical algorithms, avoid steps where two almost equal numbers are subtracted, if you can. (This ill-conditioned step in the algorithm may cause the algorithm to be numerically unstable, due to blow-up of the relative error, as explained in Section 2.6.)

Example 2.30: Conditioning of the Division Operation
We investigate the conditioning of mathematical problem P:
\[ z = \frac{x}{y} \quad (y \neq 0). \]
We have
\[ z + \Delta z = \frac{x + \Delta x}{y + \Delta y}, \]
or
\[ \Delta z = -z + \frac{x(1 + \Delta x / x)}{y(1 + \Delta y / y)}, \]
which leads to
\[ \frac{\Delta z}{z} = -1 + \frac{1 + \Delta x / x}{1 + \Delta y / y} = \frac{-1 - \Delta y / y + 1 + \Delta x / x}{1 + \Delta y / y} \]
or
\[ \frac{\Delta z}{z} = \frac{\dfrac{\Delta x}{x} - \dfrac{\Delta y}{y}}{1 + \dfrac{\Delta y}{y}}. \tag{2.2} \]
In terms of relative conditioning, Eq. (2.2) shows immediately that $\Delta z / z$ can only be large if $\Delta x / x$ or $\Delta y / y$ are large, which means that the relative error does not blow up in a division operation and the problem is well-conditioned. (Note that the relative condition number
\[ \kappa_R = \frac{|\Delta z| / |z|}{\|(\Delta x, \Delta y)\| / \|(x, y)\|} \]
does not easily lead to a useful bound in this case.)

In terms of absolute conditioning, however, we have
\[ \kappa_A = \frac{|\Delta z|}{\|(\Delta x, \Delta y)\|} = \left| \frac{x}{y} \right| \frac{\left| \dfrac{\Delta x}{x} - \dfrac{\Delta y}{y} \right|}{\left| 1 + \dfrac{\Delta y}{y} \right| \left( |\Delta x| + |\Delta y| \right)}. \]
Assuming that $\Delta x / x$ and $\Delta y / y$ are small, $\kappa_A$ can be arbitrarily large if $y$ approaches 0. This means that the absolute error may blow up if $y \approx 0$ (as can also be seen directly from Eq. (2.2)). Note that large $|x|$ may also lead to large $\kappa_A$, but if $x$ is large, $|\Delta x|$ can often also be expected to be large in proportion to $x$ (in particular, if $|\Delta x|$ is due to rounding in a floating point number system), which would make $\kappa_A$ small again.

In summary, division is ill-conditioned with respect to absolute error when $y \approx 0$; in that case the absolute error of the result blows up. (Note that this can be seen very easily by considering division by a small $y$ without error: in that case
\[ z + \Delta z = \frac{x + \Delta x}{y}, \]
so $\Delta z = \Delta x / y$, and $\Delta z$ clearly blows up when $y \approx 0$.)

Example 2.31: Ill-Conditioning when Dividing by a Small Number
Compute $z = x / y$ with

    x = 1,        Δx = 10^−3,    x + Δx = 1.001,     |Δx|/|x| = 10^−3,
    y = 10^−6,    Δy = 10^−12,   y + Δy ≈ 10^−6,     |Δy|/|y| = 10^−6.

Then

    z = x/y = 10^6,
    z + Δz = (x + Δx)/(y + Δy) ≈ 10^6 + 10^3,

so
\[ \Delta z \approx 10^{3} \approx \frac{\Delta x}{y}, \]
while $\Delta x = 10^{-3}$, i.e., the absolute error in $x / y$ is $10^6$ times greater than the absolute error in $x$.

Something to remember . . .
When devising numerical algorithms, avoid steps where you divide by a number that is small in absolute value, if you can. (This ill-conditioned step in the algorithm may cause the algorithm to be numerically unstable, due to blow-up of the absolute error, see also Section 2.6.)

2.5.3 Conditioning of Solving a Linear System

We investigate the conditioning of solving the linear system $A\vec{x} = \vec{b}$ for $\vec{x}$, given $A$ and $\vec{b}$.

Example 2.32: Conditioning of $A\vec{x} = \vec{b}$, case $\Delta A = 0$, $\Delta\vec{b} \neq 0$
We consider mathematical problem P: $\vec{x} = A^{-1}\vec{b} = f(A, \vec{b})$. We perturb $A$ and $\vec{b}$ in
\[ (A + \Delta A)(\vec{x} + \Delta\vec{x}) = \vec{b} + \Delta\vec{b}. \]
For simplicity, we first consider the case that $\Delta A = 0$, $\Delta\vec{b} \neq 0$. In this case, we have
\[ A(\vec{x} + \Delta\vec{x}) = \vec{b} + \Delta\vec{b} \]
or
\[ A \Delta\vec{x} = \Delta\vec{b}. \]
We want to find an upper bound for
\[ \kappa_R = \frac{\|\Delta\vec{x}\| / \|\vec{x}\|}{\|\Delta\vec{b}\| / \|\vec{b}\|}. \tag{2.3} \]
From $\Delta\vec{x} = A^{-1}\Delta\vec{b}$ we have
\[ \|\Delta\vec{x}\| = \|A^{-1}\Delta\vec{b}\| \le \|A^{-1}\| \, \|\Delta\vec{b}\|, \tag{2.4} \]
and from $A\vec{x} = \vec{b}$ we have $\|\vec{b}\| = \|A\vec{x}\| \le \|A\| \, \|\vec{x}\|$, or
\[ \frac{1}{\|\vec{x}\|} \le \frac{\|A\|}{\|\vec{b}\|}. \tag{2.5} \]
Plugging Eqs. (2.4) and (2.5) into Eq. (2.3), we obtain the upper bound
\[ \kappa_R \le \frac{\|A^{-1}\| \, \|\Delta\vec{b}\| \, \dfrac{\|A\|}{\|\vec{b}\|}}{\dfrac{\|\Delta\vec{b}\|}{\|\vec{b}\|}} = \|A\| \, \|A^{-1}\|. \tag{2.6} \]

Definition 2.33: Matrix Condition Number
Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then
\[ \kappa(A) = \|A\| \, \|A^{-1}\| \]
is called the condition number of $A$.

Theorem 2.34
Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. The following property holds:
\[ \kappa(A) = \|A\| \, \|A^{-1}\| \ge 1. \]

Proof. This simply follows from
\[ 1 = \|I\| = \|A A^{-1}\| \le \|A\| \, \|A^{-1}\| = \kappa(A), \]
for any vector-induced matrix norm.

We see that the relative condition number of problem P: $\vec{x} = A^{-1}\vec{b}$ is bounded above by the matrix condition number, $\|A\| \, \|A^{-1}\|$ (if we assume $\Delta A = 0$). The matrix condition number also appears in a bound for $\kappa_R$ for the general problem when both $A$ and $\vec{b}$ are perturbed:

Example 2.35: Conditioning of $A\vec{x} = \vec{b}$, case $\Delta A \neq 0$, $\Delta\vec{b} \neq 0$
We consider mathematical problem P: $\vec{x} = A^{-1}\vec{b} = f(A, \vec{b})$, perturbing $A$ and $\vec{b}$ as in
\[ (A + \Delta A)(\vec{x} + \Delta\vec{x}) = \vec{b} + \Delta\vec{b}. \]
It can be shown that
\[ \kappa_R = \frac{\dfrac{\|\Delta\vec{x}\|}{\|\vec{x}\|}}{\dfrac{\|\Delta A\|}{\|A\|} + \dfrac{\|\Delta\vec{b}\|}{\|\vec{b}\|}} \le \kappa(A) \, \frac{1}{1 - \tau}, \tag{2.7} \]
if $\tau = \|A^{-1}\| \, \|\Delta A\| < 1$.
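The bound (2.6) can be observed on a small, nearly singular system. The sketch below is an illustration with a hypothetical $2 \times 2$ matrix chosen here (it does not appear in the notes). It uses exact rational arithmetic and the infinity norm, so the amplification of the relative error is due to the conditioning of the problem alone, not to rounding.

```python
from fractions import Fraction as F

def norm_inf_vec(v):
    """Vector infinity norm."""
    return max(abs(c) for c in v)

def norm_inf_mat(M):
    """Matrix infinity norm: maximum absolute row sum."""
    return max(sum(abs(c) for c in row) for row in M)

def inverse_2x2(M):
    """Inverse of a nonsingular 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

A = [[F(1), F(1)], [F(1), F("1.0001")]]    # hypothetical ill-conditioned matrix
b = [F(2), F("2.0001")]
db = [F(0), F("0.0001")]                   # perturbation of the right-hand side

Ainv = inverse_2x2(A)
x = matvec(Ainv, b)                        # solution of A x = b
dx = matvec(Ainv, db)                      # solution of A dx = db

kappa = norm_inf_mat(A) * norm_inf_mat(Ainv)
kappa_R = (norm_inf_vec(dx) / norm_inf_vec(x)) / (norm_inf_vec(db) / norm_inf_vec(b))

print("x =", [str(c) for c in x], " dx =", [str(c) for c in dx])
print("kappa(A) =", float(kappa), " kappa_R =", float(kappa_R))
```

A relative perturbation of about $5 \cdot 10^{-5}$ in $\vec{b}$ changes the solution from $(1, 1)^T$ to $(0, 2)^T$, a relative change of 100%. The observed $\kappa_R = 20001$ is large and, consistent with Eq. (2.6), does not exceed $\kappa(A) \approx 4 \cdot 10^4$.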
We say that matrix $A$ is ill-conditioned when $\kappa(A) \gg 1$, and well-conditioned otherwise. Linear systems with a well-conditioned matrix can be solved accurately on computers (because rounding errors in the input do not disproportionately affect the computed result). Linear systems with ill-conditioned matrices, however, are prone to inaccurate numerical solutions on computers.

For the 2-norm matrix condition number we have the following explicit formulas:

Theorem 2.36
Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix. Then
\[ \kappa_2(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\sqrt{\lambda_{\max}(A A^T)}}{\sqrt{\lambda_{\min}(A A^T)}} = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}. \]
If $A$ is symmetric, then
\[ \kappa_2(A) = \frac{|\lambda|_{\max}(A)}{|\lambda|_{\min}(A)}. \]

2.6 Stability of a Numerical Algorithm

If a mathematical problem is well-conditioned, it should be possible in principle to obtain its solution accurately on a computer using finite-precision calculations. For ill-conditioned problems, on the contrary, this is precarious, since rounding errors in the input data, or rounding errors made while the steps of the computation are performed, may easily lead to large inaccuracies in the computed approximate solution.

But even for problems that are well-conditioned and that are in principle accurately computable using finite precision, it still depends on our choice of algorithm whether an accurate result is indeed obtained.

Some algorithms use steps that are by themselves ill-conditioned, causing errors in those steps that may be magnified by error propagation and/or may accumulate, leading to inaccurate results for an otherwise well-conditioned mathematical problem. When the problem itself is ill-conditioned, such ill-conditioned steps tend to be unavoidable, but when the problem is well-conditioned, it is often possible to devise alternative algorithms that avoid these ill-conditioned steps and lead to an accurate result.

We call algorithms that obtain accurate results for well-conditioned problems numerically stable algorithms.
On the contrary, algorithms that lead to unnecessary accuracy loss for well-conditioned problems, e.g., because they employ avoidable ill-conditioned steps, are called numerically unstable algorithms.

2.6.1 A Simple Example of a Stable and an Unstable Algorithm

Example 2.37: Stable Algorithm for the Roots of a Quadratic Polynomial

Consider the following mathematical problem:
$$P: \text{ compute the roots of } p(x) = x^2 - 400x + 2.$$
The solution of this problem is given with high accuracy by
$$x_1 \approx 399.9950, \qquad x_2 \approx 0.005000063.$$
We assume the problem is well-conditioned (this can be shown).

We illustrate the stability of two possible algorithms for computing the roots in the floating point number system $F(\beta = 10,\ t = 4,\ L = -10,\ U = 10)$, with unit roundoff $\mu = \tfrac{1}{2}\beta^{-t+1} = 0.5\cdot 10^{-3} = 0.0005$.

Algorithm 1: We use the standard formulas for computing the roots of a quadratic polynomial $ax^2 + bx + c = 0$, i.e.,
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a},$$
or, in our case, for $x^2 + 2fx + c = 0$, we have
$$x_{1,2} = -f \pm \sqrt{d}, \quad\text{with } f = -200,\ c = 2,\ d = f^2 - c.$$
In our floating point system we have
$$\mathrm{fl}(200) = 200, \qquad \mathrm{fl}(200^2) = 200^2 = 40\,000, \qquad \mathrm{fl}(2) = 2,$$
or, using symbols,
$$\mathrm{fl}(f) = f, \qquad \mathrm{fl}(f^2) = f^2, \qquad \mathrm{fl}(c) = c,$$
so we will not explicitly write the $\mathrm{fl}(\cdot)$ operation for $f$, $f^2$ and $c$ in what follows. For the discriminant $d$, we get
$$\mathrm{fl}(f^2 - c) = \mathrm{fl}(40\,000 - 2) = 40\,000 = \mathrm{fl}(f^2),$$
and we note that the contribution of $c = 2$ is lost in this operation due to rounding. So we get for the approximate roots $\hat{x}_1$ and $\hat{x}_2$
$$\hat{x}_1 = \mathrm{fl}\big[-f + \mathrm{fl}\big(\sqrt{\mathrm{fl}(f^2 - c)}\big)\big] = \mathrm{fl}\big[200 + \mathrm{fl}(\sqrt{40\,000})\big] = \mathrm{fl}[200 + 200] = 400,$$
$$\hat{x}_2 = \mathrm{fl}\big[-f - \mathrm{fl}\big(\sqrt{\mathrm{fl}(f^2 - c)}\big)\big] = \mathrm{fl}\big[200 - \mathrm{fl}(\sqrt{40\,000})\big] = \mathrm{fl}[200 - 200] = 0,$$
with relative errors
$$\frac{|x_1 - \hat{x}_1|}{|x_1|} \approx 1.25\cdot 10^{-5} < \mu, \qquad \frac{|x_2 - \hat{x}_2|}{|x_2|} = 1 \gg \mu.$$
We see that the result for $\hat{x}_2$ is highly inaccurate: we obtain a relative error of 100%. We note that catastrophic cancellation has occurred in computing $\hat{x}_2$: all accuracy was lost in computing the difference between two almost equal numbers in the expression $-f - \sqrt{f^2 - c}$.
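The computation in Algorithm 1 can be reproduced on a computer. The sketch below is a minimal illustration, assuming that Python's standard `decimal` module with a precision of 4 significant digits is an adequate stand-in for the system $F(\beta = 10, t = 4)$; the exponent bounds $L$ and $U$ are not enforced, and the variable names are chosen for illustration only.

```python
from decimal import Decimal, getcontext

# Emulate base-10 arithmetic with t = 4 significant digits (assumption: this
# mimics F(beta=10, t=4); the exponent range L, U is not modelled).
getcontext().prec = 4

f = Decimal(-200)
c = Decimal(2)

d = f * f - c          # 40000 - 2 is rounded back to 40000: c is lost
sqrt_d = d.sqrt()      # 200

x1_hat = -f + sqrt_d   # 200 + 200
x2_hat = -f - sqrt_d   # 200 - 200: catastrophic cancellation

print("x1_hat =", x1_hat, " x2_hat =", x2_hat)
```

Every operation is rounded to 4 digits, so the subtraction of $c$ has no effect on the discriminant and the small root is returned as zero.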
(The contribution of $c$, which is essential for the relative accuracy of the solution, was entirely lost.) We say that Algorithm 1 is numerically unstable, in this case because it clearly contains an ill-conditioned step in which accuracy is lost.

Algorithm 2: A more stable algorithm can be obtained as follows. We compute $x_1$, the largest root in absolute value, as above, but we compute $x_2$ using an alternative formula. Observe that $x_1 x_2 = c/a = c$ (here $a = 1$), because $p(x)$ can be factored as
$$p(x) = ax^2 + bx + c = a(x - x_1)(x - x_2).$$
So we compute $x_2$ from
$$x_2 = \frac{c}{x_1}.$$
This step is well-conditioned unless $x_1$ is small. So we have
$$\hat{x}_2 = \mathrm{fl}\left(\frac{c}{\hat{x}_1}\right) = \mathrm{fl}\left(\frac{2}{400}\right) = \mathrm{fl}(0.005) = 0.005,$$
with relative error
$$\frac{|x_2 - \hat{x}_2|}{|x_2|} \approx 1.2\cdot 10^{-5} < \mu.$$
Algorithm 2 is numerically stable (it avoids the ill-conditioned step).

2.6.2 Stability of LU Decomposition

It can be shown that the standard LU decomposition algorithm is somewhat unstable: the algorithm contains steps in which divisions by a small number occur, namely when pivot elements are used that are close to zero in absolute value. This may lead to large elements in the $L$ and $U$ factors, and may lead to inaccuracies, also for well-conditioned problems.

For this reason, the following partial pivoting variant of LU decomposition is often used: in every stage (indexed by $k$) of Gaussian elimination, determine the pivot element in position $(k, k)$ as follows. In column $k$ (starting from position $(k, k)$ and below) one determines the largest element in absolute value, and swaps the row containing this element with the current row $k$. As such, one chooses, in each stage, the pivot element with the largest absolute value. This extra operation is easy to implement and is computationally inexpensive (it does not change the asymptotic cost of the algorithm). The resulting algorithm tempers the growth of elements in $L$ and $U$ and is numerically more stable.
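The effect of a near-zero pivot, and of the row swap described above, can be seen on a $2\times 2$ system. The following is a minimal sketch in plain Python; the function name `solve_2x2` and the test system are hypothetical and chosen only for illustration. The exact solution of the system is very close to $(1, 1)$.

```python
def solve_2x2(A, b, pivot):
    """Gaussian elimination on a 2x2 system, optionally with partial pivoting."""
    (a11, a12), (a21, a22) = A
    b1, b2 = b
    if pivot and abs(a21) > abs(a11):
        # swap the two rows so that the larger entry becomes the pivot
        a11, a12, a21, a22 = a21, a22, a11, a12
        b1, b2 = b2, b1
    l21 = a21 / a11              # multiplier: huge if the pivot a11 is tiny
    u22 = a22 - l21 * a12
    y2 = b2 - l21 * b1
    x2 = y2 / u22                # backward substitution
    x1 = (b1 - a12 * x2) / a11
    return x1, x2

A = [[1e-20, 1.0], [1.0, 1.0]]
b = [1.0, 2.0]

x_no_pivot = solve_2x2(A, b, pivot=False)
x_pivot = solve_2x2(A, b, pivot=True)
print("without pivoting:", x_no_pivot)
print("with pivoting:   ", x_pivot)
```

Without pivoting the multiplier is of size $10^{20}$ and the first solution component is lost entirely in double precision; with the row swap both components are recovered.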
Algorithm 2.38: LU decomposition with partial pivoting, kij version, in-place
Input: a matrix $A \in \mathbb{R}^{n\times n}$ (and a small tolerance $\tau$ used to detect near-zero pivots)
Output: $L$ and $U$ stored in $A$, and a vector $\vec{p}$ storing the pivoting rows

    p = (1, 2, ..., n)^T                    ▷ initialise the permutation vector
    for k = 1, ..., n−1 do
        determine µ with k ≤ µ ≤ n such that |A(µ, k)| = ‖A(k:n, k)‖_∞
        if |A(µ, k)| < τ then
            break                           ▷ stop the loop if a near-zero pivot is found
        end if
        swap the elements p(k) and p(µ) of the permutation vector
        swap the rows A(k, :) and A(µ, :)
        rows = k+1 : n                      ▷ update all the rows below k
        A(rows, k) = A(rows, k) / A(k, k)
        A(rows, rows) = A(rows, rows) − A(rows, k) A(k, rows)
    end for

It can be seen as follows that this type of stability problem cannot occur when applying Cholesky decomposition to SPD matrices $A$. The Cholesky algorithm decomposes SPD matrices $A$ as $A = \hat{L}\hat{L}^T$, after which $A\vec{x} = \vec{b}$ is solved by forward and backward substitutions $\hat{L}\vec{y} = \vec{b}$ and $\hat{L}^T\vec{x} = \vec{y}$. We consider the matrix 2-norm and recall that, for SPD matrices $A$,
$$\|A\|_2 = \sqrt{\lambda_{\max}(AA^T)} = \sqrt{\lambda_{\max}(A^2)} = \lambda_{\max}(A).$$
For the Cholesky factor $\hat{L}$ we obtain
$$\|\hat{L}\|_2 = \sqrt{\lambda_{\max}(\hat{L}\hat{L}^T)} = \sqrt{\lambda_{\max}(A)} = \sqrt{\|A\|_2},$$
and, similarly,
$$\|\hat{L}^T\|_2 = \sqrt{\|A\|_2}.$$
This indicates that the elements of the factors in $A = \hat{L}\hat{L}^T$ cannot grow strongly relative to those of $A$. Cholesky decomposition is numerically stable (without need for pivoting).

Chapter 3 Least-Squares Problems and QR Factorisation

3.1 Gram-Schmidt Orthogonalisation and QR Factorisation

In this chapter, we generally consider real rectangular matrices $A$ with more rows than columns: $A \in \mathbb{R}^{m\times n}$ with $m \ge n$. As we will see in Section 3.3, such matrices arise in overdetermined linear systems (with more equations than unknowns), which may be solved in the least-squares (LS) sense.

When solving a LS problem, it will be useful to construct an orthonormal basis for
$$\mathrm{range}(A) = \mathrm{span}\{\vec{a}_1, \ldots, \vec{a}_n\},$$
the vector space spanned by the columns of $A = \begin{bmatrix} \vec{a}_1 & \cdots & \vec{a}_n \end{bmatrix}$.
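In this section we will consider the Gram-Schmidt algorithm to orthogonalise the columns of $A$, which will lead to the so-called QR factorisation of $A$.

3.1.1 Gram-Schmidt Orthogonalisation

For now, we will assume that $A \in \mathbb{R}^{m\times n}$, with $m \ge n$, has full rank, i.e., its columns are linearly independent. We seek to construct an orthonormal basis for $\mathrm{range}(A)$. We first recall the concept of expansion of a vector in an orthonormal basis.

Example 3.1
Let $\{\vec{e}_1, \vec{e}_2\}$ be a standard orthonormal basis for $\mathbb{R}^2$, i.e., $\vec{e}_i^{\,T}\vec{e}_j = \delta_{ij}$ for all $i, j$. Then any $\vec{x} \in \mathbb{R}^2$ can be expanded in the basis as
$$\vec{x} = (\vec{e}_1^{\,T}\vec{x})\,\vec{e}_1 + (\vec{e}_2^{\,T}\vec{x})\,\vec{e}_2.$$

In the Gram-Schmidt procedure, we begin by constructing an orthogonal set of vectors $\{\vec{v}_1, \ldots, \vec{v}_n\}$ that spans $\mathrm{range}(A) = \mathrm{span}\{\vec{a}_1, \ldots, \vec{a}_n\}$, by taking the vectors $\vec{a}_i$ and subtracting their components in the directions of the previous $\vec{v}_j$. For example, for the case where $A$ has 3 columns ($n = 3$):
$$\begin{aligned}
\vec{v}_1 &= \vec{a}_1,\\
\vec{v}_2 &= \vec{a}_2 - \frac{\vec{v}_1^{\,T}\vec{a}_2}{\|\vec{v}_1\|^2}\,\vec{v}_1,\\
\vec{v}_3 &= \vec{a}_3 - \frac{\vec{v}_1^{\,T}\vec{a}_3}{\|\vec{v}_1\|^2}\,\vec{v}_1 - \frac{\vec{v}_2^{\,T}\vec{a}_3}{\|\vec{v}_2\|^2}\,\vec{v}_2.
\end{aligned}$$
In this chapter, all vector norms denote 2-norms.

We then obtain the set of orthonormal vectors $\{\vec{q}_1, \ldots, \vec{q}_n\}$ such that
$$\mathrm{span}\{\vec{q}_1, \ldots, \vec{q}_n\} = \mathrm{span}\{\vec{a}_1, \ldots, \vec{a}_n\},$$
by normalising the vectors $\vec{v}_i$ to unit length:
$$\vec{q}_i = \frac{\vec{v}_i}{\|\vec{v}_i\|}.$$
For the $n = 3$ case, this results in
$$\begin{aligned}
\vec{q}_1\,\|\vec{v}_1\| &= \vec{v}_1 = \vec{a}_1,\\
\vec{q}_2\,\|\vec{v}_2\| &= \vec{v}_2 = \vec{a}_2 - (\vec{q}_1^{\,T}\vec{a}_2)\,\vec{q}_1,\\
\vec{q}_3\,\|\vec{v}_3\| &= \vec{v}_3 = \vec{a}_3 - (\vec{q}_1^{\,T}\vec{a}_3)\,\vec{q}_1 - (\vec{q}_2^{\,T}\vec{a}_3)\,\vec{q}_2.
\end{aligned} \qquad (3.1)$$
Writing $r_{jj} = \|\vec{v}_j\|$ and $r_{ij} = \vec{q}_i^{\,T}\vec{a}_j$ for $i < j$, we rewrite this as
$$\begin{aligned}
\vec{q}_1\, r_{11} &= \vec{a}_1,\\
\vec{q}_2\, r_{22} &= \vec{a}_2 - r_{12}\,\vec{q}_1,\\
\vec{q}_3\, r_{33} &= \vec{a}_3 - r_{13}\,\vec{q}_1 - r_{23}\,\vec{q}_2,
\end{aligned}$$
which leads to the factorisation of matrix $A$ as
$$A = \begin{bmatrix} \vec{a}_1 & \vec{a}_2 & \vec{a}_3 \end{bmatrix} = \begin{bmatrix} \vec{q}_1 & \vec{q}_2 & \vec{q}_3 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13}\\ 0 & r_{22} & r_{23}\\ 0 & 0 & r_{33} \end{bmatrix} = \hat{Q}\hat{R}.$$
This factorisation $A = \hat{Q}\hat{R}$ of $A$ is known as the reduced QR factorisation, see below.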
This leads to the following algorithm for Gram-Schmidt orthogonalisation:

Algorithm 3.2: Gram-Schmidt Orthogonalisation
Input: matrix $A \in \mathbb{R}^{m\times n}$
Output: the factor matrices $\hat{Q}$ and $\hat{R}$ in the thin QR factorisation $A = \hat{Q}\hat{R}$

    Q̂ = 0, R̂ = 0
    for j = 1:n do
        v_j = a_j
        for i = 1:j−1 do
            r̂_ij = q̂_i^T a_j
            v_j = v_j − r̂_ij q̂_i
        end for
        r̂_jj = ‖v_j‖
        q̂_j = v_j / r̂_jj
    end for

3.1.2 QR Factorisation

We now consider the general case of $A \in \mathbb{R}^{m\times n}$, with $m \ge n$, but where the columns of $A$ are not necessarily linearly independent.

Definition 3.3
Let $A \in \mathbb{R}^{m\times n}$. The reduced QR factorisation of $A$ is given by
$$A = \hat{Q}\hat{R}, \qquad (3.2)$$
where $\hat{Q} \in \mathbb{R}^{m\times n}$ has orthonormal columns, $\hat{Q}^T\hat{Q} = I_n$, and $\hat{R} \in \mathbb{R}^{n\times n}$ is upper triangular.

The $n$ columns of $\hat{Q}$ form an orthonormal basis for an $n$-dimensional subspace of $\mathbb{R}^m$. It is possible to expand this to a basis of the entire $\mathbb{R}^m$ by expanding $\hat{Q}$ on the right with $m - n$ additional columns that contain $m - n$ further orthonormal vectors in $\mathbb{R}^m$, leading to the (full) QR factorisation:

Definition 3.4
Let $A \in \mathbb{R}^{m\times n}$. The (full) QR factorisation of $A$ is given by
$$A = QR = Q \begin{bmatrix} \hat{R}\\ 0 \end{bmatrix}, \qquad (3.3)$$
where $Q \in \mathbb{R}^{m\times m}$ is an orthogonal matrix, $Q^TQ = I_m = QQ^T$, and $\hat{R} \in \mathbb{R}^{n\times n}$ is upper triangular.

Theorem 3.5
Every $A \in \mathbb{R}^{m\times n}$ has a full QR factorisation $A = QR$, and hence also a reduced QR factorisation $A = \hat{Q}\hat{R}$.

This can be shown, for the reduced QR factorisation, using the observation that in the Gram-Schmidt algorithm, if a zero $\vec{v}_j$ is obtained and $\vec{q}_j$ cannot be computed, one can instead choose any unit vector $\vec{q}_j$ that is orthogonal to the previous $\vec{q}_i$ (for example, by repeating the orthogonalisation step for determining $\vec{v}_j$ starting from a random vector in $\mathbb{R}^m$, instead of the original $j$th column $\vec{a}_j$ of $A$). For the full QR factorisation, the additional orthogonal columns of $Q$ can be determined in a similar manner.
3.1.3 Modified Gram-Schmidt Orthogonalisation

It can be observed in example computations, and shown theoretically, that the Gram-Schmidt algorithm is numerically unstable. If the orthonormal basis is computed as in Eq. (3.1), the resulting vectors $\vec{q}_i$ may suffer from loss of orthogonality due to rounding errors.

The stability can be improved substantially by the following small modification to the algorithm. For example, for $\vec{v}_3$ in Eq. (3.1), the original algorithm subtracts the component in the direction of $\vec{q}_2$ by projecting the original column vector $\vec{a}_3$ onto $\vec{q}_2$. In exact arithmetic $\vec{q}_1$ is orthogonal to $\vec{q}_2$, so the component of $\vec{a}_3$ in the direction of $\vec{q}_2$ is equal to the component of $\vec{a}_3 - (\vec{q}_1^{\,T}\vec{a}_3)\,\vec{q}_1$ in the direction of $\vec{q}_2$. In floating point arithmetic, however, $\vec{a}_3 - (\vec{q}_1^{\,T}\vec{a}_3)\,\vec{q}_1$ may have a slightly different component in the direction of $\vec{q}_2$ due to rounding, and it is better for stability to subtract the component of $\vec{a}_3 - (\vec{q}_1^{\,T}\vec{a}_3)\,\vec{q}_1$ in the direction of $\vec{q}_2$. In a similar manner we repeatedly subtract, as terms are added to determine each $\vec{v}_j$, the components in direction $\vec{q}_i$ of the intermediate result for $\vec{v}_j$, instead of the components of $\vec{a}_j$ in direction $\vec{q}_i$.

This results in the following modified Gram-Schmidt algorithm:

Algorithm 3.6: Modified Gram-Schmidt Orthogonalisation
Input: matrix $A \in \mathbb{R}^{m\times n}$
Output: the factor matrices $\hat{Q}$ and $\hat{R}$ in the thin QR factorisation $A = \hat{Q}\hat{R}$

    Q̂ = 0, R̂ = 0
    for j = 1:n do
        v_j = a_j
        for i = 1:j−1 do
            r̂_ij = q̂_i^T v_j
            v_j = v_j − r̂_ij q̂_i
        end for
        r̂_jj = ‖v_j‖
        q̂_j = v_j / r̂_jj
    end for

It can be shown that this modified version is substantially more stable than the original Gram-Schmidt procedure, but for ill-conditioned problems loss of orthogonality can still occur and a more stable approach is desired. In the next section we consider a procedure using orthogonal transformations of Householder reflection type that will accomplish this goal.
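The difference between Algorithms 3.2 and 3.6 can be observed numerically. The following is a minimal sketch in plain Python; the helper names and the nearly rank-deficient test matrix (three columns of the form $(1, \epsilon\,\vec{e}_i)$ with $\epsilon = 10^{-8}$) are assumptions chosen for illustration, and only the vectors $\vec{q}_j$ are returned, not $\hat{R}$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(columns, modified):
    """Orthonormalise a list of column vectors (classical or modified variant)."""
    qs = []
    for a in columns:
        v = list(a)
        for q in qs:
            # classical: project the original column a
            # modified:  project the intermediate result v
            r = dot(q, v) if modified else dot(q, a)
            v = [vi - r * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        qs.append([vi / norm for vi in v])
    return qs

def max_offdiag(qs):
    """Largest |q_i^T q_j| for i != j; zero for exactly orthogonal vectors."""
    return max(abs(dot(qs[i], qs[j]))
               for i in range(len(qs)) for j in range(i))

eps = 1e-8   # eps**2 is below double-precision resolution relative to 1
cols = [[1.0, eps, 0.0, 0.0],
        [1.0, 0.0, eps, 0.0],
        [1.0, 0.0, 0.0, eps]]

loss_classical = max_offdiag(gram_schmidt(cols, modified=False))
loss_modified = max_offdiag(gram_schmidt(cols, modified=True))
print("classical:", loss_classical, " modified:", loss_modified)
```

On this matrix the classical variant produces two vectors that are far from orthogonal, whereas the modified variant keeps the loss of orthogonality at roughly the size of $\epsilon$.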
3.2 QR Factorisation using Householder Transformations

Since Gram-Schmidt orthogonalisation and its modified version are deficient in terms of numerical stability, we seek a more stable approach to compute the QR decomposition. It turns out that an approach based on applying orthogonal transformations to $A$ results in a method with more favourable stability properties.

One reason why such methods have good stability properties is that multiplying $A$ with an orthogonal matrix $Q$ preserves the Euclidean length of the columns of $A$:

Theorem 3.7
Orthogonal matrices preserve Euclidean length.

Proof. Let $Q \in \mathbb{R}^{n\times n}$ with $Q^TQ = I$. Suppose $\vec{y} = Q\vec{x}$. Then
$$\|\vec{y}\| = \|Q\vec{x}\| = \sqrt{(Q\vec{x})^TQ\vec{x}} = \sqrt{\vec{x}^TQ^TQ\vec{x}} = \sqrt{\vec{x}^T\vec{x}} = \|\vec{x}\|,$$
where, as in the rest of this chapter, $\|\cdot\|$ indicates the vector 2-norm.

This means that the matrix element sizes of $QA$ cannot be much larger than those of $A$.

A useful further property of orthogonal matrices is the following:

Theorem 3.8
The product of orthogonal matrices is orthogonal.

Proof. Let $Q = Q_1Q_2$, where $Q_1, Q_2 \in \mathbb{R}^{n\times n}$ are orthogonal. Then
$$Q^TQ = (Q_1Q_2)^TQ_1Q_2 = Q_2^TQ_1^TQ_1Q_2 = I.$$

3.2.1 Householder Reflections

We want to transform $A \in \mathbb{R}^{m\times n}$ (with $m \ge n$) into an upper-triangular matrix $R \in \mathbb{R}^{m\times n}$ by applying orthogonal transformations to $A$.

Our approach will be to multiply $A$ by a sequence of orthogonal transformation matrices $Q_j$ that create zeros in column $j$ below the element in position $(j, j)$. This aim is similar to LU decomposition, but we know that each orthogonal transformation preserves the Euclidean length of the matrix columns it operates on.

Let's consider the first orthogonal transformation, $Q_1 \in \mathbb{R}^{m\times m}$, which is applied to
$$A = \begin{bmatrix} \vec{a}_1 & \cdots & \vec{a}_n \end{bmatrix}$$
such that $Q_1A$ has zeros in its first column below the first element. Since the length of column $\vec{a}_1$ is preserved, we know that the first element in the transformed column has to be $\pm\|\vec{a}_1\|$:
$$Q_1A = \begin{bmatrix} \pm\|\vec{a}_1\| & \vec{r}_1^{\,T}\\ 0 & \tilde{A}_2 \end{bmatrix}.$$
We choose for now a transformation $Q_1$ that results in a transformed first column with a negative value as its first element:
$$Q_1\vec{a}_1 = \begin{bmatrix} -\|\vec{a}_1\|\\ 0\\ \vdots\\ 0 \end{bmatrix}.$$
The specific type of transformation we choose for $Q_1$ (and all subsequent $Q_j$) is a reflection in $\mathbb{R}^m$ about a hyperplane that is orthogonal to the line from $\vec{a}_1$ to $Q_1\vec{a}_1$ and intersects this line at the midpoint between $\vec{a}_1$ and $Q_1\vec{a}_1$. This reflection operation is called a Householder reflection.

Let $\vec{v}_1$ be the vector pointing from $Q_1\vec{a}_1$ to $\vec{a}_1$:
$$\vec{v}_1 = \vec{a}_1 - Q_1\vec{a}_1,$$
and let $\vec{u}_1$ be the unit vector in that direction:
$$\vec{u}_1 = \frac{\vec{v}_1}{\|\vec{v}_1\|}.$$
The vector $\vec{u}_1$ is called a Householder vector. The operation of the Householder reflection $Q_1$ on a vector $\vec{x} \in \mathbb{R}^m$ can then be expressed as
$$Q_1\vec{x} = \vec{x} - 2(\vec{u}_1^{\,T}\vec{x})\,\vec{u}_1,$$
and, since $(\vec{u}_1^{\,T}\vec{x})\,\vec{u}_1 = (\vec{u}_1\vec{u}_1^{\,T})\,\vec{x}$, the matrix form of the Householder orthogonal transformation is given by
$$Q_1 = I_m - 2\vec{u}_1\vec{u}_1^{\,T}.$$

Theorem 3.9
Let $\vec{u} \in \mathbb{R}^m$ with $\|\vec{u}\| = 1$. Then the Householder reflection matrix $Q_{\vec{u}} = I_m - 2\vec{u}\vec{u}^T$ is a symmetric and orthogonal matrix.

Proof. Clearly, $Q_{\vec{u}}^T = Q_{\vec{u}}$. Then
$$Q_{\vec{u}}^TQ_{\vec{u}} = Q_{\vec{u}}^2 = (I_m - 2\vec{u}\vec{u}^T)(I_m - 2\vec{u}\vec{u}^T) = I_m - 4\vec{u}\vec{u}^T + 4\vec{u}(\vec{u}^T\vec{u})\vec{u}^T = I_m.$$

Finally, we note that the sign in
$$Q_1\vec{a}_1 = \begin{bmatrix} \pm\|\vec{a}_1\|\\ 0\\ \vdots\\ 0 \end{bmatrix}$$
is chosen in practical implementations based on numerical stability concerns. For numerical stability reasons, we choose the sign of $\pm\|\vec{a}_1\|$ opposite to the sign of the first component of the original column $\vec{a}_1$ of $A$, $(\vec{a}_1)_1$: this avoids catastrophic cancellation in computing $\vec{v}_1 = \vec{a}_1 - Q_1\vec{a}_1$ that may otherwise arise when $|(\vec{a}_1)_1| \approx \|\vec{a}_1\|$. In other words, we choose the sign of $\pm\|\vec{a}_1\|$ such that the size of $\vec{v}_1$ is as large as possible.

3.2.2 Using Householder Reflections to Compute the QR Factorisation

We now use a sequence of $n$ Householder reflections to compute the QR decomposition of $A \in \mathbb{R}^{m\times n}$.
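Before the reflections are combined, a single Householder reflection as defined in Section 3.2.1 can be checked numerically. The following is a minimal sketch in plain Python; the function names and the test vector are hypothetical, the input vector is assumed to be nonzero, and a zero first component is treated as positive.

```python
import math

def householder_vector(a):
    """Unit vector u with (I - 2 u u^T) a = -sign(a_1) ||a|| e_1 (a must be nonzero)."""
    norm_a = math.sqrt(sum(t * t for t in a))
    sign = 1.0 if a[0] >= 0 else -1.0          # sign opposite to a_1 in the target
    target = [-sign * norm_a] + [0.0] * (len(a) - 1)
    v = [ai - ti for ai, ti in zip(a, target)]  # v = a - Q a, no cancellation
    norm_v = math.sqrt(sum(t * t for t in v))
    return [t / norm_v for t in v]

def reflect(u, x):
    """Apply Q x = x - 2 (u^T x) u without forming the matrix Q."""
    s = sum(ui * xi for ui, xi in zip(u, x))
    return [xi - 2.0 * s * ui for xi, ui in zip(x, u)]

a = [3.0, 4.0, 0.0]
u = householder_vector(a)
Qa = reflect(u, a)                 # close to (-5, 0, 0)

x = [1.0, -2.0, 2.0]
Qx = reflect(u, x)                 # the length of x is preserved
print(Qa, math.sqrt(sum(t * t for t in Qx)))
```

Applying the reflection as $\vec{x} - 2(\vec{u}^T\vec{x})\,\vec{u}$ costs $O(m)$ operations per vector, whereas forming $I_m - 2\vec{u}\vec{u}^T$ explicitly would cost $O(m^2)$.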
The first transformation creates the desired zeros in the first column of $A$:
$$Q_1A = \begin{bmatrix} r_{11} & \vec{r}_1^{\,T}\\ 0 & \tilde{A}_2 \end{bmatrix} \in \mathbb{R}^{m\times n},$$
and is followed by a second orthogonal transformation of Householder type that creates zeros in the first column of $\tilde{A}_2 \in \mathbb{R}^{(m-1)\times(n-1)}$:
$$\tilde{Q}_2\tilde{A}_2 = \begin{bmatrix} r_{22} & \vec{r}_2^{\,T}\\ 0 & \tilde{A}_3 \end{bmatrix} \in \mathbb{R}^{(m-1)\times(n-1)}.$$
Defining
$$Q_2 = \begin{bmatrix} 1 & 0\\ 0 & \tilde{Q}_2 \end{bmatrix} \in \mathbb{R}^{m\times m},$$
these steps can be combined as
$$Q_2Q_1A = \begin{bmatrix} r_{11} & \vec{r}_1^{\,T}\\ 0 & \tilde{Q}_2\tilde{A}_2 \end{bmatrix} = \begin{bmatrix} r_{11} & \vec{r}_1^{\,T}\\ 0 & \begin{matrix} r_{22} & \vec{r}_2^{\,T}\\ 0 & \tilde{A}_3 \end{matrix} \end{bmatrix} \in \mathbb{R}^{m\times n},$$
and so on, with, in the next step,
$$\tilde{Q}_3\tilde{A}_3 = \begin{bmatrix} r_{33} & \vec{r}_3^{\,T}\\ 0 & \tilde{A}_4 \end{bmatrix} \in \mathbb{R}^{(m-2)\times(n-2)}, \quad\text{and}\quad Q_3 = \begin{bmatrix} I_2 & 0\\ 0 & \tilde{Q}_3 \end{bmatrix} \in \mathbb{R}^{m\times m},$$
etc. After $n$ transformations this results in
$$Q_nQ_{n-1}\cdots Q_2Q_1A = \begin{bmatrix} \hat{R}\\ 0 \end{bmatrix},$$
where $\hat{R} \in \mathbb{R}^{n\times n}$ is upper triangular, and
$$Q^T = Q_nQ_{n-1}\cdots Q_2Q_1$$
is an orthogonal matrix. Finally, since each $Q_j$ is symmetric and orthogonal, the QR factorisation of $A$ results as
$$A = Q_1Q_2\cdots Q_{n-1}Q_n \begin{bmatrix} \hat{R}\\ 0 \end{bmatrix} = QR.$$

3.2.3 Computing Q

In many cases, forming the $m\times m$ matrix $Q$ explicitly is not needed. For example, if only matrix-vector products $Q\vec{x}$ are required, one can save the Householder vectors $\vec{u}_i$ ($i = 1, \ldots, n$) and evaluate $Q\vec{x}$ as
$$Q\vec{x} = Q_1Q_2\cdots Q_{n-1}Q_n\,\vec{x}.$$
If $Q$ is desired explicitly, there are several options for constructing it:

• The transpose of $Q$ can be formed as the loop over the columns of $A$ progresses:
$$Q^T = Q_nQ_{n-1}\cdots Q_2Q_1I_m,$$
starting with the $Q_1$ multiplication, and then $Q_2$, etc., and $Q$ can be obtained by taking the transpose at the end. This is the approach used in the pseudocode for the QR decomposition by Householder reflections below. However, this approach is more costly than necessary because $Q_1$ is typically dense and does not have leading columns that are zero below the diagonal. Therefore, all subsequent Householder reflections with $\tilde{Q}_2, \ldots, \tilde{Q}_n$ need to be carried out on all $m$ columns of the relevant rows of the intermediate result (in forming $R$, in contrast, the transformations do not need to be carried out on the leading zero columns). The reverse order used in the next option avoids these extra flops.
• One can store the Householder vectors $\vec{u}_i$ ($i = 1, \ldots, n$) and form $Q$ at the end as
$$Q = Q_1Q_2\cdots Q_{n-1}Q_nI_m,$$
starting with the $Q_n$ multiplication, and then $Q_{n-1}$, etc. This is more efficient since the $\tilde{Q}_k$ don't need to be applied to the leading columns of the intermediate results that are zero below the diagonal.

This is pseudocode for computing the QR decomposition by Householder reflections:

Algorithm 3.10: QR Factorisation using Householder Transformations
Input: matrix $A \in \mathbb{R}^{m\times n}$
Output: the factor matrices $Q$ and $R$ in the (full) QR factorisation $A = QR$

    R = A
    Qt = I_m                                ▷ Qt will be the transpose of Q
    for k = 1:n do
        ▷ first determine the Householder vector u_k
        x = R(k:m, k)
        y = zeros(m−k+1, 1)
        if x_1 < 0 then
            y_1 = ‖x‖
        else
            y_1 = −‖x‖
        end if
        v = x − y
        u_k = v / ‖v‖
        ▷ apply the Householder transformation to the relevant part of R
        R(k:m, k:n) = R(k:m, k:n) − 2 u_k (u_k^T R(k:m, k:n))
        ▷ finally, update Qt (note: we need *all* columns here!)
        Qt(k:m, 1:m) = Qt(k:m, 1:m) − 2 u_k (u_k^T Qt(k:m, 1:m))
    end for
    Q = Qt^T

The following more compact version of the pseudocode computes the QR factorisation using Householder transformations without forming $Q$, and performs the operations in-place.

Algorithm 3.11: QR Factorisation using Householder Transformations (without forming Q)
Input: a matrix $A \in \mathbb{R}^{m\times n}$, $m > n$
Output: the factor matrix $\hat{R} \in \mathbb{R}^{n\times n}$ (stored in the upper triangle of $A$) and a sequence of vectors $\vec{u}_k$, $k = 1, \ldots, n$, that defines the sequence of Householder (orthogonal) transformations

    for k = 1, ..., n do
        b = A(k:m, k)
        v = b + sign(b_1) ‖b‖ e_1           ▷ with the convention sign(0) = 1
        u_k = v / ‖v‖
        A(k, k) = −sign(b_1) ‖b‖
        A(k+1:m, k) = 0
        A(k:m, k+1:n) = A(k:m, k+1:n) − (2 u_k) (u_k^T A(k:m, k+1:n))
    end for

3.2.4 Computational Work

When implementing the Householder algorithm, it is essential to implement the reflection by first computing the vector $\vec{z}_k^{\,T} = \vec{u}_k^{\,T}R(k:m, k:n)$ in
$$R(k:m, k:n) = R(k:m, k:n) - 2\vec{u}_k\,(\vec{u}_k^{\,T}R(k:m, k:n)),$$
rather than first constructing the rank-1 matrix $\vec{u}_k\vec{u}_k^{\,T}$ and multiplying it with $R(k:m, k:n)$, which is much more expensive. When implemented in this order, it can be shown that the dominant terms in the computational work are given by
$$W \approx 2mn^2 - \tfrac{2}{3}n^3 \text{ flops}.$$
Notes:

• For the case of square matrices, $m = n$, we have $W \approx 2n^3 - \tfrac{2}{3}n^3 = \tfrac{4}{3}n^3$ flops, which is twice the work of LU decomposition.

• The QR decomposition can be used to solve linear systems as follows:
1. Compute $Q$ and $R$ in $A = QR$, e.g., using Householder transformations.
2. The system $A\vec{x} = \vec{b}$ can be solved by backward substitution, as can be seen from the following equivalences:
$$A\vec{x} = \vec{b} \iff QR\vec{x} = \vec{b} \iff Q^TQR\vec{x} = Q^T\vec{b} \iff R\vec{x} = Q^T\vec{b}.$$
Solving the system in this way is more stable than using the LU decomposition, but comes at twice the cost.

• The QR decomposition using Householder transformations is particularly useful for solving least-squares problems in a numerically stable way, as will be explained in the next sections.

3.3 Overdetermined Systems and Least-Squares Problems

Consider linear systems $A\vec{x} = \vec{b}$ with $A \in \mathbb{R}^{m\times n}$, where $m > n$. Such overdetermined linear systems, where there are more equations ($m$) than unknowns ($n$), are common in applications.

Example 3.12
Consider the linear regression problem of finding the "best" linear relation $y(t) = c\,t + d$ between $m$ observations $\{(t_i, y_i)\}$ for $i = 1, \ldots, m$. We aim to solve the following linear system
$$\begin{bmatrix} t_1 & 1\\ t_2 & 1\\ \vdots & \vdots\\ t_m & 1 \end{bmatrix} \begin{pmatrix} c\\ d \end{pmatrix} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}.$$
(3.4)

However, the above linear system is overdetermined when $m > 2$ and in general does not have a solution.

Exact solutions do not generally exist for overdetermined systems $A\vec{x} = \vec{b}$, $A \in \mathbb{R}^{m\times n}$, $m > n$. Instead, one can seek the "optimal" $\vec{x}$ that minimises the residual vector $\vec{r} = \vec{b} - A\vec{x} \in \mathbb{R}^m$ in some norm. A popular choice for the norm, which can be justified, e.g., in statistical applications, is the 2-norm. This leads to the following definition of an overdetermined linear least-squares (LS) problem:

Definition 3.13: Least-Squares Problem
Let $A \in \mathbb{R}^{m\times n}$ with $m > n$. Find $\vec{x}$ that minimises $f(\vec{x}) = \|\vec{b} - A\vec{x}\|_2^2$.

Note that
$$f(\vec{x}) = \|\vec{b} - A\vec{x}\|_2^2 = \|\vec{r}\|_2^2 = \sum_{k=1}^{m} r_k^2,$$
which explains the name: the solution sought is the one with the least sum of squares of the residual components.

3.3.1 The Normal Equations – A Geometric View

Let $A \in \mathbb{R}^{m\times n}$ with $m > n$. The columns of $A$ span a subspace of $\mathbb{R}^m$. The solution of the LS problem is the vector $\vec{x} \in \mathbb{R}^n$ such that the vector $A\vec{x} \in \mathrm{range}(A)$ is the best approximation of $\vec{b}$ in $\mathrm{range}(A)$, in the sense that $\vec{x}$ minimises the norm of the residual, $\vec{r} = \vec{b} - A\vec{x}$.

The residual $\vec{r} = \vec{b} - A\vec{x}$ is minimal in norm if it is orthogonal to $\mathrm{range}(A)$ (or, equivalently, if $A\vec{x}$ is the orthogonal projection of $\vec{b}$ onto $\mathrm{range}(A)$). If we write out this geometric condition, we find a linear system of equations that specifies the solution of the LS problem:
$$\begin{aligned}
\vec{r} \perp A\vec{z}\ \ \forall\vec{z} &\iff (A\vec{z})^T\vec{r} = 0\ \ \forall\vec{z}\\
&\iff \vec{z}^TA^T(\vec{b} - A\vec{x}) = 0\ \ \forall\vec{z}\\
&\iff A^T\vec{b} - A^TA\vec{x} = 0\\
&\iff A^TA\vec{x} = A^T\vec{b},
\end{aligned}$$
where $A^TA \in \mathbb{R}^{n\times n}$. The equations $A^TA\vec{x} = A^T\vec{b}$ are called the normal equations; solving them is the first way to compute the LS solution. One problem in this approach is that $A^TA$ can be ill-conditioned, more so than $A$ (see below).

3.3.2 The Normal Equations

The following theorem shows that linear least-squares problems can be solved by finding the solution of a square linear system with matrix $A^TA$.

Theorem 3.14
Let $A \in \mathbb{R}^{m\times n}$ with $m > n$.
1.
Any minimiser of $f(\vec{x}) = \|\vec{b} - A\vec{x}\|_2^2$ satisfies $A^TA\vec{x} = A^T\vec{b}$.
2. Any solution of the normal equations is a minimiser of $f(\vec{x})$.
3. If $A$ has linearly independent columns, then $A^TA\vec{x} = A^T\vec{b}$ (and the least-squares problem) has a unique solution.

Proof.
1. Consider
$$f(\vec{x}) = \sum_{k=1}^{m} r_k^2 = \sum_{k=1}^{m}\Big(b_k - \sum_{j=1}^{n} a_{kj}x_j\Big)^2.$$
If $\vec{x}$ is a minimiser of $f(\vec{x})$, then $\vec{x}$ satisfies the optimality equations
$$\frac{\partial f}{\partial x_i} = 0 \qquad (i = 1, \ldots, n).$$
This gives
$$\frac{\partial f}{\partial x_i} = -\sum_{k=1}^{m} 2a_{ki}\Big(b_k - \sum_{j=1}^{n} a_{kj}x_j\Big) = 0 \qquad (i = 1, \ldots, n),$$
or
$$\sum_{k=1}^{m}\sum_{j=1}^{n} a_{ki}a_{kj}x_j = \sum_{k=1}^{m} a_{ki}b_k \qquad (i = 1, \ldots, n).$$
It is easy to see that this corresponds to $A^TA\vec{x} = A^T\vec{b}$. (Check!) Note that solutions of this equation could, a priori, also be maximisers of $f(\vec{x})$, which we exclude in the next part.

2. Let $\vec{x}$ satisfy $A^TA\vec{x} = A^T\vec{b}$, and let $\vec{r} = \vec{b} - A\vec{x}$, so that $A^T\vec{r} = 0$. Then
$$f(\vec{x} + \vec{u}) \ge f(\vec{x}) \qquad \forall\,\vec{u} \in \mathbb{R}^n,$$
since
$$\begin{aligned}
f(\vec{x} + \vec{u}) &= (\vec{b} - A(\vec{x} + \vec{u}))^T(\vec{b} - A(\vec{x} + \vec{u}))\\
&= (\vec{r} - A\vec{u})^T(\vec{r} - A\vec{u})\\
&= \vec{r}^T\vec{r} - \vec{r}^TA\vec{u} - \vec{u}^TA^T\vec{r} + \vec{u}^TA^TA\vec{u}\\
&= \vec{r}^T\vec{r} - 2\vec{u}^TA^T\vec{r} + \vec{u}^TA^TA\vec{u}\\
&= f(\vec{x}) + \|A\vec{u}\|_2^2,
\end{aligned}$$
where the last step uses $A^T\vec{r} = 0$.

3. If $A$ has linearly independent columns, then $A\vec{x} \neq 0$ for all $\vec{x} \neq 0$. Therefore, $\vec{x}^TA^TA\vec{x} = \|A\vec{x}\|_2^2 > 0$ for all $\vec{x} \neq 0$ and $A^TA$ is SPD. This implies that $A^TA$ is nonsingular, so $A^TA\vec{x} = A^T\vec{b}$ has a unique solution.

Note: If $A$ has linearly dependent columns, then $A^TA$ is singular and the normal equations have infinitely many solutions.

3.3.3 Computational Work for Forming and Solving the Normal Equations

If $A \in \mathbb{R}^{m\times n}$ with $m > n$ has linearly independent columns, then the LS solution of $A\vec{x} = \vec{b}$ can be computed efficiently using Cholesky decomposition applied to the normal equations, since $A^TA \in \mathbb{R}^{n\times n}$ is SPD. The dominant terms in the computational work, including the cost of forming $A^TA$, are
$$W \approx n^3/3 + n^2(2m - 1) \text{ flops},$$
where the cost of forming $A^TA$ dominates strongly for $m \gg n$.

3.3.4 Numerical Stability of Using the Normal Equations

Regarding conditioning, we have for general non-symmetric square $A \in \mathbb{R}^{n\times n}$
$$\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \frac{\sqrt{\lambda_{\max}(A^TA)}}{\sqrt{\lambda_{\min}(A^TA)}}.$$
This can be extended to rectangular matrices $A \in \mathbb{R}^{m\times n}$ with $m > n$ and linearly independent columns, using the same expressions for $\kappa_2$ (since $\lambda_i(A^TA) > 0$ for all $i = 1, \ldots, n$ in this case).

The condition number of the matrix $A^TA$ arising in the normal equations is given by
$$\kappa_2(A^TA) = \frac{\sigma_{\max}(A^TA)}{\sigma_{\min}(A^TA)} = \frac{\sqrt{\lambda_{\max}((A^TA)^TA^TA)}}{\sqrt{\lambda_{\min}((A^TA)^TA^TA)}}.$$
Since
$$\begin{aligned}
\sigma_{\max}(A^TA) &= \sqrt{\lambda_{\max}((A^TA)^TA^TA)} = \sqrt{\lambda_{\max}((A^TA)^2)} = \lambda_{\max}(A^TA),\\
\sigma_{\min}(A^TA) &= \sqrt{\lambda_{\min}((A^TA)^TA^TA)} = \sqrt{\lambda_{\min}((A^TA)^2)} = \lambda_{\min}(A^TA),
\end{aligned}$$
we obtain
$$\kappa_2(A^TA) = \kappa_2^2(A).$$
This indicates that forming the normal equations squares the condition number of the original matrix $A$, so that the resulting system may be ill-conditioned.

In the next section we will see how the QR decomposition of $A$, e.g., using Householder transformations, can be used to solve the LS problem. This avoids the squaring of the condition number of $A$, and is numerically more stable than solving the normal equations.

3.4 Solving Least-Squares Problems using QR Factorisation

We consider the overdetermined system $A\vec{x} = \vec{b}$ with $A \in \mathbb{R}^{m\times n}$ ($m \ge n$), $\vec{b} \in \mathbb{R}^m$, $\vec{x} \in \mathbb{R}^n$. We seek to solve the system in the least-squares sense, i.e., we minimise
$$\|\vec{r}\|_2 = \|\vec{b} - A\vec{x}\|_2. \qquad (3.5)$$
We use the (full) QR decomposition of $A$, with
$$A = Q\begin{bmatrix} \hat{R}\\ 0 \end{bmatrix},$$
where $Q = \begin{bmatrix} \hat{Q} \,|\, \bar{Q} \end{bmatrix} \in \mathbb{R}^{m\times m}$, $\hat{Q} \in \mathbb{R}^{m\times n}$ and $\hat{R} \in \mathbb{R}^{n\times n}$. The factors $\hat{Q}$ and $\hat{R}$ can be obtained using the Householder algorithm. Then we observe that
$$\begin{aligned}
\|\vec{r}\|_2^2 = \|Q^T\vec{r}\|_2^2 &= \|Q^T(\vec{b} - A\vec{x})\|_2^2 = \left\|Q^T\left(\vec{b} - Q\begin{bmatrix} \hat{R}\\ 0 \end{bmatrix}\vec{x}\right)\right\|_2^2\\
&= \left\|\begin{bmatrix} \hat{Q}^T\vec{b}\\ \bar{Q}^T\vec{b} \end{bmatrix} - \begin{bmatrix} \hat{R}\vec{x}\\ 0 \end{bmatrix}\right\|_2^2 = \|\hat{Q}^T\vec{b} - \hat{R}\vec{x}\|_2^2 + \underbrace{\|\bar{Q}^T\vec{b}\|_2^2}_{\text{indep. of } \vec{x}}.
\end{aligned}$$
Thus $\|\vec{r}\|_2^2$ is minimal when $\hat{Q}^T\vec{b} - \hat{R}\vec{x} = 0$, or
$$\hat{R}\vec{x} = \hat{Q}^T\vec{b}. \qquad (3.6)$$
We solve this $n\times n$ system by backward substitution to find the optimal $\vec{x}$. This is numerically more stable than solving the normal equations.

3.4.1 Geometric Interpretation in Terms of Projection Matrices

Equation (3.6) can be interpreted geometrically as follows. We know that the vector $\vec{x}$ minimising Eq.
(3.5) satisfies
$$A\vec{x} = \text{the orthogonal projection of } \vec{b} \text{ onto } \mathrm{range}(A).$$
The columns of $\hat{Q} = \begin{bmatrix} \vec{q}_1 & \cdots & \vec{q}_n \end{bmatrix}$ form an orthonormal basis of $\mathrm{range}(A)$ (when $A$ has linearly independent columns). The product
$$\hat{Q}^T\vec{b} = \begin{bmatrix} \vec{q}_1^{\,T}\vec{b}\\ \vdots\\ \vec{q}_n^{\,T}\vec{b} \end{bmatrix}$$
contains the projection coefficients of $\vec{b}$ onto the basis vectors $\vec{q}_i$. Then
$$\hat{Q}\hat{Q}^T\vec{b} = (\vec{q}_1^{\,T}\vec{b})\,\vec{q}_1 + \ldots + (\vec{q}_n^{\,T}\vec{b})\,\vec{q}_n = (\vec{q}_1\vec{q}_1^{\,T})\,\vec{b} + \ldots + (\vec{q}_n\vec{q}_n^{\,T})\,\vec{b}$$
is the orthogonal projection of $\vec{b}$ onto $\mathrm{range}(A)$. So we conclude that the LS solution $\vec{x}$ satisfies
$$\begin{aligned}
A\vec{x} &= \hat{Q}\hat{Q}^T\vec{b},\\
\hat{Q}\hat{R}\vec{x} &= \hat{Q}\hat{Q}^T\vec{b},\\
\hat{Q}^T\hat{Q}\hat{R}\vec{x} &= \hat{Q}^T\hat{Q}\hat{Q}^T\vec{b},
\end{aligned}$$
or
$$\hat{R}\vec{x} = \hat{Q}^T\vec{b},$$
since $\hat{Q}^T\hat{Q} = I_n$. This is a geometric way to derive result (3.6).

Note that the matrix $P = \hat{Q}\hat{Q}^T \in \mathbb{R}^{m\times m}$ is an orthogonal projection matrix, since it satisfies $P^2 = P$ and $P^T = P$. The matrix-vector product $\hat{Q}\hat{Q}^T\vec{z}$ projects any vector $\vec{z} \in \mathbb{R}^m$ orthogonally onto $\mathrm{range}(A)$. The orthogonal projector
$$\hat{Q}\hat{Q}^T = \vec{q}_1\vec{q}_1^{\,T} + \ldots + \vec{q}_n\vec{q}_n^{\,T}$$
is composed of the sum of $n$ rank-one orthogonal projection matrices
$$P_i = \vec{q}_i\vec{q}_i^{\,T} \in \mathbb{R}^{m\times m},$$
with each $P_i$ satisfying $P_i^2 = P_i$ and $P_i^T = P_i$.

3.5 Alternating Least-Squares Algorithm for Movie Recommendation

Continuing the discussion on algorithms for movie recommendation from Section 1.3, we now proceed with formulating a least-squares-based optimisation algorithm to compute matrices $U \in \mathbb{R}^{f\times m}$ and $M \in \mathbb{R}^{f\times n}$, with $f$ a small integer, $f \ll m, n$, such that $U^TM$ approximates the ratings matrix $R$ on the set of known ratings, $\mathcal{R}$:
$$R \approx U^TM. \qquad (3.7)$$
In particular, we seek $U$ and $M$ that minimise
$$g(U, M) = \|R - U^TM\|_{F,\mathcal{R}}^2 + \lambda N(U, M), \qquad (3.8)$$
where the $\|\cdot\|_{F,\mathcal{R}}$ norm is a partial Frobenius norm, summed only over the known entries of $R$, as given by the index set $\mathcal{R}$, and $N(U, M)$ is a regularisation term.

We will now explain the details of the Alternating Least Squares (ALS) algorithm for solving minimisation problem (3.8).
The algorithm determines $U$ and $M$ in an alternating fashion: starting from an initial guess for $U$, determine the optimal $M$ with $U$ fixed, then determine the optimal $U$ with $M$ fixed, and so forth. Each subproblem of determining $M$ with fixed $U$ (and vice versa) in this alternating algorithm boils down to a (regularised) linear least-squares problem, as we will now describe.

3.5.1 Least-Squares Subproblems for Movie Recommendation

For each user $i$, let $J_i = \{j_1, j_2, j_3, \ldots\}$ be the set of the indices $j$ of the movies rated by user $i$, and for each movie $j$, let $I_j = \{i_1, i_2, i_3, \ldots\}$ be the set of the indices $i$ of the users who have rated movie $j$. Let $|J_i|$ be the number of movies rated by user $i$, and let $|I_j|$ be the number of users who have rated movie $j$.

Then the function (3.8) we want to minimise is given specifically by
$$\min_{U,M}\, g(U, M) = \sum_{(i,j)\in\mathcal{R}} \big(r_{ij} - \vec{u}_i^{\,T}\vec{m}_j\big)^2 + \lambda\left(\sum_{i=1}^{m} |J_i|\,\|\vec{u}_i\|_2^2 + \sum_{j=1}^{n} |I_j|\,\|\vec{m}_j\|_2^2\right), \qquad (3.9)$$
with $\lambda$ a fixed regularisation parameter.

We first rewrite the first part of the objective function $g(U, M)$ as a sum over all movies:
$$\min_{U,M}\, g(U, M) = \sum_{j=1}^{n} \|\vec{r}_j - U^T\vec{m}_j\|_{2,I_j}^2 + \lambda\left(\sum_{i=1}^{m} |J_i|\,\|\vec{u}_i\|_2^2 + \sum_{j=1}^{n} |I_j|\,\|\vec{m}_j\|_2^2\right), \qquad (3.10)$$
where $\|\cdot\|_{2,I_j}$ is a partial 2-norm, summed only over the vector entries that correspond to users who have rated movie $j$, as given by the index set $I_j$. That is,
$$\|\vec{r}_j - U^T\vec{m}_j\|_{2,I_j}^2 = \|\vec{r}_{I_j} - U_{I_j}^T\vec{m}_j\|_2^2,$$
where $\vec{r}_{I_j}$ is the vector containing all the known ratings for movie $j$ (the elements of column $\vec{r}_j$ of $R$ that contain ratings, by the users in the index set $I_j$), and $U_{I_j}^T$ is a submatrix of the user matrix $U^T$ that contains only the rows of the users that have rated movie $j$.

We rewrite Eq. (3.10) as
$$\min_{U,M}\, g(U, M) = \sum_{j=1}^{n} \|\vec{r}_{I_j} - U_{I_j}^T\vec{m}_j\|_2^2 + \lambda\left(\sum_{i=1}^{m} |J_i|\,\|\vec{u}_i\|_2^2 + \sum_{j=1}^{n} |I_j|\,\|\vec{m}_j\|_2^2\right). \qquad (3.11)$$
In the first half of an ALS iteration, we fix $U$, and find the optimal $M$ given that fixed $U$.
To this end, we set the gradient of $g(U, M)$ with respect to the elements of $M$ equal to zero. It is convenient to express this for each of the columns $\vec{m}_j$ of $M$:
$$\nabla_{\vec{m}_j}\, g(U, M) = \nabla_{\vec{m}_j}\|\vec{r}_{I_j} - U_{I_j}^T\vec{m}_j\|_2^2 + \lambda|I_j|\,\nabla_{\vec{m}_j}\|\vec{m}_j\|_2^2 = 0 \qquad (j = 1, \ldots, n). \qquad (3.12)$$
These are $n$ independent (regularised) linear least-squares problems for the $n$ columns $\vec{m}_j$ of the movie matrix $M$ (with fixed user matrix $U$). To compute the gradients in these expressions, the proof of Theorem 3.14 shows that
$$\nabla_{\vec{x}}\|\vec{b} - A\vec{x}\|_2^2 = -2A^T(\vec{b} - A\vec{x}) = 2(A^TA\vec{x} - A^T\vec{b}),$$
and we also have (e.g., as a special case of the above) that
$$\nabla_{\vec{x}}\|\vec{x}\|_2^2 = 2\vec{x}.$$
Applying these to Eq. (3.12) gives the $n$ (regularised) normal equation conditions
$$2\big(U_{I_j}U_{I_j}^T\vec{m}_j - U_{I_j}\vec{r}_{I_j}\big) + 2\lambda|I_j|\,\vec{m}_j = 0,$$
that is,
$$\big(U_{I_j}U_{I_j}^T + \lambda|I_j|\,I\big)\,\vec{m}_j = U_{I_j}\vec{r}_{I_j} \qquad (j = 1, \ldots, n). \qquad (3.13)$$
Solving these small $f\times f$ linear systems for the columns $\vec{m}_j$ of $M$ (which can be done in parallel) updates $M$ in the first half of an ALS iteration.

The second half of the ALS iteration fixes $M$ and updates $U$ in a manner completely analogous to the first half of the iteration. Specifically, we define the transpose, $Q$, of the ratings matrix $R$,
$$Q = R^T,$$
and write
$$Q \approx M^TU. \qquad (3.14)$$
We rewrite Eq. (3.9) as
$$\min_{U,M}\, g(U, M) = \sum_{i=1}^{m} \|\vec{q}_{J_i} - M_{J_i}^T\vec{u}_i\|_2^2 + \lambda\left(\sum_{i=1}^{m} |J_i|\,\|\vec{u}_i\|_2^2 + \sum_{j=1}^{n} |I_j|\,\|\vec{m}_j\|_2^2\right), \qquad (3.15)$$
where $\vec{q}_{J_i}$ is the vector containing all the known ratings given by user $i$ (the elements of column $\vec{q}_i$ of $Q$ that contain ratings, for the movies in the index set $J_i$), and $M_{J_i}^T$ is a submatrix of the movie matrix $M^T$ that contains only the rows of the movies that are rated by user $i$.

Setting the gradient with respect to the elements of $U$ equal to zero, column-by-column, gives
$$\nabla_{\vec{u}_i}\, g(U, M) = \nabla_{\vec{u}_i}\|\vec{q}_{J_i} - M_{J_i}^T\vec{u}_i\|_2^2 + \lambda|J_i|\,\nabla_{\vec{u}_i}\|\vec{u}_i\|_2^2 = 0 \qquad (i = 1, \ldots, m). \qquad (3.16)$$
This gives the $m$ (regularised) normal equations
$$\big(M_{J_i}M_{J_i}^T + \lambda|J_i|\,I\big)\,\vec{u}_i = M_{J_i}\vec{q}_{J_i} \qquad (i = 1, \ldots, m).$$
(3.17) Solving these small f × f linear systems for the columns ~ui of U (which, again, can be done in parallel) updates U in the second half of an ALS iteration. 62 Chapter 3. Least-Squares Problems and QR Factorisation Chapter 4 The Conjugate Gradient Method for Sparse SPD Systems In this Chapter we will consider iterative methods for solving linear systems A~x = ~b, for the specific case where A ∈ Rn×n is a symmetric positive definite (SPD) matrix. When using direct solvers for linear systems A~x = ~b such as Gaussian elimina- tion / LU decomposition and Cholesky decomposition, the algorithm is executed until completion at which time one obtains the exact solution (in exact arith- metic), and the algorithm does not generate approximate solutions along the way. In contrast, iterative methods start from an initial guess ~x0 for the solution that one seeks to improve in a sequence of approximations ~x0, ~x1, ~x2, ~x3, . . . until some convergence criterion is attained that typically prescribes a desired accuracy in the approximation. Iterative methods can be advantageous in terms of computational cost, in particular for large-scale problems that involve highly sparse matrices. For ex- ample, the matrix A ∈ Rn×n in our 2D model problem has about 5 nonzeros per row. The cost for a matrix-vector product is therefore O(n) flops (in particular, about 9n flops). The cost per iteration of iterative solvers is often proportional to the cost of a matrix-vector product. So if an iterative solver can solve A~x = ~b up to a desired accuracy in a number of iterations that does not grow strongly with n, then it can often beat direct solvers. For example, for the 2D model problem, iterative solvers exist, with O(n) cost per iteration, that converge to the accuracy with which the PDE was discretised in a number of iterations that does not grow with problem size. 
Those iterative solvers can obtain an accurate answer in $O(n)$ work, which, for large problems, is much faster than the $O(n^3)$ cost of LU decomposition, or the $O(n^2)$ cost of banded LU decomposition.

In this chapter we will start exploring such iterative methods for solving $A\vec{x} = \vec{b}$, for the particular case that $A$ is SPD, which arises in many applications.

4.1 An Optimisation Problem Equivalent to SPD Linear Systems

Theorem 4.1
Let $A \in \mathbb{R}^{n\times n}$ be an SPD matrix. Then

$$\phi(\vec{x}) = \tfrac{1}{2} \vec{x}^T A \vec{x} - \vec{b}^T \vec{x} + c \tag{4.1}$$

with $c$ an arbitrary constant, has a unique global minimiser, $\vec{x}^*$, which is the unique solution of $A\vec{x} = \vec{b}$.

Proof. Since $A$ is SPD, it is nonsingular and $A\vec{x} = \vec{b}$ has a unique solution, which we call $\vec{x}^*$. Given an approximation $\vec{x}$ of $\vec{x}^*$, we define the error $\vec{e}$ by $\vec{e} = \vec{x}^* - \vec{x}$. Considering the squared $A$-norm of the error, we find

$$\begin{aligned}
\|\vec{x}^* - \vec{x}\|_A^2 = \|\vec{e}\|_A^2 = \vec{e}^T A \vec{e} &= (\vec{x}^* - \vec{x})^T A (\vec{x}^* - \vec{x}) \\
&= \vec{x}^{*T} A \vec{x}^* - \vec{x}^{*T} A \vec{x} - \vec{x}^T A \vec{x}^* + \vec{x}^T A \vec{x} \\
&= \vec{x}^{*T} A \vec{x}^* - 2\vec{x}^T A \vec{x}^* + \vec{x}^T A \vec{x} \qquad \text{(since } A = A^T\text{)} \\
&= \vec{x}^{*T} \vec{b} - 2\vec{x}^T \vec{b} + \vec{x}^T A \vec{x} \\
&= \vec{x}^{*T} \vec{b} + 2(\phi(\vec{x}) - c).
\end{aligned}$$

Since $A$ is SPD, the left-hand side is zero for $\vec{x} = \vec{x}^*$ and strictly positive otherwise, so taking $\vec{x} = \vec{x}^*$ uniquely minimises the LHS of this equality. Hence $\vec{x}^*$ is also the unique minimiser of the RHS, and therefore of $\phi(\vec{x})$, because $\vec{x}^{*T}\vec{b}$ and $c$ are independent of $\vec{x}$. This shows that $\phi(\vec{x})$ has a unique global minimiser, which is the solution of $A\vec{x} = \vec{b}$.

4.2 The Steepest Descent Method

The first iterative method we consider here for solving $A\vec{x} = \vec{b}$ is based on a basic optimisation method for solving the optimisation problem

$$\min_{\vec{x}} \phi(\vec{x}).$$

Recall that the gradient of $\phi(\vec{x})$, $\nabla\phi(\vec{x})$, points in the direction of steepest ascent of $\phi(\vec{x})$, and is orthogonal to the level surfaces of $\phi(\vec{x})$. The direction of steepest descent is given by $-\nabla\phi(\vec{x})$. In the case of $\phi(\vec{x})$ corresponding to the SPD linear system $A\vec{x} = \vec{b}$ (Eq. (4.1)), we find

$$-\nabla\phi(\vec{x}) = -(A\vec{x} - \vec{b}) = \vec{r},$$

where the residual $\vec{r}$ is defined as $\vec{r} = \vec{b} - A\vec{x}$.
Here, we have used that $\nabla(\vec{x}^T A \vec{x}) = A\vec{x} + A^T\vec{x}$, or $\nabla(\vec{x}^T A \vec{x}) = 2A\vec{x}$ when $A$ is symmetric.

The steepest descent optimisation method proceeds as follows. Suppose we are given an initial approximation $\vec{x}_0$. We seek a new, improved approximation $\vec{x}_1$ by considering $\phi(\vec{x})$ along a line in the direction of steepest descent, $-\nabla\phi(\vec{x}_0) = \vec{r}_0$, where we define, for approximation $\vec{x}_i$,

$$\vec{r}_i = \vec{b} - A\vec{x}_i.$$

That is, we determine the next approximation $\vec{x}_1$ of the form

$$\vec{x}_1 = \vec{x}_0 + \alpha_1 \vec{r}_0,$$

where $\vec{r}_0 = -\nabla\phi(\vec{x}_0)$ is called the search direction. Considering $\vec{x}_1(\alpha_1)$ as a function of $\alpha_1$, we determine the optimal step length $\alpha_1$ from $\vec{x}_0$ along the search direction from the condition

$$\frac{d}{d\alpha_1} \phi(\vec{x}_1(\alpha_1)) = 0,$$

which leads to

$$0 = \frac{d}{d\alpha_1} \phi(\vec{x}_1(\alpha_1)) = \nabla\phi(\vec{x}_1)^T \frac{d\vec{x}_1}{d\alpha_1} = -\vec{r}_1^T \vec{r}_0.$$

This has the natural interpretation that the optimal step length is obtained at the point $\vec{x}_1$ where the line on which we seek the new approximation is tangent to a level surface, i.e., at the new point $\vec{x}_1$ the new gradient, $-\vec{r}_1$, is orthogonal to the search direction $\vec{r}_0$. This condition leads to an expression for the optimal step length as follows:

$$0 = -\vec{r}_1^T \vec{r}_0 = -(\vec{b} - A\vec{x}_1)^T \vec{r}_0 = -(\vec{b} - A(\vec{x}_0 + \alpha_1 \vec{r}_0))^T \vec{r}_0 = -(\vec{r}_0 - \alpha_1 A \vec{r}_0)^T \vec{r}_0,$$

or

$$\alpha_1 = \frac{\vec{r}_0^T \vec{r}_0}{\vec{r}_0^T A \vec{r}_0}.$$

This process is repeated to determine $\vec{x}_2, \vec{x}_3, \ldots$, until a stopping criterion is satisfied.

Figure 4.2.1: Steepest descent convergence pattern for matrix (4.2) with $\lambda = 2$ and $\kappa_2(A) = 2$, from initial guess $\vec{x}_0 = (-1, 0.5)^T$. (Panel title: kappa: 2, steps: 13.)

Algorithm 4.2: Steepest Descent Method for $A\vec{x} = \vec{b}$, $A$ SPD
Input: matrix $A \in \mathbb{R}^{n\times n}$, SPD; right-hand side $\vec{b}$; initial guess $\vec{x}_0$
Output: sequence of approximations $\vec{x}_1, \vec{x}_2, \ldots$
  $\vec{r}_0 = \vec{b} - A\vec{x}_0$
  $k = 0$
  repeat
    $k = k + 1$
    $\alpha_k = (\vec{r}_{k-1}^T \vec{r}_{k-1}) / (\vec{r}_{k-1}^T A \vec{r}_{k-1})$
    $\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{r}_{k-1}$
    $\vec{r}_k = \vec{r}_{k-1} - \alpha_k A \vec{r}_{k-1}$
  until convergence criterion is satisfied

The cost per iteration of the steepest descent algorithm consists of one matrix-vector product, two scalar products of vectors in $\mathbb{R}^n$, and two so-called axpy operations, denoting operations of type $a\vec{x} + \vec{y}$ with vectors in $\mathbb{R}^n$. If $A$ is sparse with $\mathrm{nnz}(A) = O(n)$, then the cost of one steepest descent iteration is $O(n)$.

It can be shown that, if $A$ is SPD, convergence to $\vec{x}^*$ is guaranteed from any initial guess. However, convergence can take many iterations, as illustrated in the following example.

Figure 4.2.2: Steepest descent convergence patterns for matrices (4.2) with $\lambda = 20$ and $\lambda = 200$, from initial guess $\vec{x}_0 = (-1, 0.05)^T$ and $\vec{x}_0 = (-1, 0.005)^T$, respectively. (Panel title: kappa: 20, steps: 139.)

Example 4.3
We consider solving $A\vec{x} = \vec{b}$ with SPD matrix

$$A = \begin{pmatrix} 1 & 0 \\ 0 & \lambda \end{pmatrix}, \tag{4.2}$$

where $\lambda > 1$, and with $\vec{b} = 0$, i.e., the solution is $\vec{x}^* = (0, 0)^T$. Since $A$ is SPD, we have

$$\kappa_2(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} = \lambda.$$

We first consider the case that $\lambda = 2$, i.e., $\kappa_2(A) = 2$. Fig. 4.2.1 shows level curves of $\phi(\vec{x})$, which are ellipses aligned with the coordinate axes. The figure shows the steepest descent convergence pattern starting from initial guess $\vec{x}_0 = (-1, 0.5)^T$. The convergence criterion

$$\frac{\|\vec{r}_i\|}{\|\vec{r}_0\|} \le 10^{-6}$$

is satisfied after 13 steps.

However, Fig. 4.2.2 shows that, when increasing $\lambda$ and $\kappa_2(A)$ to 20 and 200, the number of iterations grows strongly: as $\kappa_2(A)$ increases, the level curves become more elongated and, depending on the choice of the initial guess, this may result in extreme zig-zag patterns. For example, when $\kappa_2(A) = 200$, the method requires more than 1,300 iterations.
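The iteration counts quoted in this example can be reproduced with a short script. The following is a minimal sketch of Algorithm 4.2 in plain Python, assuming a small dense matrix stored as a list of rows; the helper names `matvec`, `dot`, `steepest_descent` and `example_4_3` are illustrative choices and are not part of the notes. The stopping test is the relative residual criterion with tolerance $10^{-6}$ used above.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Scalar product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def steepest_descent(A, b, x0, tol=1e-6, max_iter=100000):
    """Sketch of Algorithm 4.2; stops when ||r_k|| / ||r_0|| <= tol.

    Returns the final iterate and the number of iterations performed.
    """
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # r_0 = b - A x_0
    r0_norm = math.sqrt(dot(r, r))
    k = 0
    while k < max_iter and math.sqrt(dot(r, r)) > tol * r0_norm:
        k += 1
        Ar = matvec(A, r)                  # the one matrix-vector product
        alpha = dot(r, r) / dot(r, Ar)     # optimal step length
        x = [xi + alpha * ri for xi, ri in zip(x, r)]      # axpy
        r = [ri - alpha * ari for ri, ari in zip(r, Ar)]   # axpy
    return x, k

def example_4_3(lam):
    """Matrix (4.2) with b = 0 and initial guess (-1, 1/lambda)^T."""
    A = [[1.0, 0.0], [0.0, lam]]
    b = [0.0, 0.0]
    x0 = [-1.0, 1.0 / lam]
    return steepest_descent(A, b, x0)

for lam in (2.0, 20.0, 200.0):
    _, k = example_4_3(lam)
    print(f"lambda = {lam:g}: {k} iterations")
```

For $\lambda = 2$ and $\lambda = 20$ the sketch takes 13 and 139 iterations, in agreement with the step counts shown in Figs. 4.2.1 and 4.2.2. For these particular initial guesses the residual norm shrinks by exactly the factor $(\lambda - 1)/(\lambda + 1)$ in every step, which is why the count grows roughly in proportion to $\lambda$.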
(Note, on the contrary, that the exact solution is obtained in one step if $\kappa_2(A) = 1$, in which case the level curves are circles and the normal from any point is directed exactly towards the origin, which is the solution of the problem.)

This example shows that, for the steepest descent method, the number of iterations required for convergence may increase proportionally to the matrix condition number, $\kappa$. Since in many examples the condition number grows as a function of problem size, this behaviour is clearly undesirable for the large-scale problems we target. Therefore, we seek iterative methods with improved convergence behaviour. The conjugate gradient method of the next section offers such an improvement.

4.3 The Conjugate Gradient Method

Let $A \in \mathbb{R}^{n\times n}$ be SPD. The Conjugate Gradient (CG) method for $A\vec{x} = \vec{b}$ is given by:

Algorithm 4.4: Conjugate Gradient Method for $A\vec{x} = \vec{b}$, $A$ SPD
Input: matrix $A \in \mathbb{R}^{n\times n}$, SPD; right-hand side $\vec{b}$; initial guess $\vec{x}_0$
Output: sequence of approximations $\vec{x}_1, \vec{x}_2, \ldots$
  1: $\vec{r}_0 = \vec{b} - A\vec{x}_0$
  2: $\vec{p}_0 = \vec{r}_0$
  3: $k = 0$
  4: repeat
  5:   $k = k + 1$
  6:   $\alpha_k = (\vec{r}_{k-1}^T \vec{r}_{k-1}) / (\vec{p}_{k-1}^T A \vec{p}_{k-1})$
  7:   $\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{p}_{k-1}$
  8:   $\vec{r}_k = \vec{r}_{k-1} - \alpha_k A \vec{p}_{k-1}$
  9:   $\beta_k = (\vec{r}_k^T \vec{r}_k) / (\vec{r}_{k-1}^T \vec{r}_{k-1})$
  10:  $\vec{p}_k = \vec{r}_k + \beta_k \vec{p}_{k-1}$
  11: until convergence criterion is satisfied

We first define residual and error, before explaining the context in which the CG algorithm was derived.

Definition 4.5
Consider iterate $\vec{x}_k$ for solving $A\vec{x} = \vec{b}$ with exact solution $\vec{x}^*$. The residual of iterate $\vec{x}_k$ is given by

$$\vec{r}_k = \vec{b} - A\vec{x}_k.$$

The error of iterate $\vec{x}_k$ is given by

$$\vec{e}_k = \vec{x}^* - \vec{x}_k.$$

Note that $A\vec{e}_k = A\vec{x}^* - A\vec{x}_k = \vec{b} - A\vec{x}_k = \vec{r}_k$.

Recall the iteration formula of the steepest descent method,

$$\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{r}_{k-1},$$

where $\vec{r}_{k-1} = -\nabla\phi(\vec{x}_{k-1}) = \vec{b} - A\vec{x}_{k-1}$ is the direction of steepest descent. We have seen in an example that the steepest descent direction may not be a suitable direction when the linear system is ill-conditioned. The CG algorithm aims at making a step in a better direction.
It considers the iteration formula

$$\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{p}_{k-1}, \tag{4.3}$$

or

$$\vec{x}_k = \vec{x}_{k-1} + \vec{q}_k, \tag{4.4}$$

with $\vec{q}_k = \alpha_k \vec{p}_{k-1}$, where $\vec{p}_{k-1}$ is the step direction and $\alpha_k$ is the step length, which are chosen optimally in the following sense.

Definition 4.6
Let $A \in \mathbb{R}^{n\times n}$ and $\vec{r}_0 \in \mathbb{R}^n$. The Krylov space $K_k(\vec{r}_0, A)$ generated by $\vec{r}_0$ and $A$ is the subspace of $\mathbb{R}^n$ defined by

$$K_k(\vec{r}_0, A) = \mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^{k-1}\vec{r}_0\}.$$

Considering Eq. (4.4), the CG method determines the vector $\vec{q}_k$ in the Krylov space $K_k(\vec{r}_0, A)$ such that the error $\vec{e}_k$ is minimised in the $A$-norm. From Eq. (4.4) we have

$$\vec{x}^* - \vec{x}_k = \vec{x}^* - \vec{x}_{k-1} - \vec{q}_k, \quad \text{or} \quad \vec{e}_k = \vec{e}_{k-1} - \vec{q}_k,$$

so CG chooses $\vec{q}_k$ in $K_k(\vec{r}_0, A)$ such that

$$\|\vec{e}_k\|_A = \|\vec{e}_{k-1} - \vec{q}_k\|_A$$

is minimal over all vectors $\vec{q}$ in $K_k(\vec{r}_0, A)$. In the next section we show that Algorithm 4.4 achieves this goal. This optimality leads to desirable convergence properties for broad classes of problems, significantly improving over steepest descent.

Note also that the cost per iteration of the CG algorithm is not much larger than the cost of steepest descent: CG requires one matrix-vector product, two scalar products, and three axpy operations per iteration.

4.4 Properties of the Conjugate Gradient Method

In this section we will show and discuss some properties of the CG algorithm. To make the proofs somewhat easier, we consider, without loss of generality, the case where we solve $A\vec{x} = \vec{b}$ with initial guess $\vec{x}_0 = 0$. This is no restriction, because, when $\vec{x}_0 \neq 0$, it is equivalent to applying CG with zero initial guess to

$$A(\vec{x} - \vec{x}_0) = \vec{b} - A\vec{x}_0, \quad \text{or} \quad A\vec{y} = \vec{c},$$

with $\vec{y} = \vec{x} - \vec{x}_0$ and $\vec{c} = \vec{b} - A\vec{x}_0$. Note that the residuals for $A\vec{y} = \vec{c}$ are the same as for $A\vec{x} = \vec{b}$,

$$\vec{r}_k = \vec{c} - A\vec{y}_k = (\vec{b} - A\vec{x}_0) - A(\vec{x}_k - \vec{x}_0) = \vec{b} - A\vec{x}_k,$$

which implies that the step directions and the $\alpha$ and $\beta$ parameters in the CG algorithm also do not change.
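To make the structure and per-iteration cost of Algorithm 4.4 concrete, here is a minimal sketch in plain Python. It assumes a small dense matrix stored as a list of rows, and the names `matvec`, `dot` and `conjugate_gradient`, the tolerance, and the small tridiagonal test matrix (2 on the diagonal, $-1$ on the off-diagonals) are illustrative assumptions rather than part of the notes. The comments refer to the line numbers of Algorithm 4.4.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Scalar product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Sketch of Algorithm 4.4; stops when ||r_k|| / ||r_0|| <= tol.

    Returns the final iterate and the number of iterations performed.
    """
    if max_iter is None:
        max_iter = 10 * len(b)   # generous cap; exact arithmetic needs <= n
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]       # line 1
    p = list(r)                                            # line 2
    rr = dot(r, r)
    r0_norm = math.sqrt(rr)
    k = 0                                                  # line 3
    while k < max_iter and math.sqrt(rr) > tol * r0_norm:
        k += 1                                             # line 5
        Ap = matvec(A, p)        # the single matrix-vector product
        alpha = rr / dot(p, Ap)                            # line 6
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # line 7
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # line 8
        rr_new = dot(r, r)
        beta = rr_new / rr                                 # line 9
        p = [ri + beta * pi for ri, pi in zip(r, p)]       # line 10
        rr = rr_new
    return x, k

# Illustrative SPD test problem: tridiagonal matrix with stencil (-1, 2, -1).
n = 5
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
      for j in range(n)] for i in range(n)]
b = [1.0] * n
x, k = conjugate_gradient(A, b, [0.0] * n)
print(k, x)
```

Each pass through the loop uses one matrix-vector product, two scalar products (`dot(p, Ap)` and `dot(r, r)`) and three axpy-type updates, matching the operation count stated above. A practical implementation for large sparse problems would store $A$ in a sparse format; the dense list-of-rows storage here is only for readability.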
4.4.1 Orthogonality Properties of Residuals and Step Directions

An important property of CG is that the step directions $\vec{p}_i$ are mutually $A$-orthogonal or $A$-conjugate, from which the method derives its name.

Definition 4.7
Let $A \in \mathbb{R}^{n\times n}$ be an SPD matrix. Then vectors $\vec{p}_i$ and $\vec{p}_j \in \mathbb{R}^n$ are called $A$-orthogonal or $A$-conjugate if $\vec{p}_i^T A \vec{p}_j = 0$.

$A$-orthogonality of the step directions in Algorithm 4.4 is proven as part of the following theorem.

Theorem 4.8
Let $\vec{x}_0 = 0$ in the CG algorithm (Algorithm 4.4). As long as convergence has not been reached before iteration $k$ ($\vec{r}_{k-1} \neq 0$), there are no divisions by 0, and the following hold:

(A) Let
  $X_k = \mathrm{span}\{\vec{x}_1, \ldots, \vec{x}_k\}$,
  $P_k = \mathrm{span}\{\vec{p}_0, \ldots, \vec{p}_{k-1}\}$,
  $R_k = \mathrm{span}\{\vec{r}_0, \ldots, \vec{r}_{k-1}\}$,
  $K_k = \mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^{k-1}\vec{r}_0\} = K_k(\vec{r}_0, A)$.
Then $X_k = P_k = R_k = K_k$.

(B) The residuals are mutually orthogonal: $\vec{r}_k^T \vec{r}_j = 0$ $(j < k)$.

(C) The step directions are mutually $A$-orthogonal: $\vec{p}_k^T A \vec{p}_j = 0$ $(j < k)$.

Proof. The proof is by induction on $k$. The details are quite involved; we provide a sketch of the proof.

(A) Assume $X_{k-1} = P_{k-1} = R_{k-1} = K_{k-1}$. Line 7 in Algorithm 4.4 (l7), $\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{p}_{k-1}$, shows that $X_k = P_k$. And (l10), $\vec{p}_k = \vec{r}_k + \beta_k \vec{p}_{k-1}$, shows that $P_k = R_k$. Finally, (l8), $\vec{r}_k = \vec{r}_{k-1} - \alpha_k A \vec{p}_{k-1}$, shows that $R_k = K_k$.

(B) Taking the transpose of (l8), $\vec{r}_k = \vec{r}_{k-1} - \alpha_k A \vec{p}_{k-1}$, and multiplying with $\vec{r}_j$ on the right, we get

$$\vec{r}_k^T \vec{r}_j = \vec{r}_{k-1}^T \vec{r}_j - \alpha_k \vec{p}_{k-1}^T A \vec{r}_j.$$

Case $j < k-1$: $\vec{r}_k^T \vec{r}_j = 0$, since, by the induction hypothesis, $\vec{r}_{k-1}^T \vec{r}_j = 0$, and $\vec{p}_{k-1}^T A \vec{r}_j = 0$ since $\vec{r}_j \in P_{k-1}$.

Case $j = k-1$: $\vec{r}_k^T \vec{r}_{k-1} = 0$ if $\alpha_k = (\vec{r}_{k-1}^T \vec{r}_{k-1}) / (\vec{p}_{k-1}^T A \vec{r}_{k-1})$. However, this is equivalent to (l6):

$$\alpha_k = \frac{\vec{r}_{k-1}^T \vec{r}_{k-1}}{\vec{p}_{k-1}^T A \vec{r}_{k-1}} = \frac{\vec{r}_{k-1}^T \vec{r}_{k-1}}{\vec{p}_{k-1}^T A (\vec{p}_{k-1} - \beta_{k-1} \vec{p}_{k-2})} \ \text{(by (l10))} \ = \frac{\vec{r}_{k-1}^T \vec{r}_{k-1}}{\vec{p}_{k-1}^T A \vec{p}_{k-1}} \ \text{(by $A$-orthogonality)}.$$

(C) Taking the transpose of (l10), $\vec{p}_k = \vec{r}_k + \beta_k \vec{p}_{k-1}$, and multiplying with $A\vec{p}_j$ on the right, we get

$$\vec{p}_k^T A \vec{p}_j = \vec{r}_k^T A \vec{p}_j + \beta_k \vec{p}_{k-1}^T A \vec{p}_j.$$
Case $j < k-1$: $\vec{p}_k^T A \vec{p}_j = 0$, since $\vec{r}_k^T A \vec{p}_j = \vec{r}_k^T (\vec{r}_j - \vec{r}_{j+1}) / \alpha_{j+1} = 0$ (using (l8) and residual orthogonality), and, by the induction hypothesis, $\vec{p}_{k-1}^T A \vec{p}_j = 0$.

Case $j = k-1$: $\vec{p}_k^T A \vec{p}_{k-1} = 0$ if $\beta_k = -(\vec{r}_k^T A \vec{p}_{k-1}) / (\vec{p}_{k-1}^T A \vec{p}_{k-1})$. However, this is equivalent to (l9):

$$\begin{aligned}
\beta_k = \frac{-\vec{r}_k^T A \vec{p}_{k-1}}{\vec{p}_{k-1}^T A \vec{p}_{k-1}}
&= \frac{-\vec{r}_k^T A \vec{p}_{k-1}}{\vec{p}_{k-1}^T A \vec{p}_{k-1}} \cdot \frac{\alpha_k}{\alpha_k} \\
&= \frac{\vec{r}_k^T (-\alpha_k A \vec{p}_{k-1})}{\vec{r}_{k-1}^T \vec{r}_{k-1}} && \text{(by (l6))} \\
&= \frac{\vec{r}_k^T (\vec{r}_k - \vec{r}_{k-1})}{\vec{r}_{k-1}^T \vec{r}_{k-1}} && \text{(by (l8))} \\
&= \frac{\vec{r}_k^T \vec{r}_k}{\vec{r}_{k-1}^T \vec{r}_{k-1}} && \text{(by residual orthogonality).}
\end{aligned}$$

Some additional comments can be made about the residual orthogonality,

$$\vec{r}_k^T \vec{r}_j = 0 \qquad (j < k). \tag{4.5}$$

• Condition (4.5) implies that, for consecutive residuals, $\vec{r}_k^T \vec{r}_{k-1} = 0$, as in the steepest descent method. However, condition (4.5) also implies that $\vec{r}_k$ is orthogonal to all previous residuals, which is clearly a much stronger property than for steepest descent. In fact, this implies finite termination in at most $n$ steps: since the $\vec{r}_i$ are mutually orthogonal vectors in $\mathbb{R}^n$, and there can be at most $n$ nonzero mutually orthogonal vectors in $\mathbb{R}^n$, we have $\vec{r}_n = 0$. So we have proved the following theorem:

Theorem 4.9
The CG algorithm converges to the exact solution in at most $n$ steps (in exact arithmetic).

This property may suggest that we can consider CG as a direct method, but in practice it is used as an iterative method, because in many practical cases it attains an accurate approximation in far fewer than $n$ steps. Figure 4.4.1 compares the performance of the CG and steepest descent methods for the 2D Laplacian matrix.

• It can be shown that, in the update $\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{p}_{k-1}$, CG chooses the optimal step length along direction $\vec{p}_{k-1}$, as in steepest descent:

$$\frac{d}{d\alpha_k} \phi(\vec{x}_k(\alpha_k)) = 0.$$

It is easy to show that this requires step length

$$\alpha_k = \frac{\vec{r}_{k-1}^T \vec{p}_{k-1}}{\vec{p}_{k-1}^T A \vec{p}_{k-1}},$$

which can be shown to be equivalent to (l6) in Algorithm 4.4.

4.4.2 Optimal Error Reduction in the A-Norm

Theorem 4.10
Let $\vec{x}_0 = 0$ in the CG algorithm (Algorithm 4.4).
As long as convergence has not been reached before iteration $k$ ($\vec{r}_{k-1} \neq 0$), the iterate $\vec{x}_k$ minimises $\|\vec{e}_k\|_A = \|\vec{x}^* - \vec{x}_k\|_A$ over the Krylov space $K_k(\vec{r}_0, A)$.

Figure 4.4.1: Comparison of steepest descent and CG convergence histories for the 2D Laplacian with $N = 32$ and $n = 1024$, with RHS a vector of all-ones, and zero initial guess. The condition number $\kappa_2(A) \approx 440$. (Axes: iterations versus $\log_{10}$(residual); curves: CG and steepest descent.)

Proof. We know that $\vec{x}_k \in K_k(\vec{r}_0, A)$. Consider any vector $\vec{y} \in K_k(\vec{r}_0, A)$ that is different from $\vec{x}_k$, i.e., $\vec{y} = \vec{x}_k + \vec{z}$ for some $\vec{z} \in K_k(\vec{r}_0, A)$, $\vec{z} \neq 0$. Then

$$\vec{e}_{\vec{y}} = \vec{x}^* - \vec{y} = \vec{x}^* - \vec{x}_k - \vec{z} = \vec{e}_k - \vec{z}.$$

We have

$$\begin{aligned}
\|\vec{e}_{\vec{y}}\|_A^2 &= (\vec{e}_k - \vec{z})^T A (\vec{e}_k - \vec{z}) \\
&= \vec{e}_k^T A \vec{e}_k - \vec{e}_k^T A \vec{z} - \vec{z}^T A \vec{e}_k + \vec{z}^T A \vec{z} \\
&= \vec{e}_k^T A \vec{e}_k - 2\vec{z}^T A \vec{e}_k + \vec{z}^T A \vec{z} && \text{(since } A = A^T\text{)} \\
&= \vec{e}_k^T A \vec{e}_k - 2\vec{z}^T \vec{r}_k + \vec{z}^T A \vec{z} && \text{(since } A\vec{e}_k = \vec{r}_k\text{)} \\
&= \|\vec{e}_k\|_A^2 + \vec{z}^T A \vec{z} && \text{(since } \vec{z} \in K_k(\vec{r}_0, A) = \mathrm{span}\{\vec{r}_0, \ldots, \vec{r}_{k-1}\}\text{, to which } \vec{r}_k \text{ is orthogonal),}
\end{aligned}$$

so $\|\vec{e}_{\vec{y}}\|_A \ge \|\vec{e}_k\|_A$, since $A$ is SPD.

Note: this theorem implies that $\|\vec{e}_k\|_A \le \|\vec{e}_{k-1}\|_A$, since $K_{k-1} \subset K_k$. We say that convergence in the $A$-norm is monotone.

4.4.3 Convergence Speed

The following theorems can be proved about the convergence speed of the steepest descent and CG methods.

Theorem 4.11
Let $A \in \mathbb{R}^{n\times n}$ be SPD. Let $\kappa$ be the 2-norm condition number of $A$, $\kappa = \kappa_2(A)$. Then the errors of the iterates in the steepest descent method satisfy

$$\frac{\|\vec{e}_k\|_A}{\|\vec{e}_0\|_A} \le \left( \frac{\kappa - 1}{\kappa + 1} \right)^k.$$

Theorem 4.12
Let $A \in \mathbb{R}^{n\times n}$ be SPD. Let $\kappa$ be the 2-norm condition number of $A$, $\kappa = \kappa_2(A)$. Then the errors of the iterates in the CG method satisfy

$$\frac{\|\vec{e}_k\|_A}{\|\vec{e}_0\|_A} \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k.$$

It can be shown that, for large $\kappa$, this leads to the following estimates for the number of iterations $k$ required to converge to

$$\frac{\|\vec{e}_k\|_A}{\|\vec{e}_0\|_A} \approx \epsilon$$

with a fixed small $\epsilon$:
• steepest descent: $k = O(\kappa)$.
• CG: $k = O(\sqrt{\kappa})$.
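These estimates can be checked numerically by asking how many iterations the two bounds need before they drop below a tolerance $\epsilon$. The sketch below does this in plain Python; the function names `sd_bound_iterations` and `cg_bound_iterations` are illustrative, and the value $\kappa = 440$ is simply the condition number quoted in the caption of Figure 4.4.1. Since Theorems 4.11 and 4.12 give upper bounds on the $A$-norm of the error, the counts are worst-case estimates and not predictions of the iteration counts observed in practice.

```python
import math

def sd_bound_iterations(kappa, eps):
    """Smallest k with ((kappa - 1)/(kappa + 1))**k <= eps (Theorem 4.11).

    Assumes kappa > 1 and 0 < eps < 1.
    """
    rho = (kappa - 1.0) / (kappa + 1.0)
    return math.ceil(math.log(eps) / math.log(rho))

def cg_bound_iterations(kappa, eps):
    """Smallest k with 2*((sqrt(kappa) - 1)/(sqrt(kappa) + 1))**k <= eps
    (Theorem 4.12). Assumes kappa > 1 and 0 < eps < 1.
    """
    s = math.sqrt(kappa)
    rho = (s - 1.0) / (s + 1.0)
    return math.ceil(math.log(eps / 2.0) / math.log(rho))

eps = 1e-6
for kappa in (440.0, 4 * 440.0, 16 * 440.0):
    print(kappa,
          sd_bound_iterations(kappa, eps),
          cg_bound_iterations(kappa, eps))
```

For $\kappa = 440$ the steepest descent bound needs on the order of 3000 iterations and the CG bound on the order of 150. Each time $\kappa$ is multiplied by 4, the first count grows by roughly a factor of 4 and the second by roughly a factor of 2, which is the $O(\kappa)$ versus $O(\sqrt{\kappa})$ behaviour stated above.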
Example 4.13: Condition Number of 1D Laplacian
Consider the 1D Laplacian matrix

$$A = \begin{pmatrix}
-2 & 1 & & & & \\
1 & -2 & 1 & & & \\
 & 1 & -2 & 1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1 & -2 & 1 \\
 & & & & 1 & -2
\end{pmatrix} \in \mathbb{R}^{n\times n}.$$

It can be shown that the $n$ eigenvalues of this matrix are given by

$$\lambda_k = 2(\cos(k\pi h) - 1) = -4\sin^2(k\pi h/2) = -4\sin^2\!\left( \frac{k\pi}{2(n+1)} \right) \qquad (k = 1, \ldots, n),$$

where $h = 1/(n+1)$. Since $A$ is symmetric and all eigenvalues are strictly negative, this matrix is symmetric negative definite. (I.e., $-A$ is SPD.)

Since $A$ is symmetric,

$$\kappa_2(A) = \frac{|\lambda|_{\max}(A)}{|\lambda|_{\min}(A)}.$$

It is easy to show that

$$\kappa_2(A) = O\!\left( \frac{1}{h^2} \right) = O(n^2).$$

This means that linear systems with $A$ become increasingly harder to solve for iterative methods when $n$ grows:
• Steepest descent takes $O(\kappa) = O(n^2)$ iterations, and since the cost per iteration is $O(n)$, the total cost is $O(n^3)$.
• CG takes $O(\sqrt{\kappa}) = O(n)$ iterations, so the total cost is $O(n^2)$.

Example 4.14: Condition Number of 2D Laplacian
Consider the 2D Laplacian matrix

$$A = \begin{pmatrix}
T & I & & & & \\
I & T & I & & & \\
 & I & T & I & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & I & T & I \\
 & & & & I & T
\end{pmatrix} \in \mathbb{R}^{n\times n},$$

where $n = N^2$ and $T$ and $I$ are block matrices $\in \mathbb{R}^{N\times N}$ ($T$ is tridiagonal with elements $1, -4, 1$ on the three diagonals). (Here, $N$ is the number of interior points in the $x$ and $y$ directions, and $h = 1/(N+1)$. Note that $A$ does not include the $1/h^2$ factor.)

It can be shown that the $n = N^2$ eigenvalues of this matrix are given by the $N^2$ possible sums of the $N$ eigenvalues of the 1D Laplacian matrices in the $x$ and $y$ directions:

$$\begin{aligned}
\lambda_{k,l} &= 2(\cos(k\pi h) - 1 + \cos(l\pi h) - 1) \qquad (k = 1, \ldots, N;\ l = 1, \ldots, N) \\
&= -4(\sin^2(k\pi h/2) + \sin^2(l\pi h/2)) \\
&= -4\left( \sin^2\!\left( \frac{k\pi}{2(N+1)} \right) + \sin^2\!\left( \frac{l\pi}{2(N+1)} \right) \right),
\end{aligned}$$

where $h = 1/(N+1)$. Since $A$ is symmetric and all eigenvalues are strictly negative, this matrix is symmetric negative definite. (I.e., $-A$ is SPD.)

It is easy to show that

$$\kappa_2(A) = O\!\left( \frac{1}{h^2} \right) = O(N^2) = O(n).$$
This means that linear systems with $A$ become increasingly harder to solve for iterative methods when $n$ grows:
• Steepest descent takes $O(\kappa) = O(n)$ iterations, and since the cost per iteration is $O(n)$, the total cost is $O(n^2)$.
• CG takes $O(\sqrt{\kappa}) = O(\sqrt{n})$ iterations, so the total cost is $O(n^{3/2})$.

4.5 Preconditioning for the Conjugate Gradient Method

4.5.1 Preconditioning for Solving Linear Systems

We saw in the previous section that the number of iterations, $k$, required to reach a specific tolerance when solving a linear system $A\vec{x} = \vec{b}$ using CG, satisfies

$$k = O(\sqrt{\kappa(A)}).$$

If $A$ is ill-conditioned, this may lead to large numbers of iterations, which can be especially undesirable when the condition number grows as a function of problem size, as for our 2D model problem.

The idea of preconditioning the linear system aims at reducing the number of iterations an iterative method requires for convergence by reformulating the linear system as an equivalent problem that has the same solution, but features a matrix with a smaller condition number.

The first approach is the idea of left preconditioning: multiply $A\vec{x} = \vec{b}$ on the left with a nonsingular matrix $P \in \mathbb{R}^{n\times n}$ to obtain the equivalent linear system

$$PA\vec{x} = P\vec{b},$$

where the preconditioning matrix (or preconditioner) $P$ is chosen such that

$$\kappa(PA) \ll \kappa(A),$$

perhaps by choosing $P$ such that $P \approx A^{-1}$. Such a choice may reduce the condition number and the number of iterations substantially, and will not increase the cost per iteration too much if $P$ is a cheaply computable approximation of $A^{-1}$. More broadly, the convergence speed of iterative methods for general linear systems $A\vec{x} = \vec{b}$, where $A$ is not necessarily SPD, usually depends on the eigenvalue distribution of the matrix (e.g., the clustering of eigenvalues) and its condition number, and the goal of preconditioning is to improve the eigenvalue distribution of $PA$ such that the iterative method converges faster than for $A$.
An alternative approach to left preconditioning is the idea of right preconditioning with a nonsingular matrix $P \in \mathbb{R}^{n\times n}$, which reformulates the system as

$$APP^{-1}\vec{x} = \vec{b},$$

and solves $AP\vec{y} = \vec{b}$ using the iterative method, where $\vec{x}$ is obtained at the end from $\vec{x} = P\vec{y}$.

4.5.2 Left Preconditioning for CG

The general matrix preconditioning strategies described above are, however, not directly applicable to CG, because CG requires the system matrix to be SPD, and, with the original $A$ being SPD, $PA$ or $AP$ are generally not symmetric. However, preconditioning can be applied to CG as follows.

When applying preconditioning to linear system $A\vec{x} = \vec{b}$ with $A \in \mathbb{R}^{n\times n}$ an SPD matrix, we choose a preconditioning matrix $P$ that is SPD. The matrix $P$ can always be written as the product $XX^T$, where $X$ is a nonsingular matrix in $\mathbb{R}^{n\times n}$:

$$P = V\Lambda V^T = V\sqrt{\Lambda}\sqrt{\Lambda}V^T = (V\sqrt{\Lambda})(V\sqrt{\Lambda})^T = XX^T,$$

where $V \in \mathbb{R}^{n\times n}$ contains $n$ orthonormal eigenvectors of $P$, and $\Lambda \in \mathbb{R}^{n\times n}$ is a diagonal matrix containing the corresponding eigenvalues, which are strictly positive.

Then we can, using a change of variables, reformulate the left-preconditioned linear system as an equivalent system with an SPD matrix as follows:

$$\begin{aligned}
PA\vec{x} &= P\vec{b} \\
XX^T A\vec{x} &= XX^T\vec{b} \\
X^T A X X^{-1}\vec{x} &= X^T\vec{b} \\
(X^T A X)\vec{y} &= X^T\vec{b} \\
B\vec{y} &= \vec{c},
\end{aligned}$$

where

$$B = X^T A X, \qquad \vec{c} = X^T\vec{b}, \qquad \vec{y} = X^{-1}\vec{x}.$$

The following result shows that $B$ is SPD, so we can apply CG to $B\vec{y} = \vec{c}$.

Theorem 4.15
Let $A \in \mathbb{R}^{n\times n}$ be SPD and $X \in \mathbb{R}^{n\times n}$ be nonsingular. Then $B = X^T A X$ is SPD.

Proof. $B$ is symmetric since $A$ is symmetric. Moreover, for any $\vec{x} \neq 0$,

$$\vec{x}^T (X^T A X) \vec{x} = (X\vec{x})^T A (X\vec{x}) > 0,$$

since $A$ is SPD and $X\vec{x} \neq 0$ because $X$ is nonsingular.

Also, $B$ has the same eigenvalues as $PA$, so the eigenvalues of $PA$ determine the 2-norm condition number of $B$, and hence the speed of convergence of CG applied to $B\vec{y} = \vec{c}$.

Theorem 4.16
Let $A \in \mathbb{R}^{n\times n}$ be SPD and $X \in \mathbb{R}^{n\times n}$ be nonsingular, with $P = XX^T$.
Then $B = X^T A X$ has the same eigenvalues as $PA$.

Proof. This follows because $B$ is similar to $PA$:

$$B = X^T A X = (X^{-1}X) X^T A X = X^{-1}(XX^T A)X = X^{-1}(PA)X,$$

which implies that $B$ and $PA$ have the same eigenvalues.

4.5.3 Preconditioned CG (PCG) Algorithm

Applying CG to

$$B\vec{y} = \vec{c}$$

results in the following algorithm, where we use notation with a hat for the residuals $\hat{\vec{r}}_k$ and search directions $\hat{\vec{p}}_k$ associated with formulating the CG algorithm for computing $\vec{y}$, rather than $\vec{x}$.

Algorithm 4.17: Preconditioned CG (PCG) Method (Version 1)
Input: SPD matrix $A$, RHS $\vec{b}$; initial guess $\vec{x}_0$; SPD preconditioner $P = XX^T$
Output: approximation $\vec{x}_k$ after stopping criterion is satisfied
  1: $B = X^T A X$
  2: $\vec{c} = X^T \vec{b}$  (we will apply CG to $B\vec{y} = \vec{c}$)
  3: $\vec{y}_0 = X^{-1}\vec{x}_0$
  4: $\hat{\vec{r}}_0 = \vec{c} - B\vec{y}_0$
  5: $\hat{\vec{p}}_0 = \hat{\vec{r}}_0$
  6: $k = 0$
  7: repeat
  8:   $k = k + 1$
  9:   $\alpha_k = (\hat{\vec{r}}_{k-1}^T \hat{\vec{r}}_{k-1}) / (\hat{\vec{p}}_{k-1}^T B \hat{\vec{p}}_{k-1})$
  10:  $\vec{y}_k = \vec{y}_{k-1} + \alpha_k \hat{\vec{p}}_{k-1}$
  11:  $\hat{\vec{r}}_k = \hat{\vec{r}}_{k-1} - \alpha_k B \hat{\vec{p}}_{k-1}$
  12:  $\beta_k = (\hat{\vec{r}}_k^T \hat{\vec{r}}_k) / (\hat{\vec{r}}_{k-1}^T \hat{\vec{r}}_{k-1})$
  13:  $\hat{\vec{p}}_k = \hat{\vec{r}}_k + \beta_k \hat{\vec{p}}_{k-1}$
  14: until stopping criterion is satisfied
  15: $\vec{x}_k = X\vec{y}_k$

It turns out, however, that the PCG algorithm can be reformulated in terms of the original $\vec{x}$ variable, in a way that involves only the $P$ and $A$ matrices, without explicit need for the $X$ and $X^T$ factors. This proceeds as follows.

We first multiply (l10) in Algorithm 4.17 by $X$ from the left to convert from $\vec{y}$ to $\vec{x} = X\vec{y}$:

$$\begin{aligned}
X\vec{y}_k &= X\vec{y}_{k-1} + \alpha_k X\hat{\vec{p}}_{k-1} \\
\vec{x}_k &= \vec{x}_{k-1} + \alpha_k \vec{p}_{k-1},
\end{aligned}$$

where we have defined the search direction for $\vec{x}_k$, $\vec{p}_{k-1}$, by

$$\vec{p}_{k-1} = X\hat{\vec{p}}_{k-1}.$$

Next we observe that the residuals for $\vec{y}$ and $\vec{x}$ are related by

$$\hat{\vec{r}} = \vec{c} - B\vec{y} = X^T\vec{b} - X^T A X\vec{y} = X^T(\vec{b} - A\vec{x}) = X^T\vec{r},$$

which we use to transform (l11) to

$$\begin{aligned}
\hat{\vec{r}}_k &= \hat{\vec{r}}_{k-1} - \alpha_k B\hat{\vec{p}}_{k-1} \\
X^T\vec{r}_k &= X^T\vec{r}_{k-1} - \alpha_k X^T A X\hat{\vec{p}}_{k-1} \\
\vec{r}_k &= \vec{r}_{k-1} - \alpha_k A\vec{p}_{k-1}.
\end{aligned}$$

Then we multiply (l13) by $X$ from the left to convert from $\hat{\vec{p}}$ to $\vec{p}$:

$$\begin{aligned}
X\hat{\vec{p}}_k &= X\hat{\vec{r}}_k + \beta_k X\hat{\vec{p}}_{k-1} \\
\vec{p}_k &= XX^T\vec{r}_k + \beta_k \vec{p}_{k-1} \\
\vec{p}_k &= P\vec{r}_k + \beta_k \vec{p}_{k-1},
\end{aligned}$$

where we have used that $P = XX^T$.
Finally, to convert the scalar products in $\alpha_k$ and $\beta_k$ to use $\vec{r}_k$ and $\vec{p}_k$, we write

$$\hat{\vec{r}}_k^T \hat{\vec{r}}_k = (X^T\vec{r}_k)^T (X^T\vec{r}_k) = \vec{r}_k^T XX^T \vec{r}_k = \vec{r}_k^T P \vec{r}_k,$$

and

$$\hat{\vec{p}}_{k-1}^T B \hat{\vec{p}}_{k-1} = (X^{-1}\vec{p}_{k-1})^T X^T A X (X^{-1}\vec{p}_{k-1}) = \vec{p}_{k-1}^T X^{-T} X^T A X X^{-1} \vec{p}_{k-1} = \vec{p}_{k-1}^T A \vec{p}_{k-1},$$

resulting in

$$\alpha_k = \frac{\vec{r}_{k-1}^T P \vec{r}_{k-1}}{\vec{p}_{k-1}^T A \vec{p}_{k-1}}, \qquad \beta_k = \frac{\vec{r}_k^T P \vec{r}_k}{\vec{r}_{k-1}^T P \vec{r}_{k-1}}.$$

This gives the second version of the PCG algorithm:

Algorithm 4.18: PCG Method (Version 2)
Input: SPD matrix $A$, RHS $\vec{b}$; initial guess $\vec{x}_0$; SPD preconditioner $P$
Output: sequence of approximations $\vec{x}_1, \vec{x}_2, \ldots$
  1: $\vec{r}_0 = \vec{b} - A\vec{x}_0$
  2: $\vec{p}_0 = P\vec{r}_0$
  3: $k = 0$
  4: repeat
  5:   $k = k + 1$
  6:   $\alpha_k = (\vec{r}_{k-1}^T P \vec{r}_{k-1}) / (\vec{p}_{k-1}^T A \vec{p}_{k-1})$
  7:   $\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{p}_{k-1}$
  8:   $\vec{r}_k = \vec{r}_{k-1} - \alpha_k A \vec{p}_{k-1}$
  9:   $\beta_k = (\vec{r}_k^T P \vec{r}_k) / (\vec{r}_{k-1}^T P \vec{r}_{k-1})$
  10:  $\vec{p}_k = P\vec{r}_k + \beta_k \vec{p}_{k-1}$
  11: until stopping criterion is satisfied

In practice, multiplication of a residual $\vec{r}$ by $P$ to obtain a preconditioned residual $\vec{q} = P\vec{r}$ usually involves solving a linear system: since $P \approx A^{-1}$, we normally know the sparse matrix $P^{-1} \approx A$, and we solve

$$P^{-1}\vec{q} = \vec{r}$$

for $\vec{q}$. This step needs to be performed only once per iteration, and it is worthwhile to rewrite the algorithm once more to indicate this explicitly:

Algorithm 4.19: PCG Method (Version 3)
Input: SPD matrix $A$, RHS $\vec{b}$; initial guess $\vec{x}_0$; SPD preconditioner $P$
Output: sequence of approximations $\vec{x}_1, \vec{x}_2, \ldots$
  1: $\vec{r}_0 = \vec{b} - A\vec{x}_0$
  2: solve $P^{-1}\vec{q}_0 = \vec{r}_0$ for $\vec{q}_0$ (the preconditioned residual)
  3: $\vec{p}_0 = \vec{q}_0$
  4: $k = 0$
  5: repeat
  6:   $k = k + 1$
  7:   $\alpha_k = (\vec{r}_{k-1}^T \vec{q}_{k-1}) / (\vec{p}_{k-1}^T A \vec{p}_{k-1})$
  8:   $\vec{x}_k = \vec{x}_{k-1} + \alpha_k \vec{p}_{k-1}$
  9:   $\vec{r}_k = \vec{r}_{k-1} - \alpha_k A \vec{p}_{k-1}$
  10:  solve $P^{-1}\vec{q}_k = \vec{r}_k$ for $\vec{q}_k$ (the preconditioned residual)
  11:  $\beta_k = (\vec{r}_k^T \vec{q}_k) / (\vec{r}_{k-1}^T \vec{q}_{k-1})$
  12:  $\vec{p}_k = \vec{q}_k + \beta_k \vec{p}_{k-1}$
  13: until stopping criterion is satisfied

4.5.4 Preconditioners for PCG

We now briefly describe some standard preconditioners that are often used when solving linear systems $A\vec{x} = \vec{b}$.
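Before turning to specific choices of $P$, the structure of Algorithm 4.19 can be sketched in plain Python with the preconditioner passed in as a function that maps a residual $\vec{r}$ to the preconditioned residual $\vec{q}$. The names `pcg`, `apply_P` and `diagonal_preconditioner`, the dense list-of-rows storage, and the small $3\times 3$ test matrix are illustrative assumptions. The example uses diagonal scaling, $\vec{q} = D^{-1}\vec{r}$ with $D$ the diagonal of $A$, which is SPD whenever $A$ is SPD.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Scalar product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def pcg(A, b, x0, apply_P, tol=1e-10, max_iter=1000):
    """Sketch of Algorithm 4.19; apply_P(r) must return q with P^{-1} q = r.

    Stops when ||r_k|| / ||r_0|| <= tol. Returns (x, iterations).
    """
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]       # line 1
    q = apply_P(r)                                         # line 2
    p = list(q)                                            # line 3
    rq = dot(r, q)
    r0_norm = math.sqrt(dot(r, r))
    k = 0                                                  # line 4
    while k < max_iter and math.sqrt(dot(r, r)) > tol * r0_norm:
        k += 1                                             # line 6
        Ap = matvec(A, p)
        alpha = rq / dot(p, Ap)                            # line 7
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # line 8
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # line 9
        q = apply_P(r)                                     # line 10
        rq_new = dot(r, q)
        beta = rq_new / rq                                 # line 11
        p = [qi + beta * pi for qi, pi in zip(q, p)]       # line 12
        rq = rq_new
    return x, k

def diagonal_preconditioner(A):
    """Return a function applying q = D^{-1} r, with D the diagonal of A."""
    d = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, d)]

# Illustrative SPD matrix with a badly scaled diagonal.
A = [[10.0, 1.0, 0.0],
     [1.0, 5.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0]
x, k = pcg(A, b, [0.0, 0.0, 0.0], diagonal_preconditioner(A))
print(k, x)
```

Passing the preconditioner as a function mirrors the remark above that applying $P$ usually means solving a system with $P^{-1}$: the caller decides how that solve is carried out, and `pcg` only needs its result once per iteration.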
We begin by writing $A$ as a sum of its diagonal part and its strictly lower and upper triangular parts,

$$A = A_D - A_L - A_U,$$

where the convention of using negative signs for the triangular parts stems from SPD matrices with positive diagonal elements and negative off-diagonal elements that arise in the context of certain PDE problems (e.g., $-A$ for our 2D Laplacian).

Example 4.20
The following standard preconditioning matrices are often used as cheaply computable approximations of $A^{-1}$, where we assume that the matrix inverses in the expressions exist:
1. Jacobi: $P = A_D^{-1}$
2. Gauss-Seidel (GS): $P = (A_D - A_L)^{-1}$
3. Symmetric Gauss-Seidel (SGS): $P = (A_D - A_U)^{-1} A_D (A_D - A_L)^{-1}$
4. Successive Over-Relaxation (SOR): $P = \omega (A_D - \omega A_L)^{-1}$ $(\omega \in (0, 2))$
5. Symmetric Successive Over-Relaxation (SSOR): $P = \omega(2 - \omega)(A_D - \omega A_U)^{-1} A_D (A_D - \omega A_L)^{-1}$ $(\omega \in (0, 2))$

A few notes:
• 1., 3. and 5. give symmetric preconditioners $P$ when $A$ is symmetric (i.e., $A_L^T = A_U$), and they are the only ones that can be used with CG.
• Preconditioners 2.-5. contain (a sequence of) triangular matrices, which can be inverted inexpensively by forward or backward substitution.

If $A$ is sparse with $\mathrm{nnz}(A) = O(n)$, the cost of applying these preconditioners is $O(n)$, so preconditioning does not increase the computational complexity per iteration beyond $O(n)$. It may substantially reduce the number of iterations required for convergence, and hence may lead to faster overall solve times and better scalability for large problems.

4.5.5 Using Preconditioners as Stand-Alone Iterative Methods

The preconditioning matrices presented in the previous section can also be used as iterative methods by themselves, as we now explain.

When solving $A\vec{x} = \vec{b}$ with exact solution $\vec{x}^*$ and residual and error

$$\vec{r} = \vec{b} - A\vec{x}, \qquad \vec{e} = \vec{x}^* - \vec{x},$$

satisfying $A\vec{e} = \vec{r}$, we start from the identity

$$\vec{x}^* = \vec{x} + \vec{e} = \vec{x} + A^{-1}\vec{r}.$$
We obtain a stationary iterative method by considering an easily computable approximate inverse $P$ of $A$, $P \approx A^{-1}$, and writing

$$\vec{x}_{k+1} = \vec{x}_k + P\vec{r}_k, \tag{4.6}$$

where $\vec{r}_k = \vec{b} - A\vec{x}_k$. We easily derive the error propagation equation

$$\begin{aligned}
\vec{x}^* - \vec{x}_{k+1} &= \vec{x}^* - \vec{x}_k - P\vec{r}_k, \\
\vec{e}_{k+1} &= \vec{e}_k - PA\vec{e}_k, \\
\vec{e}_{k+1} &= (I - PA)\vec{e}_k.
\end{aligned}$$

It can be shown that the iteration converges for any initial guess $\vec{x}_0$ when $\|I - PA\|_p < 1$ in some $p$-norm.

Example 4.21
The Gauss-Seidel (GS) iterative method for $A\vec{x} = \vec{b}$ with $A \in \mathbb{R}^{n\times n}$ computes a new approximation $\vec{x}^{\mathrm{new}}$ from a previous iterate $\vec{x}^{\mathrm{old}}$ by (considering a simple $3 \times 3$ example)

$$\begin{aligned}
a_{11} x_1^{\mathrm{new}} + a_{12} x_2^{\mathrm{old}} + a_{13} x_3^{\mathrm{old}} &= b_1 \\
a_{21} x_1^{\mathrm{new}} + a_{22} x_2^{\mathrm{new}} + a_{23} x_3^{\mathrm{old}} &= b_2 \\
a_{31} x_1^{\mathrm{new}} + a_{32} x_2^{\mathrm{new}} + a_{33} x_3^{\mathrm{new}} &= b_3.
\end{aligned} \tag{4.7}$$

Rearranging (for the general $n \times n$ case) yields the defining equation for the Gauss-Seidel method:

$$x_i^{\mathrm{new}} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{\mathrm{new}} - \sum_{j=i+1}^{n} a_{ij} x_j^{\mathrm{old}} \right). \tag{4.8}$$

Using $A = A_D - A_L - A_U$, we can derive a matrix expression for this method by

$$\begin{aligned}
A\vec{x} &= \vec{b} \\
(A_D - A_L - A_U)\vec{x} &= \vec{b} \\
(A_D - A_L)\vec{x}_{k+1} &= A_U\vec{x}_k + \vec{b} \\
\vec{x}_{k+1} &= (A_D - A_L)^{-1}\big( (A_D - A_L - A)\vec{x}_k + \vec{b} \big) \\
&= \vec{x}_k + (A_D - A_L)^{-1}(\vec{b} - A\vec{x}_k) \\
&= \vec{x}_k + (A_D - A_L)^{-1}\vec{r}_k.
\end{aligned}$$

Comparing with the general update formula (4.6), we identify the preconditioning matrix $P$ for GS as $P = (A_D - A_L)^{-1}$.

A few notes:
• The Jacobi iteration is defined by

$$x_i^{\mathrm{new}} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{\mathrm{old}} - \sum_{j=i+1}^{n} a_{ij} x_j^{\mathrm{old}} \right). \tag{4.9}$$

• The preconditioning matrix for Symmetric Gauss-Seidel (SGS) can be derived by concatenating a forward and a backward Gauss-Seidel sweep:

$$\begin{aligned}
\vec{x}_{k+1/2} &= \vec{x}_k + (A_D - A_L)^{-1}\vec{r}_k, \\
\vec{x}_{k+1} &= \vec{x}_{k+1/2} + (A_D - A_U)^{-1}\vec{r}_{k+1/2}.
\end{aligned}$$

• The Successive Over-Relaxation (SOR) method for $A\vec{x} = \vec{b}$ is an iterative method in which, for every component, a linear combination is taken of a Gauss-Seidel-like update and the old value:

$$x_i^{\mathrm{new}} = (1 - \omega)\, x_i^{\mathrm{old}} + \omega \, \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{\mathrm{new}} - \sum_{j=i+1}^{n} a_{ij} x_j^{\mathrm{old}} \right),$$

with $\omega$ a fixed weight.
Symmetric Successive Over-Relaxation (SSOR) is obtained by combining a forward and a backward SOR sweep.

Chapter 5
The GMRES Method for Sparse Nonsymmetric Systems

5.1 Minimising the Residual

In this chapter, we consider the generalised minimal residual (GMRES) iterative method for solving linear systems $A\vec{x} = \vec{b}$ with $A \in \mathbb{R}^{n \times n}$ a nonsingular matrix.

We recall that the CG method for linear systems with $A$ SPD seeks the optimal update in the Krylov space generated by $A$ and the first residual, $\vec{r}_0$:
\[ K_{i+1}(\vec{r}_0, A) = \mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^i\vec{r}_0\}. \]
CG considers the update formula
\[ \vec{x}_{i+1} = \vec{x}_i + \alpha_{i+1}\vec{p}_i, \tag{5.1} \]
or
\[ \vec{x}_{i+1} = \vec{x}_i + \vec{z}_i, \tag{5.2} \]
and the update $\vec{z}_i$ is determined in the Krylov space $K_{i+1}(\vec{r}_0, A)$ such that the error $\vec{e}_{i+1}$ is minimised in the $A$-norm. That is, with $\vec{e}_{i+1} = \vec{e}_i - \vec{z}_i$, each step of CG minimises the $A$-norm of the error:
\[ \min_{\vec{z}_i \in K_{i+1}(\vec{r}_0, A)} \|\vec{e}_{i+1}\|_A. \]
Note that error minimisation in the $A$-norm is only possible when $A$ is SPD.

GMRES, in contrast, is intended for linear systems with generic nonsingular matrices $A$, not necessarily symmetric. It also considers optimal updates in the same Krylov space as CG,
\[ K_{i+1}(\vec{r}_0, A) = \mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^i\vec{r}_0\}. \]
It seeks $\vec{z}_i \in K_{i+1}(\vec{r}_0, A)$ such that the new iterate
\[ \vec{x}_{i+1} = \vec{x}_0 + \vec{z}_i \]
has a residual
\[ \vec{r}_{i+1} = \vec{r}_0 - A\vec{z}_i \]
that is minimal in the 2-norm:
\[ \min_{\vec{z}_i \in K_{i+1}(\vec{r}_0, A)} \|\vec{r}_{i+1}\|. \]
This minimisation of the residual in the 2-norm is more general than minimisation of the error in the $A$-norm, because it can be done for any matrix $A$. The resulting formulas are somewhat less economical than CG for SPD $A$, but GMRES is a very powerful approach for general linear systems.

The GMRES method proceeds as follows. GMRES computes an orthonormal basis $\{\vec{q}_0, \ldots, \vec{q}_i\}$ for $K_{i+1}(\vec{r}_0, A)$,
\[ Q_{i+1} = \begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_i \end{bmatrix}, \]
and it does so in an incremental way, computing an additional orthonormal vector $\vec{q}_i$ in every iteration. The matrix $Q_{i+1}$, with the orthonormal basis vectors as its columns, satisfies
\[ Q_{i+1}^T Q_{i+1} = I_{i+1}. \]
GMRES chooses the update $\vec{z}_i \in K_{i+1}(\vec{r}_0, A)$, which can be represented as
\[ \vec{z}_i = Q_{i+1}\vec{y} \]
for some $\vec{y} \in \mathbb{R}^{i+1}$. GMRES finds the optimal $\vec{y} \in \mathbb{R}^{i+1}$ in the expression
\[ \vec{x}_{i+1} = \vec{x}_0 + Q_{i+1}\vec{y} \]
that minimises $\vec{r}_{i+1}$ in the 2-norm. Note that all vector norms in this chapter denote vector 2-norms.

5.2 Arnoldi Orthogonalisation Procedure

GMRES generates an orthonormal basis for the Krylov space
\[ K_{i+1}(\vec{r}_0, A) = \mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^i\vec{r}_0\} \]
by setting $\vec{q}_0 = \vec{r}_0/\|\vec{r}_0\|$ and applying modified Gram-Schmidt to orthogonalise the vectors
\[ \{\vec{q}_0, A\vec{q}_0, A\vec{q}_1, \ldots, A\vec{q}_{i-1}\}. \]
Gram-Schmidt generates a new vector $\vec{v}_{m+1}$ orthogonal to the previous $\{\vec{q}_0, \vec{q}_1, \ldots, \vec{q}_m\}$ by subtracting from $A\vec{q}_m$ the components in the directions of the previous $\vec{q}_j$:
\[ \vec{v}_{m+1} = A\vec{q}_m - h_{0,m}\vec{q}_0 - h_{1,m}\vec{q}_1 - \ldots - h_{m,m}\vec{q}_m, \]
where the projection coefficients $h_{j,m}$ are determined in the standard way. The new orthonormal vector $\vec{q}_{m+1}$ is then determined by normalising $\vec{v}_{m+1}$:
\[ \vec{q}_{m+1} = \vec{v}_{m+1}/h_{m+1,m}, \]
where
\[ h_{m+1,m} = \|\vec{v}_{m+1}\|. \]
So the basis vectors $\{\vec{q}_0, \vec{q}_1, \ldots, \vec{q}_m, \vec{q}_{m+1}\}$ satisfy
\[ h_{m+1,m}\vec{q}_{m+1} = A\vec{q}_m - h_{0,m}\vec{q}_0 - h_{1,m}\vec{q}_1 - \ldots - h_{m,m}\vec{q}_m. \]
This procedure to generate an orthonormal basis of the Krylov space is called the Arnoldi procedure.

It can easily be shown that the set of Arnoldi vectors generated by the procedure is a basis for $\mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^i\vec{r}_0\}$:

Theorem 5.1
Let $\{\vec{q}_0, \ldots, \vec{q}_i\}$ be the vectors generated by the Arnoldi procedure. Then
\[ \mathrm{span}\{\vec{q}_0, \ldots, \vec{q}_i\} = \mathrm{span}\{\vec{r}_0, A\vec{r}_0, A^2\vec{r}_0, \ldots, A^i\vec{r}_0\}. \]

Proof. (sketch) This follows from a simple induction argument based on
\[ h_{m+1,m}\vec{q}_{m+1} = A\vec{q}_m - h_{0,m}\vec{q}_0 - h_{1,m}\vec{q}_1 - \ldots - h_{m,m}\vec{q}_m. \]
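As an illustration, the Arnoldi recurrence described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a reference implementation from these notes: the names `matvec`, `dot` and `arnoldi`, the dense list-of-rows storage, and the small 3 × 3 test matrix are all chosen here for illustration, and the sketch assumes that no breakdown occurs (every $h_{m+1,m}$ is non-zero).

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    """Euclidean inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(x, y))

def arnoldi(A, r0, steps):
    """Run `steps` Arnoldi iterations using modified Gram-Schmidt.

    Returns (Q, H) where Q is the list of orthonormal vectors
    q_0, ..., q_steps and H[j][m] stores the coefficient h_{j,m}
    (H has steps + 1 rows and steps columns).
    Assumes no breakdown, i.e. h_{m+1,m} != 0 for every m.
    """
    rho = math.sqrt(dot(r0, r0))
    Q = [[ri / rho for ri in r0]]
    H = [[0.0] * steps for _ in range(steps + 1)]
    for m in range(steps):
        v = matvec(A, Q[m])
        for j in range(m + 1):
            # Project out the component of v along q_j.
            H[j][m] = dot(Q[j], v)
            v = [vi - H[j][m] * qi for vi, qi in zip(v, Q[j])]
        H[m + 1][m] = math.sqrt(dot(v, v))
        Q.append([vi / H[m + 1][m] for vi in v])
    return Q, H

# Hypothetical small nonsymmetric test matrix (an assumption, not from the notes).
A = [[2.0, 1.0, 0.0],
     [0.0, 3.0, 1.0],
     [1.0, 0.0, 4.0]]
Q, H = arnoldi(A, [1.0, 0.0, 0.0], steps=2)
```

With `steps` playing the role of $i$, the returned vectors are $\vec{q}_0, \ldots, \vec{q}_i$, and each column of `H` holds the coefficients of one Gram-Schmidt step, so that $A\vec{q}_m = \sum_{j=0}^{m+1} h_{j,m}\vec{q}_j$ can be checked numerically.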
The Arnoldi procedure is given by:

Algorithm 5.2: Arnoldi Procedure for an Orthonormal Basis of $K_{i+1}(\vec{r}_0, A)$
Input: matrix $A \in \mathbb{R}^{n \times n}$; vector $\vec{r}_0$
Output: vectors $\vec{q}_0, \ldots, \vec{q}_i$ that form an orthonormal basis of $K_{i+1}(\vec{r}_0, A)$
    $\rho = \|\vec{r}_0\|$
    $\vec{q}_0 = \vec{r}_0/\rho$
    for $m = 0 : i-1$ do
        $\vec{v} = A\vec{q}_m$
        for $j = 0 : m$ do
            $h_{j,m} = \vec{q}_j^T \vec{v}$
            $\vec{v} = \vec{v} - h_{j,m}\vec{q}_j$
        end for
        $h_{m+1,m} = \|\vec{v}\|$
        $\vec{q}_{m+1} = \vec{v}/h_{m+1,m}$
    end for

The vectors and coefficients computed during the Arnoldi procedure can be written in matrix form as
\[
A \begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_i \end{bmatrix}
= \begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_i & \vec{q}_{i+1} \end{bmatrix}
\begin{bmatrix}
h_{0,0} & h_{0,1} & h_{0,2} & \cdots & h_{0,i} \\
h_{1,0} & h_{1,1} & h_{1,2} & \cdots & h_{1,i} \\
        & h_{2,1} & h_{2,2} & \cdots & h_{2,i} \\
        &         & h_{3,2} & \ddots & \vdots \\
        &         &         & h_{i,i-1} & h_{i,i} \\
0       &         &         &           & h_{i+1,i}
\end{bmatrix},
\]
or
\[ AQ_{i+1} = Q_{i+2}\tilde{H}_{i+1}. \]
Note that $Q_{i+1} \in \mathbb{R}^{n \times (i+1)}$, $Q_{i+2} \in \mathbb{R}^{n \times (i+2)}$, and $\tilde{H}_{i+1} \in \mathbb{R}^{(i+2) \times (i+1)}$. GMRES uses this relation to minimise $\|\vec{r}_{i+1}\|$ over the Krylov space in an efficient manner, as explained in the next section.

Note also that, when $i + 1 = n$, the process terminates with $h_{n,n-1} = \|\vec{v}_n\| = 0$, because there cannot be more than $n$ mutually orthogonal non-zero vectors in $\mathbb{R}^n$. At this point, we obtain
\[ AQ = QH, \]
where $Q \in \mathbb{R}^{n \times n}$ is orthogonal and $H \in \mathbb{R}^{n \times n}$ is a square matrix with zeros below the first subdiagonal:
\[
H = \begin{bmatrix}
h_{0,0} & h_{0,1} & h_{0,2} & \cdots & h_{0,n-1} \\
h_{1,0} & h_{1,1} & h_{1,2} & \cdots & h_{1,n-1} \\
        & h_{2,1} & h_{2,2} & \cdots & \vdots \\
        &         & h_{3,2} & \ddots & \vdots \\
        &         &         & \ddots & h_{n-2,n-1} \\
0       &         &         & h_{n-1,n-2} & h_{n-1,n-1}
\end{bmatrix}.
\]
This type of matrix is called an (upper) Hessenberg matrix:

Definition 5.3
Let $H \in \mathbb{R}^{n \times n}$. Then $H$ is called an (upper) Hessenberg matrix if $h_{ij} = 0$ for $j \le i - 2$.

This provides an orthogonal decomposition of $A$ into Hessenberg form:
\[ Q^T A Q = H. \]

5.3 GMRES Algorithm

GMRES uses the relation
\[ AQ_{i+1} = Q_{i+2}\tilde{H}_{i+1}, \tag{5.3} \]
as obtained from the Arnoldi procedure, to minimise $\|\vec{r}_{i+1}\|$ over the Krylov space in an efficient manner.

Since the columns of $Q_{i+1}$ form an orthonormal basis for $K_{i+1}(\vec{r}_0, A)$, GMRES chooses the optimal $\vec{y} \in \mathbb{R}^{i+1}$ in
\[ \vec{x}_{i+1} = \vec{x}_0 + Q_{i+1}\vec{y} \]
that minimises $\vec{r}_{i+1}$ in the 2-norm.

Note that, in Eq. (5.3), $Q_{i+1} \in \mathbb{R}^{n \times (i+1)}$ and $Q_{i+2} \in \mathbb{R}^{n \times (i+2)}$.
Since $n$ is typically large (millions or billions) and $i$ is small (perhaps 20-30 or so), these matrices have many rows, so we will seek to avoid computing with them directly. In contrast, $\tilde{H}_{i+1} \in \mathbb{R}^{(i+2) \times (i+1)}$ is a small matrix, and we will exploit this as follows.

Using Eq. (5.3), we write
\[ \|\vec{r}_{i+1}\| = \|\vec{r}_0 - AQ_{i+1}\vec{y}\| = \|\vec{r}_0 - Q_{i+2}\tilde{H}_{i+1}\vec{y}\|. \]
We know that $\vec{q}_0 = \vec{r}_0/\|\vec{r}_0\|$ forms the first column of $Q_{i+2}$, so we can write
\[ \vec{r}_0 = \|\vec{r}_0\| Q_{i+2}\vec{e}_1, \]
where $\vec{e}_1 \in \mathbb{R}^{i+2}$ is the first canonical basis vector, $\vec{e}_1 = (1, 0, \ldots, 0)^T$. Therefore,
\begin{align*}
\|\vec{r}_{i+1}\|^2 &= \big\| Q_{i+2}\big(\|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y}\big) \big\|^2 \\
&= \big( Q_{i+2}(\|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y}) \big)^T \big( Q_{i+2}(\|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y}) \big) \\
&= \big(\|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y}\big)^T Q_{i+2}^T Q_{i+2} \big(\|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y}\big) \\
&= \big(\|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y}\big)^T \big(\|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y}\big) \\
&= \big\| \|\vec{r}_0\|\vec{e}_1 - \tilde{H}_{i+1}\vec{y} \big\|^2.
\end{align*}
Minimising $\|\vec{r}_{i+1}\|$ over $\vec{y} \in \mathbb{R}^{i+1}$ then boils down to solving a small least-squares problem with an overdetermined matrix $\tilde{H}_{i+1} \in \mathbb{R}^{(i+2) \times (i+1)}$. For example, the normal equations for this problem are given by
\[ \tilde{H}_{i+1}^T \tilde{H}_{i+1}\vec{y} = \|\vec{r}_0\| \tilde{H}_{i+1}^T \vec{e}_1. \]
We find $\vec{x}_{i+1}$ from
\[ \vec{x}_{i+1} = \vec{x}_0 + Q_{i+1}\vec{y}. \]
In accurate implementations, the least-squares problem is solved using QR decomposition. As $i$ grows, the QR decomposition does not need to be recomputed for every new $i$, but can be updated cheaply as explained in [Saad, 2003]. Also, $\vec{x}_{i+1}$, or even $\vec{y}$, does not need to be computed in every iteration. Since the least-squares problem grows with $i$, it is common to restart the algorithm every 20 or so iterations.

The GMRES method for $A\vec{x} = \vec{b}$ is given by:

Algorithm 5.4: GMRES Method for $A\vec{x} = \vec{b}$
Input: matrix $A \in \mathbb{R}^{n \times n}$; initial guess $\vec{x}_0$
Output: sequence of approximations $\vec{x}_1, \vec{x}_2, \ldots$
    $\vec{r}_0 = \vec{b} - A\vec{x}_0$
    $\rho = \|\vec{r}_0\|$
    $\vec{q}_0 = \vec{r}_0/\rho$
    $m = 0$
    repeat
        $\vec{v} = A\vec{q}_m$
        for $j = 0 : m$ do
            $h_{j,m} = \vec{q}_j^T \vec{v}$
            $\vec{v} = \vec{v} - h_{j,m}\vec{q}_j$
        end for
        $h_{m+1,m} = \|\vec{v}\|$
        $\vec{q}_{m+1} = \vec{v}/h_{m+1,m}$
        find $\vec{y}$ that minimises $\|\rho\vec{e}_1 - \tilde{H}_{m+1}\vec{y}\|$
        $\vec{x}_{m+1} = \vec{x}_0 + Q_{m+1}\vec{y}$
        $\|\vec{r}_{m+1}\| = \|\rho\vec{e}_1 - \tilde{H}_{m+1}\vec{y}\|$
        $m = m + 1$
    until convergence criterion is satisfied

5.4 Convergence Properties of GMRES

The following convergence result can be proved for the case that $A$ is diagonalisable, $A = V\Lambda V^{-1}$. (Note that eigenvalues of $A \in \mathbb{R}^{n \times n}$ may be complex.)

Theorem 5.5
Let $A \in \mathbb{R}^{n \times n}$, nonsingular, be diagonalisable, $A = V\Lambda V^{-1}$. Then the residuals generated in the GMRES method satisfy
\[ \frac{\|\vec{r}_i\|}{\|\vec{r}_0\|} \le \kappa_2(V) \min_{p_i \in P_i} \max_{\lambda \in \Sigma(A)} |p_i(\lambda)|. \]
Here, $P_i$ is the set of polynomials $p_i(x)$ of degree at most $i$ which satisfy $p_i(0) = 1$, and $\Sigma(A)$ is the eigenvalue spectrum of $A$, i.e., the set of eigenvalues of $A$.

This theorem indicates that the convergence behaviour depends on the condition number of the matrix of eigenvectors of $A$, and on the distribution of the eigenvalues of $A$ in the complex plane. For example, clustered spectra tend to lead to fast convergence, since a low-degree polynomial can then typically be found that is small on the whole spectrum. Since GMRES updates can be written in terms of polynomials of $A$ multiplying $\vec{r}_0$, GMRES can be interpreted as seeking the optimal polynomial in $P_i$, which is used in the proof of this theorem.

5.5 Preconditioned GMRES

Left preconditioning for GMRES proceeds by considering
\[ PA\vec{x} = P\vec{b} \]
with, e.g., $P \approx A^{-1}$. Alternatively, right preconditioning for GMRES proceeds by considering
\[ APP^{-1}\vec{x} = \vec{b}, \]
or
\[ AP\vec{z} = \vec{b}, \qquad P^{-1}\vec{x} = \vec{z}. \]
The two variants perform similarly, but right preconditioning is sometimes preferred because it works with the original residual:
\[ \vec{r}_0 = \vec{b} - AP\vec{z}_0 = \vec{b} - APP^{-1}\vec{x}_0 = \vec{b} - A\vec{x}_0. \]
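To make Algorithm 5.4 and the right-preconditioning idea above concrete, here is a minimal sketch in plain Python. Several choices are assumptions made for illustration only: the function and variable names, the dense list-of-rows storage, the absence of restarts, and in particular the use of the normal equations (mentioned in Section 5.3 as one option) to solve the small least-squares problem. Accurate implementations would use an updated QR decomposition instead, as noted in the text. The optional argument `apply_P` applies a right preconditioner $P$; the default is the identity, which gives unpreconditioned GMRES.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def solve(M, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y

def gmres(A, b, x0, apply_P=lambda v: v, tol=1e-10):
    """Return (x, residual_norm); apply_P applies the right preconditioner P."""
    n = len(b)
    r0 = [bi - ti for bi, ti in zip(b, matvec(A, x0))]
    rho = math.sqrt(dot(r0, r0))
    if rho == 0.0:
        return list(x0), 0.0
    Q = [[ri / rho for ri in r0]]   # orthonormal basis vectors q_0, q_1, ...
    cols = []                       # cols[m] = column m of H-tilde (m + 2 entries)
    x, res = list(x0), rho
    for m in range(n):              # at most n steps; no restarts in this sketch
        v = matvec(A, apply_P(Q[m]))
        col = []
        for j in range(m + 1):      # modified Gram-Schmidt
            hjm = dot(Q[j], v)
            col.append(hjm)
            v = [vi - hjm * qi for vi, qi in zip(v, Q[j])]
        h_next = math.sqrt(dot(v, v))
        cols.append(col + [h_next])
        k = m + 1                   # H-tilde is (k + 1) x k
        H = [[cols[j][i] if i < len(cols[j]) else 0.0 for j in range(k)]
             for i in range(k + 1)]
        # Normal equations: H^T H y = rho * H^T e_1
        HtH = [[sum(H[i][a] * H[i][c] for i in range(k + 1)) for c in range(k)]
               for a in range(k)]
        rhs = [rho * H[0][a] for a in range(k)]
        y = solve(HtH, rhs)
        # x_{m+1} = x_0 + P Q_{m+1} y
        z = [sum(y[j] * Q[j][t] for j in range(k)) for t in range(n)]
        x = [xi + pi for xi, pi in zip(x0, apply_P(z))]
        # ||r_{m+1}|| = ||rho e_1 - H y||
        small = [(rho if i == 0 else 0.0) - sum(H[i][j] * y[j] for j in range(k))
                 for i in range(k + 1)]
        res = math.sqrt(dot(small, small))
        if res <= tol * rho or h_next == 0.0:
            break
        Q.append([vi / h_next for vi in v])
    return x, res

# Hypothetical 3 x 3 nonsymmetric test problem (an assumption for illustration).
A = [[4.0, 1.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 5.0]]
b = [1.0, 2.0, 3.0]
x, res = gmres(A, b, [0.0, 0.0, 0.0])
# Right preconditioning with a Jacobi-type choice P = A_D^{-1}.
jacobi = lambda v: [vi / A[i][i] for i, vi in enumerate(v)]
x_pre, res_pre = gmres(A, b, [0.0, 0.0, 0.0], apply_P=jacobi)
```

Note that the residual norm is obtained from the small problem only, without forming $\vec{b} - A\vec{x}_{m+1}$, and that the preconditioner enters in exactly two places: in the product $AP\vec{q}_m$ and in the final update $\vec{x}_0 + PQ_{m+1}\vec{y}$.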
This is right-preconditioned GMRES:

Algorithm 5.6: Right-Preconditioned GMRES Method for $A\vec{x} = \vec{b}$
Input: matrix $A \in \mathbb{R}^{n \times n}$; initial guess $\vec{x}_0$; preconditioner $P \approx A^{-1}$
Output: sequence of approximations $\vec{x}_1, \vec{x}_2, \ldots$
    $\vec{r}_0 = \vec{b} - A\vec{x}_0$
    $\rho = \|\vec{r}_0\|$
    $\vec{q}_0 = \vec{r}_0/\rho$
    $m = 0$
    repeat
        $\vec{v} = AP\vec{q}_m$
        for $j = 0 : m$ do
            $h_{j,m} = \vec{q}_j^T \vec{v}$
            $\vec{v} = \vec{v} - h_{j,m}\vec{q}_j$
        end for
        $h_{m+1,m} = \|\vec{v}\|$
        $\vec{q}_{m+1} = \vec{v}/h_{m+1,m}$
        find $\vec{y}$ that minimises $\|\rho\vec{e}_1 - \tilde{H}_{m+1}\vec{y}\|$
        $\vec{x}_{m+1} = \vec{x}_0 + PQ_{m+1}\vec{y}$
        $\|\vec{r}_{m+1}\| = \|\rho\vec{e}_1 - \tilde{H}_{m+1}\vec{y}\|$
        $m = m + 1$
    until convergence criterion is satisfied

5.6 Lanczos Orthogonalisation Procedure for Symmetric Matrices

If $A = A^T$, then the Hessenberg matrix obtained by the Arnoldi process satisfies
\[ H^T = (Q^T A Q)^T = Q^T A^T Q = Q^T A Q = H, \]
so $H$ is symmetric, which implies that it is tridiagonal. Therefore, the Arnoldi update formula simplifies from
\[ h_{m+1,m}\vec{q}_{m+1} = A\vec{q}_m - h_{0,m}\vec{q}_0 - h_{1,m}\vec{q}_1 - \ldots - h_{m,m}\vec{q}_m \]
to a three-term recursion relation
\[ h_{m+1,m}\vec{q}_{m+1} = A\vec{q}_m - h_{m-1,m}\vec{q}_{m-1} - h_{m,m}\vec{q}_m, \]
with
\[
A \begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_i \end{bmatrix}
= \begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_i & \vec{q}_{i+1} \end{bmatrix}
\begin{bmatrix}
h_{0,0} & h_{0,1} &         &         &  \\
h_{1,0} & h_{1,1} & h_{1,2} &         &  \\
0       & h_{2,1} & h_{2,2} & h_{2,3} &  \\
        & \ddots  & \ddots  & \ddots  &  \\
        &         &         & h_{i,i-1} & h_{i,i} \\
0       &         &         &           & h_{i+1,i}
\end{bmatrix},
\]
or, taking into account the symmetry further,
\[
\tilde{H}_{i+1} = \begin{bmatrix}
\alpha_0 & \beta_0  &          &             &  \\
\beta_0  & \alpha_1 & \beta_1  &             &  \\
0        & \beta_1  & \alpha_2 & \beta_2     &  \\
         & \ddots   & \ddots   & \ddots      &  \\
         &          &          & \beta_{i-1} & \alpha_i \\
0        &          &          &             & \beta_i
\end{bmatrix}.
\]
The simplification of the Arnoldi procedure to compute the orthonormal basis $\{\vec{q}_0, \ldots, \vec{q}_i\}$ of the Krylov space based on
\[ A\vec{q}_i = \beta_{i-1}\vec{q}_{i-1} + \alpha_i\vec{q}_i + \beta_i\vec{q}_{i+1} \]
is called the Lanczos procedure. It can be shown that the Lanczos procedure is related to the CG algorithm (just like Arnoldi is used by GMRES).

The Lanczos procedure is given by:

Algorithm 5.7: Lanczos Procedure for an Orthonormal Basis of $K_{i+1}(\vec{r}_0, A)$
Input: matrix $A \in \mathbb{R}^{n \times n}$, symmetric; vector $\vec{r}_0$
Output: vectors $\vec{q}_0, \ldots, \vec{q}_i$ that form an orthonormal basis of $K_{i+1}(\vec{r}_0, A)$
    $\rho = \|\vec{r}_0\|$
    $\vec{q}_0 = \vec{r}_0/\rho$
    $\beta_{-1} = 0$
    $\vec{q}_{-1} = 0$
    for $m = 0 : i-1$ do
        $\vec{v} = A\vec{q}_m$
        $\alpha_m = \vec{q}_m^T \vec{v}$
        $\vec{v} = \vec{v} - \alpha_m\vec{q}_m - \beta_{m-1}\vec{q}_{m-1}$
        $\beta_m = \|\vec{v}\|$
        $\vec{q}_{m+1} = \vec{v}/\beta_m$
    end for

Finally, it can be shown that the eigenvalues of the $(i+1) \times (i+1)$ matrix $\hat{H}_{i+1}$ that is formed by the first $i+1$ rows of $\tilde{H}_{i+1} \in \mathbb{R}^{(i+2) \times (i+1)}$, obtained by Arnoldi or Lanczos, provide approximations for eigenvalues of $A$. Indeed, when $i + 1 = n$, we can consider the eigenvalue decomposition $H = V\Lambda V^{-1}$ of $H$, and then
\[ AQ = QH = QV\Lambda V^{-1} \]
implies
\[ AQV = QV\Lambda, \]
i.e., the columns of $QV$ are the eigenvectors of $A$, with associated eigenvalues in $\Lambda$. When $i + 1 = n$ this relation is exact. When $i \ll n$, the eigenvalues of $\hat{H}_{i+1} = V\Lambda V^{-1}$ approximate some of the eigenvalues of $A$, and the columns of $Q_{i+1}V$ approximate the associated eigenvectors. Eigenvalue and eigenvector computation is an important topic of the second part of this unit.

Part II
Eigenvalues and Singular Values

Chapter 6
Basic Algorithms for Eigenvalues

Eigenvalue problems and the singular value decomposition are particularly interesting because they serve as the driving force behind many important practical problems, in areas ranging from structural dynamics, quantum chemistry, and data science to Markov chain techniques, control theory, and beyond. Numerically stable and computationally fast algorithms for identifying eigenvalues and eigenvectors are powerful and yet far from obvious to construct.

6.1 Example: Page Rank and Stochastic Matrix

Before diving into the details of these jewels of computational science, we will first introduce the stochastic matrix of a Markov chain. This will be used later as an example in tutorial and assignment questions.

Markov chains are widely used for studying cruise control systems in motor vehicles, queues or lines of customers arriving at an airport, exchange rates of
currencies, or even modelling internet search. Here we will use page rank as the motivating example.¹

Step 1: A directed graph consists of a non-empty set of nodes and a set of directed edges. Nodes are indexed by natural numbers, $1, 2, \cdots$. If there is an edge from the node $i$ to the node $j$, then $i$ is often called the tail, while $j$ is called the head. Each directed edge represents a possible transition from its tail to its head.

Given a collection of web sites, it is reasonable to think of a web site $i$ as a node, and of a hyperlink pointing to another web site $j$ as a directed edge from $i$ to $j$. This creates a directed graph.

[Figure: directed graph with four nodes S1, S2, S3, S4; each directed edge is labelled with its transition probability.]

Remark
Since the importance of a web site is measured by its popularity (how many incoming links it has), we can view the importance of a site $i$ as the probability that a random surfer on the Internet enters that web site by following hyperlinks.

Step 2: We can weigh the edges (hyperlinks) of the graph in a probabilistic way. A web site $i$ is linked to other web sites (including itself) by hyperlinks. We can count the number of hyperlinks in site $i$ that point to a web site $j$, and normalise these numbers by the total number of hyperlinks contained in site $i$. This way, the directed edges from $i$ to $j$ are weighted, and the weights can be interpreted as a discrete probability distribution. Following this probability distribution, a random surfer currently browsing web site $i$ will enter other sites.

This transition is described by a stochastic system: at each node $i$, the transition from the node $i$ to the node $j$ occurs with a certain probability. This discrete probability distribution is represented as a vector, whose $j$-th entry is the probability of moving to node $j$. It follows several principles:

• If the transition probability from $i$ to $j$ is 0, there is no edge that starts at $i$ and ends at $j$, and vice versa.
• We can have a probability of staying at the current node $i$; this is represented by an edge that starts and ends at the same node $i$.

• At any given node, the sum of transition probabilities to other nodes (including the current node) must be 1.

¹ The web pages shown above are downloaded from http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html

This way, we can define a stochastic matrix, also known as a transition matrix or Markov matrix, to describe the transitions of a Markov chain. If we assume that there are $n$ possible nodes, the stochastic matrix $M \in \mathbb{R}^{n \times n}$ is a square matrix, and each of its entries is a nonnegative real number representing the probability of moving from the node indexed by its row number to the node indexed by its column number. The $i$-th row of the matrix $M$ is the discrete probability distribution of moving from the current node $i$ to other nodes.

Example 6.1
For example, the system in the figure has the following stochastic matrix:
\[
M = \begin{bmatrix}
0.5 & 0.5 & 0 & 0 \\
0 & 0.4 & 0 & 0.6 \\
0.3 & 0.3 & 0.3 & 0.1 \\
0 & 0.4 & 0.5 & 0.1
\end{bmatrix}.
\]
Note that each row of the stochastic matrix sums to 1.

Step 3: How can we work out the popularity of a collection of web sites given the stochastic matrix? We need to figure out the probability distribution of surfers entering this collection of web sites. A site is considered popular if it has a high probability of being visited.

At a current step $k$, we assume the probabilities of visiting the web sites in the collection can be represented by a vector $\vec{x}^{(k)}$. This vector is called the state of the system at time $k$. The probability that a web site $j$ will be visited in the next step $k + 1$ is the sum, over all sites $i$, of the probability of currently visiting site $i$ and then following an edge entering site $j$. This is given as
\[ \vec{x}^{(k+1)}(j) = \sum_{i=1}^{n} \vec{x}^{(k)}(i) M_{ij}, \tag{6.1} \]
where $\vec{x}^{(k)}(i)$ is the $i$-th element of the vector $\vec{x}^{(k)}$. This way, the probability vector $\vec{x}^{(k+1)}$ is given as
\[ \big(\vec{x}^{(k+1)}\big)^\top = \big(\vec{x}^{(k)}\big)^\top M. \]
(6.2)

Example 6.2
Continuing with the transition matrix in the diagram, we assume all web sites have equal probability of being visited at the beginning, i.e., $\vec{x}^{(0)} = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}^\top$. The probability of visiting the node 2 in the next step is given as
\[
\sum_{i=1}^{4} \vec{x}^{(0)}(i) M(i, 2) = \big(\vec{x}^{(0)}\big)^\top M(:, 2)
= \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}
\begin{bmatrix} 0.5 \\ 0.4 \\ 0.3 \\ 0.4 \end{bmatrix} = 0.4.
\]

Starting with an initial probability distribution $\vec{x}^{(0)}$, after $k$ steps, the probability distribution of web sites being visited is
\[ \big(\vec{x}^{(k)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M^k. \tag{6.3} \]
The vector $\vec{x}^{(k)}$, as $k \to \infty$, represents the probability distribution of web sites being visited after a large number of visits.

Example 6.3
Continuing with the transition matrix in the diagram, we assume all web sites have equal probability of being visited, i.e., $\vec{x}^{(0)} = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}^\top$. After one step of the transition, we obtain
\[ \big(\vec{x}^{(1)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M = \begin{bmatrix} 0.2 & 0.4 & 0.2 & 0.2 \end{bmatrix}. \]
After three steps of the transition, we obtain
\[ \big(\vec{x}^{(3)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M^3 = \begin{bmatrix} 0.128 & 0.4 & 0.188 & 0.284 \end{bmatrix}. \]
After five steps of the transition, we obtain
\[ \big(\vec{x}^{(5)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M^5 = \begin{bmatrix} 0.11972 & 0.3922 & 0.20312 & 0.28496 \end{bmatrix}. \]
After ten steps of the transition, we obtain
\[ \big(\vec{x}^{(10)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M^{10} = \begin{bmatrix} 0.12164 & 0.3919 & 0.20269 & 0.28378 \end{bmatrix}. \]
After one hundred steps of the transition, we obtain
\[ \big(\vec{x}^{(100)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M^{100} = \begin{bmatrix} 0.12162 & 0.39189 & 0.2027 & 0.28378 \end{bmatrix}. \]
After one thousand steps of the transition, we obtain
\[ \big(\vec{x}^{(1000)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M^{1000} = \begin{bmatrix} 0.12162 & 0.39189 & 0.2027 & 0.28378 \end{bmatrix}. \]
We can also randomly choose an initial distribution $\vec{x}^{(0)} = \begin{bmatrix} 0.081295 & 0.54474 & 0.30791 & 0.066051 \end{bmatrix}^\top$. After one thousand steps of the transition, we obtain
\[ \big(\vec{x}^{(1000)}\big)^\top = \big(\vec{x}^{(0)}\big)^\top M^{1000} = \begin{bmatrix} 0.12162 & 0.39189 & 0.2027 & 0.28378 \end{bmatrix}. \]
Regardless of the initial distribution, the transition seems to converge to a stationary distribution.

Studying the eigenvectors and eigenvalues of $M$ is crucial for identifying $\vec{x}^{(k)}$ as $k \to \infty$ and understanding its behaviour. In fact, we can show that the largest eigenvalue of $M$ is one.
Under certain conditions, the vector $\vec{x}^{(k)}$ converges to a vector $\vec{x}^{(\infty)}$ as $k$ becomes large. The vector $\vec{x}^{(\infty)}$ is called the stationary distribution. The stationary distribution is invariant under the transition defined by $M$, which can be stated in the form
\[ \big(\vec{x}^{(\infty)}\big)^\top = \big(\vec{x}^{(\infty)}\big)^\top M. \]
In fact, the stationary distribution is the eigenvector of $M^\top$ associated with eigenvalue 1.

Remark
Here, we assume that the next state of the system only depends on its current state. This property is called the Markov property.

The stochastic matrix discussed so far is often called the right stochastic matrix, as it appears on the right side of the multiplication in defining a transition.

Definition 6.4
Given a right stochastic matrix $M_r \in \mathbb{R}^{n \times n}$, its entry $M_r(i, j)$ defines the transition probability from the node $i$ to the node $j$. Each row of $M_r$ sums to 1.

For computational convenience, in these notes we work with the transpose of the right stochastic matrix, which is often referred to as the left stochastic matrix.

Definition 6.5
A left stochastic matrix $M_l \in \mathbb{R}^{n \times n}$ is a stochastic matrix in which each entry $M_l(i, j)$ defines the transition probability from the node $j$ to the node $i$. Each column of $M_l$ sums to 1. For the same transition diagram, $M_l = M_r^\top$.

Example 6.6
The left stochastic matrix of the given diagram is
\[
M_l = \begin{bmatrix}
0.5 & 0 & 0.3 & 0 \\
0.5 & 0.4 & 0.3 & 0.4 \\
0 & 0 & 0.3 & 0.5 \\
0 & 0.6 & 0.1 & 0.1
\end{bmatrix}.
\]
Given an initial distribution $\vec{x}^{(0)} = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}^\top$, the probability of visiting the node 2 in the next step is given as
\[
\sum_{i=1}^{4} M_l(2, i)\,\vec{x}^{(0)}(i) = M_l(2, :)\,\vec{x}^{(0)}
= \begin{bmatrix} 0.5 & 0.4 & 0.3 & 0.4 \end{bmatrix}
\begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix} = 0.4.
\]

Given $\vec{x}^{(k)}$, the probability vector $\vec{x}^{(k+1)}$ is given as
\[ \vec{x}^{(k+1)} = M_l\,\vec{x}^{(k)}. \tag{6.4} \]
The stationary distribution has the property
\[ \vec{x}^{(\infty)} = M_l\,\vec{x}^{(\infty)}. \tag{6.5} \]
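The iteration (6.4) is easy to reproduce numerically. The following is a minimal sketch in plain Python, using the matrices of Examples 6.1 and 6.6; the function name `step` and the list-based storage are assumptions made here for illustration. It repeatedly applies $M_l$ to the uniform initial distribution and keeps all iterates so that they can be compared with the values listed in Example 6.3.

```python
def step(M_left, x):
    """One transition x^(k+1) = M_l x^(k) for a left stochastic matrix M_l."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M_left]

# Right stochastic matrix of Example 6.1 (each row sums to 1).
M_right = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.4, 0.0, 0.6],
           [0.3, 0.3, 0.3, 0.1],
           [0.0, 0.4, 0.5, 0.1]]

# Left stochastic matrix M_l = M_r^T of Example 6.6 (each column sums to 1).
M_left = [list(col) for col in zip(*M_right)]

x = [0.25, 0.25, 0.25, 0.25]   # uniform initial distribution x^(0)
history = [x]                  # history[k] stores x^(k)
for k in range(100):
    x = step(M_left, x)
    history.append(x)

# Rounded iterates, to be compared with Example 6.3.
one_step = [round(v, 5) for v in history[1]]
hundred_steps = [round(v, 5) for v in history[100]]
```

Because every column of $M_l$ sums to 1, each iterate remains a probability distribution, and after enough steps `history[k]` no longer changes to the displayed precision, which is the numerical signature of the stationary distribution in (6.5).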
6.2 Fundamentals of Eigenvalue Problems

Let $A \in \mathbb{R}^{n \times n}$ be a square matrix. A non-zero vector $\vec{x} \in \mathbb{C}^n$ is called an eigenvector of $A$, and $\lambda \in \mathbb{C}$ is called its corresponding eigenvalue, if
\[ A\vec{x} = \lambda\vec{x}. \tag{6.6} \]
Here we review the basic mathematics of eigenvalues and eigenvectors.

6.2.1 Notations

These notes do not deal with the eigenvalue problems of matrices with complex entries. However, the eigenvalues and eigenvectors of a matrix with real entries may not be real.

Example 6.7
The matrix
\[ A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \]
has eigenvalues $\pm i$ and eigenvectors $[1, \pm i]^\top/\sqrt{2}$.

We need to introduce some special matrix operations and special matrices involving complex numbers.

Definition 6.8
The conjugate transpose or Hermitian transpose of a matrix $A \in \mathbb{C}^{n \times m}$ with complex entries is the $m$-by-$n$ matrix $A^*$ obtained from $A$ by taking the transpose and then taking the complex conjugate of each entry (i.e., negating their imaginary parts but not their real parts). This takes the form
\[ A^*_{ij} = \overline{A_{ji}}. \]

Definition 6.9
A unitary matrix is a complex square matrix $Q \in \mathbb{C}^{n \times n}$ whose conjugate transpose $Q^*$ is also its inverse. That is,
\[ QQ^* = Q^*Q = I. \]
A real orthogonal matrix is unitary.

We also introduce some basic matrix notation that will be used in Part II.

1. Given a matrix $A$, the entry on the $i$-th row and $j$-th column is denoted as $A_{ij}$.
2. The $k$-th power of a square matrix $A$ is denoted as $A^k$.
3. The superscript $(k)$ is used to denote a variable at the $k$-th iteration of an algorithm. For example, $A^{(k)}$ is a matrix $A$ in the $k$-th iteration of some algorithm. In general, $A^{(k)}$ is not $A^k$.
4. $\vec{v}_i$ is used to denote the $i$-th vector in a sequence of vectors, and $\vec{v}_i(j)$ is used to denote the $j$-th entry of the vector $\vec{v}_i$.
5. To be consistent with 4, we also use $A(i, j)$ to denote the entry on the $i$-th row and $j$-th column of a matrix $A$, i.e., $A(i, j) = A_{ij}$.
6. Similarly, we use $A(k : l, m : n)$ to denote a submatrix of $A$: the submatrix spans the $k$-th to $l$-th rows and the $m$-th to $n$-th columns of $A$.

6.2.2 Eigenvalue and Eigenvector

Equation (6.6) can be equivalently stated as
\[ (A - \lambda I)\vec{x} = 0. \tag{6.7} \]
For a given eigenvalue $\lambda$, there may exist a set of linearly independent eigenvectors $E_\lambda$ such that
\[ (A - \lambda I)\vec{v} = 0, \quad \forall \vec{v} \in E_\lambda. \]
The subspace spanned by the set $E_\lambda$ is called an eigenspace. The eigenspace is the nullspace of the matrix $(A - \lambda I)$.

Equation (6.7) has a non-zero solution $\vec{x}$ if and only if the determinant of the matrix $A - \lambda I$ is zero. The determinant of the matrix $A - \lambda I$ can be expressed as a polynomial in $\lambda$.

Definition 6.10
The characteristic polynomial of $A$ is the degree-$n$ polynomial
\[ p_A(\lambda) = \det(A - \lambda I). \tag{6.8} \]
The eigenvalues of a matrix $A$ are the roots of the characteristic polynomial.

The fundamental theorem of algebra implies that the characteristic polynomial of $A \in \mathbb{R}^{n \times n}$, which is a degree-$n$ polynomial, can be factored (up to the sign factor $(-1)^n$ that comes from the definition (6.8)) as
\[ p_A(\lambda) = (-1)^n (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_n) = (-1)^n \prod_{i=1}^{n} (\lambda - \lambda_i). \]
We note that each of the roots of the characteristic polynomial, $\lambda_i$, can be a complex number.

The roots $\lambda_1, \lambda_2, \ldots, \lambda_n$ may not all have distinct values. This leads to the concept of algebraic multiplicity of an eigenvalue.

Definition 6.11
The algebraic multiplicity of an eigenvalue $\lambda_i$, $\mu_A(\lambda_i)$, is the multiplicity of $\lambda_i$ as a root of $p_A(\lambda)$. An eigenvalue is simple if it has algebraic multiplicity 1.

Another multiplicity of an eigenvalue $\lambda_i$, the geometric multiplicity, is defined by the dimension of the nullspace of $(A - \lambda_i I)$.

Definition 6.12
The geometric multiplicity of $\lambda_i$, $\mu_G(\lambda_i)$, is the number of linearly independent eigenvectors associated with $\lambda_i$, or equivalently the dimension of the nullspace of $(A - \lambda_i I)$.

The algebraic multiplicity of any eigenvalue of a matrix $A \in \mathbb{R}^{n \times n}$ is always greater than or equal to its geometric multiplicity. Later we will prove this.
Example 6.13
Consider two matrices
\[
A = \begin{bmatrix} a & & \\ & a & \\ & & a \end{bmatrix}, \qquad
B = \begin{bmatrix} a & 1 & \\ & a & 1 \\ & & a \end{bmatrix}, \tag{6.9}
\]
where $a > 0$. Both $A$ and $B$ have the same characteristic polynomial $(a - \lambda)^3$. $A$ has three linearly independent eigenvectors, whereas $B$ only has one, namely $\vec{e}_1$ (and its scalar multiples).

Definition 6.14
A defective eigenvalue is an eigenvalue whose algebraic multiplicity exceeds its geometric multiplicity. A matrix $A \in \mathbb{R}^{n \times n}$ is called a defective matrix if it has one or more defective eigenvalues.

6.2.3 Similarity Transformation

Definition 6.15
If a matrix $X \in \mathbb{C}^{n \times n}$ is nonsingular, then the map
\[ A \to X^{-1}AX \]
is called a similarity transformation. Two matrices $A \in \mathbb{C}^{n \times n}$ and $B \in \mathbb{C}^{n \times n}$ are called similar if there exists a nonsingular matrix $X \in \mathbb{C}^{n \times n}$ such that
\[ B = X^{-1}AX. \]

Similar matrices $A$ and $X^{-1}AX$ share many important properties.

Theorem 6.16
Given a matrix $A \in \mathbb{C}^{n \times n}$ and a nonsingular matrix $X \in \mathbb{C}^{n \times n}$, $A$ and $X^{-1}AX$ have the same characteristic polynomial, eigenvalues, and algebraic and geometric multiplicities.

Proof. By the definition of the characteristic polynomial, we have
\begin{align*}
p_{X^{-1}AX}(\lambda) &= \det(X^{-1}AX - \lambda I) \\
&= \det(X^{-1}(A - \lambda I)X) \\
&= \det(X^{-1}) \det(A - \lambda I) \det(X) \\
&= \det(A - \lambda I) \\
&= p_A(\lambda).
\end{align*}
Since $A$ and $X^{-1}AX$ have the same characteristic polynomial, the agreement of eigenvalues and algebraic multiplicities follows. The dimension of the nullspace of $(A - \lambda I)$ is the same as that of $X^{-1}(A - \lambda I)X$ because $X$ is nonsingular, and thus the agreement of geometric multiplicities follows.

Similarity transformations can be used to show the connection between algebraic multiplicity and geometric multiplicity.

Theorem 6.17
The algebraic multiplicity of any eigenvalue of a matrix $A \in \mathbb{R}^{n \times n}$ is always greater than or equal to its geometric multiplicity.

Proof. Suppose an eigenvalue $\lambda$ with geometric multiplicity $r$ has $r$ linearly independent eigenvectors, $\vec{v}_1, \ldots, \vec{v}_r$, which we may take to be orthonormal (see Section 6.2.5), forming a matrix
\[ V_r = \begin{bmatrix} \vec{v}_1 & \cdots & \vec{v}_r \end{bmatrix}, \]
such that $AV_r = \lambda V_r$. We can extend $V_r$ to a unitary matrix $V = [V_r | V_\perp]$.
Applying the similarity transformation $V^*AV$, we obtain
\[ B = V^*AV = \begin{bmatrix} \lambda I_r & C \\ 0 & D \end{bmatrix}, \]
where $I_r \in \mathbb{C}^{r \times r}$ is an identity matrix, $C \in \mathbb{C}^{r \times (n-r)}$, and $D \in \mathbb{C}^{(n-r) \times (n-r)}$. Since $A$ and $B$ have the same characteristic polynomial, we can then express the characteristic polynomial of $A$ as
\[ p_A(z) = \det(B - zI) = \det(\lambda I_r - zI_r) \det(D - zI_{n-r}) = (\lambda - z)^r \det(D - zI_{n-r}). \]
Thus the algebraic multiplicity of $\lambda$ is greater than or equal to $r$.

6.2.4 Eigendecomposition, Diagonalisation, and Schur Factorisation

Let $A \in \mathbb{R}^{n \times n}$ be non-defective, i.e., the algebraic multiplicity and the geometric multiplicity of each of its eigenvalues are the same. Then we have
\[ AV = V\Lambda, \tag{6.10} \]
where
\[
V = \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \cdots & \vec{v}_n \end{bmatrix}, \qquad
\Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}
\]
contain the eigenvectors and corresponding eigenvalues. This effectively factorises the matrix $A$ in the form
\[ A = V\Lambda V^{-1}. \tag{6.11} \]
This similarity transformation effectively diagonalises the matrix $A$. In fact, it is easy to verify that a diagonal matrix is non-defective.

Theorem 6.18
A matrix $A \in \mathbb{R}^{n \times n}$ is non-defective if and only if it has an eigenvalue decomposition $A = V\Lambda V^{-1}$.

Proof. Given an eigenvalue decomposition $A = V\Lambda V^{-1}$, we know that $A$ and $\Lambda$ are similar. Since the diagonal matrix $\Lambda$ is non-defective, $A$ is non-defective by Theorem 6.16.

Conversely, a non-defective matrix must have $n$ linearly independent eigenvectors, because (1) the number of linearly independent eigenvectors associated with each eigenvalue is equal to its algebraic multiplicity; and (2) eigenvectors associated with different eigenvalues are linearly independent. Thus, the resulting matrix $V$ formed by all the eigenvectors is nonsingular.

A matrix $A$ is called unitarily diagonalisable if there exists a unitary matrix $Q$ such that
\[ A = Q\Lambda Q^*. \]
Real symmetric matrices are special matrices that are orthogonally diagonalisable. This leads to many computational advantages in finding their eigenvalues.
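As a small numerical illustration of orthogonal diagonalisation, the sketch below checks $A = Q\Lambda Q^\top$ and $Q^\top Q = I$ for a $2 \times 2$ real symmetric matrix. The matrix $A$, its eigenpairs (eigenvalues 3 and 1 with eigenvectors $[1, 1]^\top/\sqrt{2}$ and $[1, -1]^\top/\sqrt{2}$), and the helper names `matmul` and `transpose` are assumptions chosen here for illustration; they do not come from the notes.

```python
import math

def matmul(X, Y):
    """Product of two small dense matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

s = 1.0 / math.sqrt(2.0)

A = [[2.0, 1.0],
     [1.0, 2.0]]       # real symmetric example matrix (an assumption)
Q = [[s,  s],
     [s, -s]]          # columns: normalised eigenvectors for eigenvalues 3 and 1
Lam = [[3.0, 0.0],
       [0.0, 1.0]]     # diagonal matrix of eigenvalues

QtQ = matmul(transpose(Q), Q)                      # should equal the identity
A_rebuilt = matmul(matmul(Q, Lam), transpose(Q))   # should reproduce A
```

Here the characteristic polynomial is $\det(A - \lambda I) = (2 - \lambda)^2 - 1 = (\lambda - 1)(\lambda - 3)$, both eigenvalues are simple, and the two eigenvectors are orthogonal, so $Q$ is a real orthogonal matrix and $Q^{-1} = Q^\top$.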
Remark 6.19
A real symmetric matrix is orthogonally diagonalisable and its eigenvalues are real. That is, both $Q$ and $\Lambda$ are real for a symmetric $A$.

Not every matrix is unitarily diagonalisable. Furthermore, defective matrices are not diagonalisable at all. A more general matrix decomposition is the Schur factorisation.

Definition 6.20
A Schur factorisation of a matrix $A \in \mathbb{R}^{n \times n}$ takes the form
\[ A = QTQ^*, \]
where $T$ is upper-triangular and $Q$ is unitary.

Theorem 6.21
Every square matrix $A \in \mathbb{R}^{n \times n}$ has a Schur factorisation.

Proof. The case $n = 1$ is trivial as $A$ is a scalar. Suppose $n \ge 2$. Let $\vec{x}$ be any eigenvector of $A$ with corresponding eigenvalue $\lambda$. Take $\vec{x}$ to be normalised and let it be the first column of a unitary matrix $U$ of the form
\[ U = [\vec{x} \,|\, U_2], \]
where $U_2 \in \mathbb{C}^{n \times (n-1)}$. The product $U^*AU$ has the form
\[ U^*AU = \begin{bmatrix} \vec{x}^*A\vec{x} & \vec{x}^*AU_2 \\ U_2^*A\vec{x} & U_2^*AU_2 \end{bmatrix}. \]
We have $\vec{x}^*A\vec{x} = \lambda$ and $U_2^*A\vec{x} = \lambda U_2^*\vec{x} = 0$. Letting $C = \vec{x}^*AU_2$ and $D = U_2^*AU_2$, the product can be simplified to
\[ U^*AU = \begin{bmatrix} \lambda & C \\ 0 & D \end{bmatrix}. \]
By induction, there exists a Schur factorisation $VTV^*$ of the lower dimensional matrix $D$. Then define the unitary matrix
\[ Q = \begin{bmatrix} 1 & 0 \\ 0 & V \end{bmatrix}, \]
and we have
\[ (Q^*U^*)A(UQ) = \begin{bmatrix} \lambda & CV \\ 0 & T \end{bmatrix}. \]
Since $UQ$ is unitary and $(Q^*U^*)A(UQ)$ is upper triangular, we obtain the Schur factorisation.

Theorem 6.22
The eigenvalues of a triangular matrix are the entries on its main diagonal.

Proof. Let $T \in \mathbb{R}^{n \times n}$ be an upper triangular matrix. The characteristic polynomial of $T$ can be written as
\[ p_T(\lambda) = \det(T - \lambda I). \]
We can partition $T - \lambda I_n$ in the following form
\[ T - \lambda I_n = \begin{bmatrix} T_2 - \lambda I_{n-1} & \vec{t}_1 \\ \vec{0}^\top & T_{nn} - \lambda \end{bmatrix}, \]
where $T_2 - \lambda I_{n-1}$ is also upper triangular. Using the property of the determinant of block matrices, we have
\begin{align*}
\det(T - \lambda I) &= \det(T_2 - \lambda I_{n-1}) \det\big(T_{nn} - \lambda - \vec{0}^\top (T_2 - \lambda I_{n-1})^{-1}\vec{t}_1\big) \\
&= \det(T_2 - \lambda I_{n-1})(T_{nn} - \lambda).
\end{align*}
Since $T_2 - \lambda I_{n-1}$ is also upper triangular, repeatedly applying this procedure leads to the characteristic polynomial
\[ p_T(\lambda) = \prod_{i=1}^{n} (T_{ii} - \lambda). \]
Therefore, the eigenvalues of a triangular matrix are the entries on its main diagonal.

Remark 6.23
In summary, we have the following important results for identifying eigenvalues of a matrix.

1. A matrix $A$ is non-defective if and only if there exists an eigenvalue decomposition $A = V\Lambda V^{-1}$.
2. For a symmetric matrix $A$, there exists an orthogonal diagonalisation $A = Q\Lambda Q^*$.
3. A unitary triangularisation (Schur factorisation) $A = QTQ^*$ always exists.

Theorem 6.24
A real square matrix is symmetric if and only if it has the eigendecomposition
\[ A = Q\Lambda Q^\top, \]
where $Q$ is a real orthogonal matrix and $\Lambda$ is a real diagonal matrix whose entries are the eigenvalues of $A$.

Proof. (The "only if" part $\Longrightarrow$): From Theorem 6.21 we know that a general square matrix has the Schur decomposition $A = QTQ^\top$, where $T$ is upper triangular. This way we have
\[ T = Q^\top AQ. \]
For a symmetric matrix $A$, the matrix $T$ must also be symmetric. A symmetric upper triangular matrix must be diagonal. This leads to the decomposition $A = QTQ^\top$ where $T$ is diagonal, which is an eigendecomposition. Furthermore, all the eigenvalues and eigenvectors of a real symmetric matrix are real.

(The "if" part $\Longleftarrow$): Given the eigendecomposition $A = Q\Lambda Q^\top$ with a real orthogonal $Q$ and real diagonal $\Lambda$, $A$ is a real symmetric matrix since $\Lambda$ is symmetric. Therefore the result follows.

6.2.5 Extending Orthogonal Vectors to a Unitary Matrix

In the proofs in the previous subsection, one important step is extending a rectangular matrix
\[ V_r = \begin{bmatrix} \vec{v}_1 & \cdots & \vec{v}_r \end{bmatrix}, \]
where $V_r \in \mathbb{C}^{n \times r}$, to a unitary matrix
\[ V = \begin{bmatrix} V_r & V_\perp \end{bmatrix}, \]
where $V_\perp \in \mathbb{C}^{n \times (n-r)}$. Here we explain the details of this operation.

For a given matrix $A \in \mathbb{R}^{n \times n}$, suppose it has an eigenvalue $\lambda$ with geometric multiplicity $r$. This way, the eigenvalue $\lambda$ has $r$ linearly independent eigenvectors, i.e.,
\[ A\vec{u}_i = \lambda\vec{u}_i, \quad i = 1, \ldots, r. \]
Furthermore, we can show that a sequence of orthonormal eigenvectors $\{\vec{v}_1, \vec{v}_2, \cdots, \vec{v}_r\}$ can be obtained by orthogonalising and normalising this set of eigenvectors $\{\vec{u}_1, \vec{u}_2, \cdots, \vec{u}_r\}$, using either Gram-Schmidt or Householder reflections. The vectors $\{\vec{v}_1, \vec{v}_2, \cdots, \vec{v}_r\}$ are still in the null space of $A - \lambda I$ (the eigenspace of $\lambda$) as they are linear combinations of $\{\vec{u}_1, \vec{u}_2, \cdots, \vec{u}_r\}$, and thus are eigenvectors. This forms the matrix $V_r$.

As in the QR factorisation, we can always construct another $n$-by-$(n-r)$ matrix with orthonormal columns,
\[ V_\perp = \begin{bmatrix} \vec{v}_{r+1} & \cdots & \vec{v}_n \end{bmatrix}, \]
such that each column of $V_\perp$ is orthogonal to all the columns of $V_r$. Since both $V_r$ and $V_\perp$ have orthonormal columns, the matrix $V = [V_r | V_\perp]$ is a unitary matrix.

Now we have
\[
V^*AV = \begin{bmatrix} V_r^* \\ V_\perp^* \end{bmatrix} A \begin{bmatrix} V_r & V_\perp \end{bmatrix}
= \begin{bmatrix} V_r^*AV_r & V_r^*AV_\perp \\ V_\perp^*AV_r & V_\perp^*AV_\perp \end{bmatrix}.
\]
Since $AV_r = \lambda V_r$ and $V_\perp$ is orthogonal to $V_r$, the above equation can be written as
\[ V^*AV = \begin{bmatrix} \lambda I_r & C \\ 0 & D \end{bmatrix}, \]
where $C = V_r^*AV_\perp$ and $D = V_\perp^*AV_\perp$. The resulting matrix $V^*AV$ is upper block triangular.

Similarly, we can construct another matrix
\[ U = \begin{bmatrix} V_\perp & V_r \end{bmatrix}, \]
and repeat the above process. This leads to
\[
U^*AU = \begin{bmatrix} V_\perp^* \\ V_r^* \end{bmatrix} A \begin{bmatrix} V_\perp & V_r \end{bmatrix}
= \begin{bmatrix} V_\perp^*AV_\perp & V_\perp^*AV_r \\ V_r^*AV_\perp & V_r^*AV_r \end{bmatrix}
= \begin{bmatrix} D & 0 \\ C & \lambda I_r \end{bmatrix},
\]
with the same $C$ and $D$ defined above. The resulting matrix $U^*AU$ is lower block triangular.

6.3 Power Iteration and Inverse Iteration

Given a matrix $A \in \mathbb{R}^{n \times n}$, we recall that eigenvalues are the roots of the characteristic polynomial $p_A(\lambda) = \det(A - \lambda I)$. This characteristic polynomial has degree $n$. For a polynomial of degree 2 or 3, well established formulas can be used to find its roots. However, as shown by Abel, Galois, and others in the nineteenth century, for a polynomial of degree $n \ge 5$ of the form
\[ p(\lambda) = a_0 + \sum_{i=1}^{n} a_i\lambda^i, \]
where each coefficient $a_i$ is a rational number, the roots cannot in general be obtained from the coefficients by algebraic expressions, that is, by finitely many additions, subtractions, multiplications, divisions, and root extractions.
This suggests that we are not able to have direct solvers for finding eigenvalues of general matrices.

Remark 6.25
Like many root-finding algorithms, eigenvalue solvers must be iterative.

6.3.1 Power Iteration

A straightforward idea is that the sequence
$$\frac{\vec{b}}{\|\vec{b}\|}, \ \frac{A\vec{b}}{\|A\vec{b}\|}, \ \frac{A^2\vec{b}}{\|A^2\vec{b}\|}, \ \ldots, \ \frac{A^k\vec{b}}{\|A^k\vec{b}\|}$$
converges to an eigenvector corresponding to the largest eigenvalue (in absolute value) of the matrix $A$. This is called the power iteration. It can be formalised as follows:

Algorithm 6.26: Power Iteration
Input: Matrix $A \in \mathbb{R}^{n \times n}$ and an initial vector $\vec{b}^{(0)} = \vec{x} \in \mathbb{R}^n$, where $\|\vec{x}\| = 1$
Output: An eigenvalue $\lambda^{(m)}$ and its eigenvector $\vec{b}^{(m)}$
1: for $k = 1, 2, \ldots, m$ do
2:   $\vec{t}^{(k)} = A \vec{b}^{(k-1)}$   ▷ Apply $A$
3:   $\vec{b}^{(k)} = \vec{t}^{(k)} / \|\vec{t}^{(k)}\|$   ▷ Normalise
4:   $\lambda^{(k)} = (\vec{b}^{(k)})^* (A \vec{b}^{(k)})$   ▷ Estimate eigenvalue
5: end for

By repeatedly applying Steps 2 and 3, the vectors $\vec{b}^{(k)}$, $k = 0, 1, \ldots, m$, follow the sequence
$$\frac{\vec{x}}{\|\vec{x}\|}, \ \frac{A\vec{x}}{\|A\vec{x}\|}, \ \frac{A^2\vec{x}}{\|A^2\vec{x}\|}, \ \ldots, \ \frac{A^m\vec{x}}{\|A^m\vec{x}\|}.$$
Suppose $\vec{b}^{(k)}$ is an eigenvector of $A$, then we have $A\vec{b}^{(k)} = \lambda^{(k)} \vec{b}^{(k)}$. As $\vec{b}^{(k)}$ is normalised, multiplying both sides by $(\vec{b}^{(k)})^*$ leads to the ratio
$$\lambda^{(k)} = \frac{(\vec{b}^{(k)})^* (A\vec{b}^{(k)})}{(\vec{b}^{(k)})^* \vec{b}^{(k)}} = (\vec{b}^{(k)})^* (A\vec{b}^{(k)}). \qquad (6.12)$$

Definition 6.27
The ratio
$$r(\vec{b}) = \frac{\vec{b}^* A \vec{b}}{\vec{b}^* \vec{b}}, \qquad (6.13)$$
can be understood as follows: given a direction $\vec{b}$, what scalar $\lambda$ acts most like an eigenvalue for $\vec{b}$, in the sense of minimising $f(\lambda) = \|A\vec{b} - \lambda\vec{b}\|^2$? Differentiating this term with respect to $\lambda$, we have
$$\frac{\partial f}{\partial \lambda} = \frac{\partial \|A\vec{b} - \lambda\vec{b}\|^2}{\partial \lambda} = -2\,\vec{b}^*(A\vec{b} - \lambda\vec{b}).$$
At the $\lambda$ such that $\frac{\partial f}{\partial \lambda} = 0$, $f(\lambda)$ attains its minimum (as the second derivative, $2\vec{b}^*\vec{b} = 2\|\vec{b}\|^2$, is positive), and thus we have $\lambda = r(\vec{b})$ as defined above. For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, this ratio is called the Rayleigh quotient.

6.3.2 Convergence of Power Iteration

We want to show the convergence of the power iteration in two aspects. We first show that the sequence $\vec{b}^{(k)}$ converges linearly to an eigenvector corresponding to the largest eigenvalue.
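Then we prove that for an estimated eigenvector, the estimated eigenvalue given by the ratio (6.13) converges linearly to the corresponding eigenvalue.

Theorem 6.28
Assume a matrix $A \in \mathbb{R}^{n \times n}$ is non-defective. Suppose its eigenvalues are ordered so that $|\lambda_1| > |\lambda_2| \geq \cdots \geq |\lambda_n|$. Let $\vec{v}_1, \ldots, \vec{v}_n$ denote (normalised) eigenvectors corresponding to each of the eigenvalues. Suppose further we have an initial vector $\vec{b}^{(0)} = \vec{x}$ such that $\vec{x}^* \vec{v}_1 \neq 0$. Then the vector $\vec{b}^{(k)}$ in the power iteration satisfies
$$\|\vec{b}^{(k)} - (\pm \vec{v}_1)\| = O\left( \left| \frac{\lambda_2}{\lambda_1} \right|^k \right), \quad \text{as } k \to \infty.$$
The $\pm$ sign means that, at each step, one or the other choice of sign is to be taken.

Proof. We represent $\vec{x}$ as a linear combination of all the (normalised) eigenvectors $\vec{v}_1, \ldots, \vec{v}_n$, which takes the form of
$$\vec{x} = \sum_{i=1}^{n} a_i \vec{v}_i.$$
Let
$$V = \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \cdots & \vec{v}_n \end{bmatrix}, \quad \Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}, \quad \text{and} \quad \vec{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix},$$
we have $A = V \Lambda V^{-1}$ and $\vec{x} = V\vec{a}$, and hence
$$\vec{b}^{(k)} = c^{(k)} A^k \vec{x} = c^{(k)} V \Lambda^k V^{-1} V \vec{a} = c^{(k)} V \Lambda^k \vec{a} = c^{(k)} \sum_{i=1}^{n} \lambda_i^k a_i \vec{v}_i,$$
where $c^{(k)}$ is the scalar that normalises the vector $\vec{b}^{(k)}$. Now we bring $\lambda_1^k$ outside of the summation in the form of
$$\vec{b}^{(k)} = c^{(k)} \lambda_1^k \left( \sum_{i=1}^{n} \frac{\lambda_i^k}{\lambda_1^k} a_i \vec{v}_i \right) = c^{(k)} \lambda_1^k a_1 \vec{v}_1 + c^{(k)} \lambda_1^k \sum_{i=2}^{n} \frac{\lambda_i^k}{\lambda_1^k} a_i \vec{v}_i.$$
Therefore, the convergence of $\vec{b}^{(k)}$ to $\pm\vec{v}_1$ is dominated by the rate at which each of the ratios $\lambda_i^k / \lambda_1^k$, $i \geq 2$, vanishes, which is on the order of $\left| \frac{\lambda_2}{\lambda_1} \right|^k$.

Theorem 6.29
Assume a non-symmetric matrix $A \in \mathbb{R}^{n \times n}$ is non-defective. Suppose $\lambda_K$ is an eigenvalue of $A$ with an eigenvector $\vec{v}_K$. The ratio
$$r(\vec{b}) = \frac{\vec{b}^* A \vec{b}}{\vec{b}^* \vec{b}},$$
is a linearly accurate estimate of the eigenvalue $\lambda_K$:
$$|r(\vec{b}) - \lambda_K| = O(\|\vec{b} - \vec{v}_K\|), \quad \text{as } \vec{b} \to \vec{v}_K.$$

Proof. We represent $\vec{b}$ as a linear combination of all the eigenvectors $\vec{v}_1, \ldots, \vec{v}_n$, which takes the form of
$$\vec{b} = \sum_{i=1}^{n} a_i \vec{v}_i.$$
As defined in the previous proof, we have $A = V \Lambda V^{-1}$ and $\vec{b} = V\vec{a}$, and hence
$$A\vec{b} = V \Lambda V^{-1} V \vec{a} = V \Lambda \vec{a} = \sum_{i=1}^{n} \lambda_i a_i \vec{v}_i.$$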
This way the ratio $r(\vec{b})$ can be written as
$$r(\vec{b}) = \frac{\left( \sum_{i=1}^{n} \lambda_i a_i \vec{v}_i \right)^* \vec{b}}{\vec{b}^* \vec{b}}.$$
Thus, the error in the eigenvalue estimate takes the form
$$r(\vec{b}) - \lambda_K = \frac{\left( \sum_{i=1}^{n} \lambda_i a_i \vec{v}_i \right)^* \vec{b}}{\vec{b}^* \vec{b}} - \lambda_K \frac{\vec{b}^* \vec{b}}{\vec{b}^* \vec{b}} = \frac{\left( \sum_{i=1}^{n} \lambda_i a_i \vec{v}_i \right)^* \vec{b} - \lambda_K \left( \sum_{i=1}^{n} a_i \vec{v}_i^* \right) \vec{b}}{\vec{b}^* \vec{b}} = \frac{\sum_{i \neq K} (\lambda_i - \lambda_K)\, a_i \vec{v}_i^* \vec{b}}{\vec{b}^* \vec{b}}.$$
Now, we can express the error as a weighted sum of $a_i$ for $i \neq K$, which is in the form of
$$r(\vec{b}) - \lambda_K = \sum_{i \neq K} a_i w_i, \quad \text{where } w_i = \frac{(\lambda_i - \lambda_K)\, \vec{v}_i^* \vec{b}}{\vec{b}^* \vec{b}}.$$
Given $\vec{b} = a_K \vec{v}_K + \sum_{i \neq K} a_i \vec{v}_i$, if $\vec{b}$ is close to $\vec{v}_K$, each $a_i$ for $i \neq K$ is on the order of $\|\vec{b} - \vec{v}_K\|$. Therefore, $r(\vec{b})$ converges linearly to the eigenvalue $\lambda_K$ as $\vec{b} \to \vec{v}_K$.

Power iteration by itself can be slow when $|\lambda_2 / \lambda_1|$ is close to 1, and in general it does not converge if $|\lambda_1| = |\lambda_2|$. Nevertheless, it serves as a basis for many powerful eigenvalue algorithms that we will explore in later sections. It also reveals the iterative nature of eigenvalue solvers.

6.3.3 Shifted Power Method

We have observed that if the first and second largest eigenvalues (in absolute value) are close, the power iteration suffers from slow convergence. One simple yet powerful idea to handle this situation is to use a shifted matrix $A + \sigma I$.

Theorem 6.30
If $\lambda$ is an eigenvalue of $A$, then $\lambda + \mu$ is an eigenvalue of $A + \mu I$. Furthermore, if $\vec{v}$ is an eigenvector of $A$ associated with $\lambda$, then $\vec{v}$ is also an eigenvector of $A + \mu I$ associated with $\lambda + \mu$.

Using the shifted matrix $A + \sigma I$, we can enhance the ratio between the first and second largest eigenvalues.

6.3.4 Inverse Iteration

There also exist alternative ways to enhance the ratio between eigenvalues.

Theorem 6.31
Suppose $\mu$ is not an eigenvalue of $A \in \mathbb{R}^{n \times n}$. Then the eigenvectors of $(A - \mu I)^{-1}$ are the same as those of $A$, and the corresponding eigenvalues are $(\lambda_i - \mu)^{-1}$, $i = 1, \ldots, n$, where $\lambda_i$, $i = 1, \ldots, n$, are the eigenvalues of $A$.

This theorem suggests choosing a $\mu$ that is close to an eigenvalue $\lambda_K$.
Then the eigenvalue $(\lambda_K - \mu)^{-1}$ may be much larger than the other eigenvalues, $(\lambda_i - \mu)^{-1}$, $i \neq K$, of the matrix $(A - \mu I)^{-1}$. This leads to the inverse iteration.

Algorithm 6.32: Inverse Iteration
Input: Matrix $A \in \mathbb{R}^{n \times n}$, an initial vector $\vec{b}^{(0)} = \vec{x} \in \mathbb{R}^n$ where $\|\vec{x}\| = 1$, and a shift scalar $\mu \in \mathbb{R}$
Output: An eigenvalue $\lambda^{(m)}$ and its eigenvector $\vec{b}^{(m)}$
1: for $k = 1, 2, \ldots, m$ do
2:   Solve $(A - \mu I)\vec{w}^{(k)} = \vec{b}^{(k-1)}$ for $\vec{w}^{(k)}$   ▷ Apply $(A - \mu I)^{-1}$
3:   $\vec{b}^{(k)} = \vec{w}^{(k)} / \|\vec{w}^{(k)}\|$   ▷ Normalise
4:   $\lambda^{(k)} = (\vec{b}^{(k)})^* (A \vec{b}^{(k)})$   ▷ Estimate eigenvalue
5: end for

6.3.5 Convergence of Inverse Iteration

Theorem 6.33
Given a nondefective matrix $A \in \mathbb{R}^{n \times n}$, suppose $\lambda_K$ is the closest eigenvalue to $\mu$ and $\lambda_L$ is the second closest, that is, $|\lambda_K - \mu| < |\lambda_L - \mu| \leq |\lambda_i - \mu|$ for each $i \neq K$. Let $\vec{v}_1, \ldots, \vec{v}_n$ denote eigenvectors corresponding to each of the eigenvalues of $A$. Suppose further we have an initial vector $\vec{x}$ such that $\vec{x}^* \vec{v}_K \neq 0$. Then the vector $\vec{b}^{(k)}$ in the inverse iteration satisfies
$$\|\vec{b}^{(k)} - (\pm\vec{v}_K)\| = O\left( \left| \frac{\lambda_K - \mu}{\lambda_L - \mu} \right|^k \right),$$
and the estimated eigenvalue $\lambda^{(k)}$ satisfies
$$|\lambda^{(k)} - \lambda_K| = O\left( \left| \frac{\lambda_K - \mu}{\lambda_L - \mu} \right|^k \right).$$

Proof. Using Theorem 6.31, we can show that the matrix $B = (A - \mu I)^{-1}$ has eigenvalues $z_i = (\lambda_i - \mu)^{-1}$, $i = 1, \ldots, n$, associated with (normalised) eigenvectors $\vec{v}_1, \ldots, \vec{v}_n$. Note that the eigenvalues are ordered as $|z_K| > |z_L| \geq |z_i|$ for each $i \neq K$. Using the same argument as in the proof of Theorem 6.28, we can show that
$$\|\vec{b}^{(k)} - (\pm\vec{v}_K)\| = O\left( \left| \frac{z_L}{z_K} \right|^k \right) = O\left( \left| \frac{\lambda_K - \mu}{\lambda_L - \mu} \right|^k \right).$$
Since the estimated eigenvector $\vec{b}^{(k)}$ converges to $\pm\vec{v}_K$ at the rate $O\left( \left| \frac{z_L}{z_K} \right|^k \right)$, applying Theorem 6.29 shows that $\lambda^{(k)} = r(\vec{b}^{(k)})$ satisfies
$$|\lambda^{(k)} - \lambda_K| = O\left( \left| \frac{\lambda_K - \mu}{\lambda_L - \mu} \right|^k \right).$$

Remark 6.34
When $\mu$ is close to an eigenvalue, Step 2 of the inverse iteration relies on solving a linear system that is exceedingly ill-conditioned. Will this create any fatal flaw in the algorithm?
Fortunately, this does not introduce a fatal flaw if the linear system is solved by a stable method.

Step 2 of the inverse iteration solves $(A - \mu I)\vec{w}^{(k)} = \vec{b}^{(k-1)}$ for $\vec{w}^{(k)}$. Suppose $\mu$ is close to an eigenvalue $\lambda_J$ with an eigenvector $\vec{v}_J$. Using Theorem 6.30, we can show that the matrix $C = A - \mu I$ has eigenvalues $\sigma_i = \lambda_i - \mu$, $i = 1, \ldots, n$, associated with (normalised) eigenvectors $\vec{v}_1, \ldots, \vec{v}_n$. Given a diagonal matrix $D$ where $D_{ii} = \sigma_i$, the matrix $C$ has the similarity transformation $C = V D V^{-1}$. We can express the right hand side vector $\vec{b}^{(k-1)}$ as a linear combination of eigenvectors, $\vec{b}^{(k-1)} = V\vec{a}$. This way, $\vec{w}^{(k)} = C^{-1}\vec{b}^{(k-1)}$ can be written as
$$\vec{w}^{(k)} = V D^{-1} \vec{a} = \frac{a_J}{\lambda_J - \mu} \vec{v}_J + \sum_{i \neq J} \frac{a_i}{\lambda_i - \mu} \vec{v}_i. \qquad (6.14)$$
If $\mu$ is close to $\lambda_J$, the first term dominates, so $\vec{w}^{(k)}$ lies close to the direction of the eigenvector we want to approximate.

Now we deal with the ill-conditioning part. We want to examine the stability of $\vec{w}^{(k)}$ given a small perturbation to $C$ and $\vec{b}^{(k-1)}$:
$$(C + \delta C)(\vec{w}^{(k)} + \delta\vec{w}) = \vec{b}^{(k-1)} + \delta\vec{b}.$$
The left hand side takes the form of
$$(C + \delta C)(\vec{w}^{(k)} + \delta\vec{w}) = C\vec{w}^{(k)} + C\,\delta\vec{w} + \delta C\,\vec{w}^{(k)} + \delta C\,\delta\vec{w}.$$
Since the double perturbation term $\delta C\,\delta\vec{w}$ can be neglected and $C\vec{w}^{(k)} = \vec{b}^{(k-1)}$, we have
$$\delta\vec{w} = C^{-1}(\delta\vec{b} - \delta C\,\vec{w}^{(k)}).$$
Without loss of generality, we can express $\delta\vec{b} - \delta C\,\vec{w}^{(k)}$ as a linear combination of eigenvectors, $\delta\vec{b} - \delta C\,\vec{w}^{(k)} = V\vec{d}$. Using the eigendecomposition of $C$, we have
$$\delta\vec{w} = V D^{-1} \vec{d} = \frac{d_J}{\lambda_J - \mu} \vec{v}_J + \sum_{i \neq J} \frac{d_i}{\lambda_i - \mu} \vec{v}_i. \qquad (6.15)$$
If $\mu$ is close to $\lambda_J$, the perturbation to the solution, $\delta\vec{w}$, also lies predominantly along the desired eigenvector we want to approximate.

Therefore, as long as the linear system is solved by a stable method (for example, LU with pivoting) that produces a solution $\vec{w} + \delta\vec{w}$, both $\vec{w}$ and $\vec{w} + \delta\vec{w}$ lie closely along the same direction $\vec{v}_J$. One step of normalisation will resolve the difference in size.
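Algorithms 6.26 and 6.32 can be sketched in a few lines of NumPy. The code below is only an illustration under stated assumptions, not a reference implementation: the function names `power_iteration` and `inverse_iteration`, the 2-by-2 test matrix, the shift and the iteration counts are all chosen here for demonstration, and the linear system in the inverse iteration is solved with `np.linalg.solve` (an LU-based solver with pivoting).

```python
import numpy as np


def power_iteration(A, x, m):
    """Sketch of Algorithm 6.26: returns (eigenvalue estimate, eigenvector estimate)."""
    b = x / np.linalg.norm(x)
    lam = b @ A @ b
    for _ in range(m):
        t = A @ b                      # Step 2: apply A
        b = t / np.linalg.norm(t)      # Step 3: normalise
        lam = b @ A @ b                # Step 4: eigenvalue estimate, ratio (6.12)
    return lam, b


def inverse_iteration(A, x, mu, m):
    """Sketch of Algorithm 6.32 with a fixed shift mu."""
    C = A - mu * np.eye(A.shape[0])
    b = x / np.linalg.norm(x)
    lam = b @ A @ b
    for _ in range(m):
        w = np.linalg.solve(C, b)      # Step 2: solve (A - mu I) w = b
        b = w / np.linalg.norm(w)      # Step 3: normalise
        lam = b @ A @ b                # Step 4: eigenvalue estimate
    return lam, b


# Hypothetical symmetric test matrix with eigenvalues (5 +/- sqrt(5)) / 2.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, 0.0])

lam_max, v_max = power_iteration(A, x, m=50)
lam_near, v_near = inverse_iteration(A, x, mu=1.0, m=20)

# Print each estimate together with the residual norm ||A b - lambda b||.
print(lam_max, np.linalg.norm(A @ v_max - lam_max * v_max))
print(lam_near, np.linalg.norm(A @ v_near - lam_near * v_near))
```

With the shift $\mu = 1$ the inverse iteration targets the eigenvalue of this matrix closest to 1, whereas the power iteration targets the eigenvalue of largest absolute value.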
6.4 Symmetric Matrices and Rayleigh Quotient Iteration

In this section, we focus on applying the power iteration and the inverse iteration to symmetric matrices. The eigenvalue estimates for symmetric matrices converge faster than those for non-symmetric matrices. We will also show a new algorithm that combines eigenvalue estimation using the Rayleigh quotient with the inverse iteration to further enhance the convergence speed.

6.4.1 Rate of Convergence

Definition 6.35
Suppose we have a sequence $y^{(1)}, y^{(2)}, \ldots, y^{(k)}, \ldots$ that converges to a number $y$. We say the sequence converges linearly to $y$ if
$$\lim_{k \to \infty} \frac{|y^{(k+1)} - y|}{|y^{(k)} - y|} = \sigma,$$
for some $\sigma \in (0, 1)$. More generally, suppose the sequence converges with an iteration dependent ratio $\sigma_k \in (0, 1)$,
$$\frac{|y^{(k+1)} - y|}{|y^{(k)} - y|} = \sigma_k.$$
We say the sequence converges superlinearly to $y$ if $\sigma_k \to 0$ as $k \to \infty$. We say the sequence converges sublinearly to $y$ if $\sigma_k \to 1$ as $k \to \infty$.

An alternative way of viewing this is to look at the error in the logarithmic scale:
$$\log(|y^{(k+1)} - y|) - \log(|y^{(k)} - y|) = \log(\sigma_k).$$
If $\log(\sigma_k) < 0$ is a constant, then the logarithm of the error decreases linearly. If $\log(\sigma_k) \to -\infty$ as $k \to \infty$, then the error decreases superlinearly. If $\log(\sigma_k) \to 0$ as $k \to \infty$, then the error decreases sublinearly.

Definition 6.36
Suppose we have a sequence $y^{(1)}, y^{(2)}, \ldots, y^{(k)}, \ldots$ that converges to a number $y$. We say the sequence converges with order $q$ to $y$ if
$$\lim_{k \to \infty} \frac{|y^{(k+1)} - y|}{|y^{(k)} - y|^q} = \gamma,$$
for some $\gamma > 0$. For example, $q = 2$ gives quadratic convergence. Using the logarithmic scale:
$$\lim_{k \to \infty} \left[ \log(|y^{(k+1)} - y|) - q \log(|y^{(k)} - y|) \right] = \log(\gamma).$$

6.4.2 Power Iteration and Inverse Iteration for Symmetric Matrices

Recall the Rayleigh quotient
$$r(\vec{b}) = \frac{\vec{b}^* A \vec{b}}{\vec{b}^* \vec{b}}, \qquad (6.16)$$
for estimating eigenvalues given a vector $\vec{b}$. Now we want to assess the accuracy of this eigenvalue estimate for symmetric matrices.
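Before the formal statement, a small numerical experiment can make the difference between the symmetric and the non-symmetric case visible. The sketch below perturbs an exact eigenvector by $\epsilon \vec{d}$ and records the error of the ratio (6.16). The two 2-by-2 matrices, the perturbation direction `d` and the function name `rayleigh_quotient` are assumptions made for illustration only.

```python
import numpy as np


def rayleigh_quotient(A, b):
    """Ratio (6.16) for a real vector b: b^T A b / b^T b."""
    return (b @ A @ b) / (b @ b)


S = np.array([[2.0, 1.0],
              [1.0, 3.0]])     # hypothetical symmetric matrix
N = np.array([[2.0, 1.0],
              [0.0, 3.0]])     # hypothetical non-symmetric matrix, eigenvalues 2 and 3

lam_S, Q = np.linalg.eigh(S)
q = Q[:, 0]                    # unit eigenvector of S for the eigenvalue lam_S[0]
v = np.array([1.0, 0.0])       # eigenvector of N for the eigenvalue 2
d = np.array([0.0, 1.0])       # perturbation direction (an arbitrary choice)

errors = []
for eps in (1e-2, 1e-3, 1e-4):
    err_S = abs(rayleigh_quotient(S, q + eps * d) - lam_S[0])
    err_N = abs(rayleigh_quotient(N, v + eps * d) - 2.0)
    errors.append((eps, err_S, err_N))
    print(f"eps={eps:.0e}  symmetric: {err_S:.2e}  non-symmetric: {err_N:.2e}")
```

In this experiment the symmetric error shrinks roughly like $\epsilon^2$ and the non-symmetric error roughly like $\epsilon$, in line with Theorem 6.29 and with the theorem that follows.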
Theorem 6.37
Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$, suppose $\lambda_K$ is an eigenvalue of $A$ with an eigenvector $\vec{q}_K$. The ratio
$$r(\vec{b}) = \frac{\vec{b}^* A \vec{b}}{\vec{b}^* \vec{b}},$$
is a quadratically accurate estimate of the eigenvalue $\lambda_K$:
$$|r(\vec{b}) - \lambda_K| = O(\|\vec{b} - \vec{q}_K\|^2), \quad \text{as } \vec{b} \to \vec{q}_K.$$

Proof. A symmetric matrix $A$ has an eigendecomposition $A = Q \Lambda Q^*$, where $Q$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. Each diagonal entry $\lambda_i = \Lambda_{ii}$ is an eigenvalue of $A$, and the corresponding $i$-th column of $Q$, $\vec{q}_i = Q(:, i)$, is an eigenvector with eigenvalue $\lambda_i$. We represent $\vec{b}$ as a linear combination of all the eigenvectors $\vec{q}_1, \ldots, \vec{q}_n$, which takes the form of
$$\vec{b} = \sum_{i=1}^{n} a_i \vec{q}_i, \quad \text{or} \quad \vec{b} = Q\vec{a}.$$
Now we have
$$\vec{b}^* A \vec{b} = \vec{a}^* Q^* Q \Lambda Q^* Q \vec{a} = \vec{a}^* \Lambda \vec{a} = \sum_{i=1}^{n} \lambda_i a_i^2,$$
since $Q$ is orthogonal. This way the ratio $r(\vec{b})$ can be written as
$$r(\vec{b}) = \frac{\sum_{i=1}^{n} \lambda_i a_i^2}{\vec{b}^* \vec{b}}.$$
Thus, using $\vec{b}^* \vec{b} = \sum_{i=1}^{n} a_i^2$, the error in the eigenvalue estimate takes the form
$$r(\vec{b}) - \lambda_K = \frac{\sum_{i=1}^{n} \lambda_i a_i^2}{\vec{b}^* \vec{b}} - \lambda_K \frac{\sum_{i=1}^{n} a_i^2}{\vec{b}^* \vec{b}} = \frac{\sum_{i \neq K} (\lambda_i - \lambda_K)\, a_i^2}{\vec{b}^* \vec{b}}.$$
Now, we can express the error as a weighted sum of $a_i^2$ for $i \neq K$, which is in the form of
$$r(\vec{b}) - \lambda_K = \sum_{i \neq K} a_i^2 w_i, \quad \text{where } w_i = \frac{\lambda_i - \lambda_K}{\vec{b}^* \vec{b}}.$$
Given $\vec{b} = a_K \vec{q}_K + \sum_{i \neq K} a_i \vec{q}_i$, if $\vec{b}$ is close to $\vec{q}_K$, each $a_i$ for $i \neq K$ is on the order of $\|\vec{b} - \vec{q}_K\|$, and hence $a_i^2 = O(\|\vec{b} - \vec{q}_K\|^2)$ for $i \neq K$. Therefore, $r(\vec{b})$ converges quadratically to the eigenvalue $\lambda_K$ as $\vec{b} \to \vec{q}_K$.

Not surprisingly, when the power iteration (Algorithm 6.26) is applied to a symmetric matrix, it has linear convergence in the eigenvector estimate and, by Theorem 6.37, quadratic accuracy of the eigenvalue estimate with respect to the eigenvector error, given that the ratio between the first and second largest eigenvalues (in absolute value) is not 1. A similar result holds for the inverse iteration as well.

Theorem 6.38
Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$, suppose its eigenvalues are ordered so that $|\lambda_1| > |\lambda_2| \geq \cdots \geq |\lambda_n|$. Let $\vec{q}_1, \ldots, \vec{q}_n$ denote (normalised) eigenvectors corresponding to each of the eigenvalues.
Suppose further we have an initial vector $\vec{b}^{(0)} = \vec{x}$ such that $\vec{x}^* \vec{q}_1 \neq 0$. Then the vector $\vec{b}^{(k)}$ in the power iteration converges as
$$\|\vec{b}^{(k)} - (\pm\vec{q}_1)\| = O\left( \left| \frac{\lambda_2}{\lambda_1} \right|^k \right),$$
and the estimated eigenvalue $\lambda^{(k)}$ converges as
$$|\lambda^{(k)} - \lambda_1| = O\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{2k} \right).$$

Theorem 6.39
Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$, suppose $\lambda_K$ is the closest eigenvalue to $\mu$ and $\lambda_L$ is the second closest, that is, $|\lambda_K - \mu| < |\lambda_L - \mu| \leq |\lambda_i - \mu|$ for each $i \neq K$. Let $\vec{q}_1, \ldots, \vec{q}_n$ denote eigenvectors corresponding to each of the eigenvalues of $A$. Suppose further we have an initial vector $\vec{x}$ such that $\vec{x}^* \vec{q}_K \neq 0$. Then the vector $\vec{b}^{(k)}$ in the inverse iteration converges as
$$\|\vec{b}^{(k)} - (\pm\vec{q}_K)\| = O\left( \left| \frac{\lambda_K - \mu}{\lambda_L - \mu} \right|^k \right),$$
and the estimated eigenvalue $\lambda^{(k)}$ converges as
$$|\lambda^{(k)} - \lambda_K| = O\left( \left| \frac{\lambda_K - \mu}{\lambda_L - \mu} \right|^{2k} \right).$$

6.4.3 Rayleigh Quotient Iteration

Given a good estimate of an eigenvalue, the inverse iteration finds the corresponding eigenvector very quickly, while the Rayleigh quotient estimates the eigenvalue for a given vector. It is natural to combine both ideas. This leads to the Rayleigh quotient iteration.

Algorithm 6.40: Rayleigh Quotient Iteration
Input: Matrix $A \in \mathbb{R}^{n \times n}$ and an initial vector $\vec{b}^{(0)} = \vec{x} \in \mathbb{R}^n$ where $\|\vec{x}\| = 1$
Output: An eigenvalue $\lambda^{(m)}$ and its eigenvector $\vec{b}^{(m)}$
1: $\lambda^{(0)} = (\vec{b}^{(0)})^* (A \vec{b}^{(0)})$
2: for $k = 1, 2, \ldots, m$ do
3:   Solve $(A - \lambda^{(k-1)} I)\vec{w}^{(k)} = \vec{b}^{(k-1)}$ for $\vec{w}^{(k)}$   ▷ Apply $(A - \lambda^{(k-1)} I)^{-1}$
4:   $\vec{b}^{(k)} = \vec{w}^{(k)} / \|\vec{w}^{(k)}\|$   ▷ Normalise
5:   $\lambda^{(k)} = (\vec{b}^{(k)})^* (A \vec{b}^{(k)})$   ▷ Estimate eigenvalue
6: end for

In the Rayleigh quotient iteration, we first compute an estimate of the eigenvalue for the initial vector. Then, in each iteration, we feed the estimated eigenvalue from the previous step into the shifted inverse iteration (for estimating the eigenvector) as the shift. This leads to spectacular convergence.

Theorem 6.41
Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$. Suppose the initial vector is close to an eigenvector $\vec{q}_K$ corresponding to an eigenvalue $\lambda_K$.
Then the vector $\vec{b}^{(k)}$ in the Rayleigh quotient iteration converges cubically as
$$\|\vec{b}^{(k+1)} - (\pm\vec{q}_K)\| = O\left( \|\vec{b}^{(k)} - (\pm\vec{q}_K)\|^3 \right),$$
and the estimated eigenvalue $\lambda^{(k)}$ converges cubically as
$$|\lambda^{(k+1)} - \lambda_K| = O\left( |\lambda^{(k)} - \lambda_K|^3 \right).$$
Note that the $\pm$ signs on both sides are not necessarily the same in the above equations.

Proof. Here we employ a rather restrictive assumption that the eigenvalue $\lambda_K$ is simple. Let $\|\vec{b}^{(k)} - (\pm\vec{q}_K)\| = \epsilon$. For sufficiently small $\epsilon$, using Theorem 6.37 we can show that $|\lambda^{(k)} - \lambda_K| = O(\epsilon^2)$. Now consider taking one step of the inverse iteration with shift $\lambda^{(k)}$. The ratio of the errors of the eigenvector estimates in adjacent steps can be written as
$$\frac{\|\vec{b}^{(k+1)} - (\pm\vec{q}_K)\|}{\|\vec{b}^{(k)} - (\pm\vec{q}_K)\|} = O\left( \left| \frac{\lambda_K - \lambda^{(k)}}{\lambda_L - \lambda^{(k)}} \right| \right).$$
Since $|\lambda^{(k)} - \lambda_K| = O(\epsilon^2)$ and the right hand side of the above equation is on the order of $|\lambda_K - \lambda^{(k)}|$, we have
$$\|\vec{b}^{(k+1)} - (\pm\vec{q}_K)\| = O\left( \epsilon^2\, \|\vec{b}^{(k)} - (\pm\vec{q}_K)\| \right) = O(\epsilon^3).$$
This completes the proof of the first equation (convergence of the eigenvector estimate is cubic). For the eigenvalue estimate at step $k+1$, since the Rayleigh quotient is quadratically accurate, we have
$$|\lambda^{(k+1)} - \lambda_K| = O\left( \|\vec{b}^{(k+1)} - (\pm\vec{q}_K)\|^2 \right) = O(\epsilon^6).$$
Comparing this to the accuracy of the eigenvalue estimate at step $k$, which is $O(\epsilon^2)$, we can conclude that the second equation (convergence of the eigenvalue estimate is cubic) also holds.

With similar reasoning, we can show that the Rayleigh quotient iteration converges quadratically on non-symmetric matrices.

6.4.4 Summary of Power, Inverse, and Rayleigh Quotient Iterations

The convergence of the power, inverse, and Rayleigh quotient iterations can be summarised in Table 6.1.

Table 6.1: Let $a = \left| \frac{\lambda_2}{\lambda_1} \right|$ and $b = \left| \frac{\lambda_K - \mu}{\lambda_L - \mu} \right|$ as defined in the power iteration and the inverse iteration.

             Symmetric matrices                        Non-symmetric matrices
             Eigenvector         Eigenvalue            Eigenvector         Eigenvalue
  Power      Linear $O(a^k)$     Linear $O(a^{2k})$    Linear $O(a^k)$     Linear $O(a^k)$
  Inverse    Linear $O(b^k)$     Linear $O(b^{2k})$    Linear $O(b^k)$     Linear $O(b^k)$
  Rayleigh   Cubic               Cubic                 Quadratic †         Quadratic †

We note that the Rayleigh quotient iteration may not always converge for non-symmetric matrices, and the quadratic convergence (marked † in Table 6.1) can only be obtained in limited cases.

In terms of operation counts, the power iteration requires $O(n^2)$ flops per iteration for the matrix-vector products. The inverse and Rayleigh quotient iterations require solving a linear system for the eigenvector estimation and an additional matrix-vector product for the eigenvalue estimation. For a general dense matrix, these two operations require $O(n^3)$ and $O(n^2)$ flops, respectively. In general, if we can transform the input matrix into a reduced form, namely a tridiagonal matrix (for the symmetric case) or a Hessenberg matrix (for the general case), the order of the operation counts may be greatly reduced.

Chapter 7 QR Algorithm for Eigenvalues

Many general purpose eigenvalue solvers are based on the Schur factorisation. Recall that the Schur factorisation of a matrix $A \in \mathbb{R}^{n \times n}$ takes the form $A = Q T Q^*$, where $T$ is upper triangular and $Q$ is unitary. The eigenvalues of $T$, and hence the eigenvalues of $A$, are the entries on the main diagonal of $T$.

We aim to construct a sequence of elementary unitary transformations $Q_k^* A Q_k$, so that the product
$$Q_k^* \cdots Q_2^* Q_1^* \, A \, Q_1 Q_2 \cdots Q_k \qquad (7.1)$$
converges to an upper triangular matrix $T$ as $k \to \infty$. Effectively, we construct a unitary matrix $Q$ in the form of $Q = Q_1 Q_2 \cdots Q_k$ in this process.

For a real symmetric matrix $A \in \mathbb{R}^{n \times n}$, let each $Q_k \in \mathbb{R}^{n \times n}$ be an orthogonal (real) matrix. Then $Q_k^* \cdots Q_2^* Q_1^* \, A \, Q_1 Q_2 \cdots Q_k$ is also symmetric and real. Therefore, the same algorithm should produce an upper triangular and symmetric matrix $T$, which is diagonal.
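The structural facts used in this argument can be checked numerically. The following sketch, under assumptions chosen purely for illustration (a hypothetical 3-by-3 symmetric matrix and two random orthogonal factors obtained from `np.linalg.qr`), verifies that a product of orthogonal similarity transformations preserves both the symmetry and the eigenvalues, and that the limiting form $T = Q^\top A Q$ is diagonal when $Q$ holds the orthonormal eigenvectors. It does not implement the iterative process itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real symmetric test matrix.
A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 1.0]])

# Two orthogonal factors Q1, Q2 and their product Q = Q1 Q2, as in (7.1).
Q1, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Q = Q1 @ Q2
B = Q.T @ A @ Q                        # Q2^T Q1^T A Q1 Q2

is_orthogonal = np.allclose(Q.T @ Q, np.eye(3))
is_symmetric = np.allclose(B, B.T)
same_spectrum = np.allclose(np.linalg.eigvalsh(B), np.linalg.eigvalsh(A))

# With Q taken as the matrix of orthonormal eigenvectors, T = Q^T A Q is diagonal.
lam, V = np.linalg.eigh(A)
T = V.T @ A @ V
is_diagonal = np.allclose(T, np.diag(lam))

print(is_orthogonal, is_symmetric, same_spectrum, is_diagonal)
```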
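7.1 Two Phases of Eigenvalue Computation

Definition 7.1
A Hessenberg matrix is a nearly triangular square matrix. An upper Hessenberg matrix has zero entries below the first subdiagonal, and a lower Hessenberg matrix has zero entries above the first superdiagonal, as shown below.
$$\underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \\ & & & \times & \times \end{bmatrix}}_{\text{Upper Hessenberg}} \qquad \underbrace{\begin{bmatrix} \times & \times & & & \\ \times & \times & \times & & \\ \times & \times & \times & \times & \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{\text{Lower Hessenberg}}$$

The sequence (7.1) is usually split into two phases. In the first phase, a matrix is transformed to an upper Hessenberg matrix by a direct method. In the second phase, an iterative process (as described earlier) is applied to transform the Hessenberg matrix to an upper triangular matrix. The process looks like the following:
$$\underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{A \neq A^*} \xrightarrow[Q_0^* A Q_0]{\text{Phase 1}} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \\ & & & \times & \times \end{bmatrix}}_{\text{Hessenberg}} \xrightarrow[Q^* A Q]{\text{Phase 2}} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \\ & & & \times & \times \\ & & & & \times \end{bmatrix}}_{T}$$

For a real symmetric matrix, Phase 1 will produce an upper Hessenberg and symmetric matrix, which is tridiagonal. Phase 2 will produce a diagonal matrix as previously discussed.
$$\underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{A = A^*} \xrightarrow[Q_0^* A Q_0]{\text{Phase 1}} \underbrace{\begin{bmatrix} \times & \times & & & \\ \times & \times & \times & & \\ & \times & \times & \times & \\ & & \times & \times & \times \\ & & & \times & \times \end{bmatrix}}_{\text{Tridiagonal}} \xrightarrow[Q^* A Q]{\text{Phase 2}} \underbrace{\begin{bmatrix} \times & & & & \\ & \times & & & \\ & & \times & & \\ & & & \times & \\ & & & & \times \end{bmatrix}}_{T}$$

Phase 1 uses a direct method whose operation count is comparable to that of a QR or LU factorisation. Once the matrix has been transformed to an upper Hessenberg or tridiagonal matrix, the operation count of matrix factorisations can be reduced by utilising the structure of the Hessenberg or tridiagonal matrix. This fact can be used to greatly reduce the operation count of the iterative process in Phase 2.

7.2 Hessenberg Form and Tridiagonal Form

To compute the Schur factorisation $A = Q T Q^*$, we would like to apply unitary similarity transformations to $A$ so that zeros below the diagonal can be introduced.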
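$$\underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{A = Q T Q^*} \xrightarrow{Q^* A Q} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \\ & & & \times & \times \\ & & & & \times \end{bmatrix}}_{T = Q^* A Q}$$
The first thought could be applying Householder reflections to create such a unitary $Q$ that triangularises the matrix $A$, as in the QR factorisation case:
$$\underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{A = QR} \xrightarrow{Q^* A} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ & \times & \times & \times & \times \\ & & \times & \times & \times \\ & & & \times & \times \\ & & & & \times \end{bmatrix}}_{R = Q^* A}.$$
However, this does not work in general.

Example 7.2
Consider the following symmetric matrix $A$,
$$A = \begin{bmatrix} 34 & 47 & 5 & 18 & 26 \\ 47 & 10 & 13 & 26 & 34 \\ 5 & 13 & 26 & 39 & 47 \\ 18 & 26 & 39 & 42 & 5 \\ 26 & 34 & 47 & 5 & 18 \end{bmatrix}.$$
It has a QR factorisation $A = QR$, where
$$Q = \begin{bmatrix} -0.51315 & 0.50931 & 0.68177 & -0.097865 & -0.053739 \\ -0.70936 & -0.69517 & -0.032877 & -0.097865 & -0.053739 \\ -0.075464 & 0.22987 & -0.36671 & -0.59133 & -0.67625 \\ -0.27167 & 0.29808 & -0.42781 & -0.39282 & 0.70711 \\ -0.39241 & 0.34006 & -0.46541 & 0.69056 & -0.19208 \end{bmatrix},$$
and
$$R = \begin{bmatrix} -66.2571 & -52.5982 & -42.7879 & -43.9953 & -49.4287 \\ 0 & 39.2866 & 27.0942 & 14.2779 & 8.0216 \\ 0 & 0 & -45.1121 & -23.1797 & -11.1436 \\ 0 & 0 & 0 & -40.4136 & -23.1985 \\ 0 & 0 & 0 & 0 & -34.9301 \end{bmatrix}.$$
However, the resulting unitary similarity transformation defined by $Q$ is
$$Q^* A Q = \begin{bmatrix} -14.9985 & 91.9334 & 68.0421 & 2.957 & 7.1403 \\ 36.6527 & -30.0301 & -14.7725 & 0.18308 & 9.0688 \\ -27.8888 & 4.3505 & 37.7858 & 20.5252 & 7.1294 \\ 5.2017 & 5.2017 & 39.5858 & -0.52871 & -23.4519 \\ 1.8771 & 1.8771 & 23.6216 & -24.6996 & 6.7092 \end{bmatrix},$$
which clearly does not lead to a triangular matrix.

One step of the Householder reflection changes all the rows of $A$:
$$\underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{A} \xrightarrow{Q_1^* A} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \end{bmatrix}}_{Q_1^* A}.$$
Now we multiply $Q_1^* A$ with $Q_1$ to complete the unitary transformation. Since $Q_1^* A Q_1 = \left( Q_1^* (Q_1^* A)^* \right)^*$, we effectively apply the same Householder reflector to $(Q_1^* A)^*$. This changes all the rows of $(Q_1^* A)^*$, or equivalently all the columns of $Q_1^* A$, so this may destroy the zeros introduced previously.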
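$$\underbrace{\begin{bmatrix} \times & 0 & 0 & 0 & 0 \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{(Q_1^* A)^*} \xrightarrow{Q_1^* (Q_1^* A)^*} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{Q_1^* (Q_1^* A)^*} \xrightarrow{(\cdot)^*} \underbrace{\begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{bmatrix}}_{Q_1^* A Q_1}$$

Example 7.3
Consider the following symmetric matrix $A$,
$$A = \begin{bmatrix} 34 & 47 & 5 & 18 & 26 \\ 47 & 10 & 13 & 26 & 34 \\ 5 & 13 & 26 & 39 & 47 \\ 18 & 26 & 39 & 42 & 5 \\ 26 & 34 & 47 & 5 & 18 \end{bmatrix}.$$
One step of Householder reflection (aiming at creating zeros below $A(1,1)$) leads to the matrix
$$Q_1 A = \begin{bmatrix} -66.2571 & -52.5982 & -42.7879 & -43.9953 & -49.4287 \\ 0 & -36.6911 & -9.4027 & -3.0631 & -1.3606 \\ 0 & 8.0329 & 23.6167 & 35.9082 & 43.2382 \\ 0 & 8.1183 & 30.4202 & 30.8695 & -8.5423 \\ 0 & 8.1709 & 34.607 & -11.0774 & -1.5612 \end{bmatrix},$$
where
$$Q_1 = I - 2\vec{u}_1 \vec{u}_1^*, \quad \vec{u}_1 = \begin{bmatrix} 0.86981 \\ 0.40776 \\ 0.043379 \\ 0.15617 \\ 0.22557 \end{bmatrix}.$$
Now, multiplying $Q_1 A$ with $Q_1^*$ on the right, we have
$$Q_1 A Q_1^* = \begin{bmatrix} 105.8884 & 28.1027 & -34.2027 & -13.0886 & -4.7856 \\ 28.1027 & -23.5167 & -8.0012 & 1.9824 & 5.9274 \\ -34.2027 & -8.0012 & 21.911 & 29.7675 & 34.3683 \\ -13.0886 & 1.9824 & 29.7675 & 28.5196 & -11.9367 \\ -4.7856 & 5.9274 & 34.3683 & -11.9367 & -2.8022 \end{bmatrix},$$
which no longer has the zeros introduced in $Q_1 A$.

7.2.1 Householder Reduction to Hessenberg Form

Instead of directly transforming a matrix $A$ to a triangular form, we can transform it to a Hessenberg form (Phase 1 of the eigenvalue solvers), and then find other ways to obtain the Schur factorisation of the Hessenberg matrix. This can be achieved by applying a Householder reflector that acts only from the second row of the matrix $A$ downwards.

Consider a square matrix $A \in \mathbb{R}^{n \times n}$ partitioned as follows:
$$A = \begin{bmatrix} A_{11} & \vec{a}_1^\top \\ \vec{b}_1 & A_2 \end{bmatrix}.$$
We want to first find a Householder reflector that transforms $\vec{b}_1$ to $-\mathrm{sign}(\vec{b}_1(1))\,\|\vec{b}_1\|\,\vec{e}_1$, which effectively creates zeros below the first entry of $\vec{b}_1$. The Householder transformation is defined by the unit vector
$$\vec{u}_1 = \frac{\vec{v}_1}{\|\vec{v}_1\|}, \quad \text{where } \vec{v}_1 = \vec{b}_1 + \mathrm{sign}(\vec{b}_1(1))\,\|\vec{b}_1\|\,\vec{e}_1,$$
that determines the reflection hyperplane. This way, we can create a unitary matrix $Q_1 \in \mathbb{R}^{n \times n}$ that leaves the first row of $A$ unchanged when applied from the left:
$$Q_1 = \begin{bmatrix} 1 & \vec{0}^\top \\ \vec{0} & U_1 \end{bmatrix},$$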
(7.2) where $U_1 = I - 2\vec{u}_1 \vec{u}_1^*$ is the Householder transformation matrix constructed with respect to $\vec{b}_1$. After multiplying $Q_1^*$ on the left of $A$, which has the form of
$$Q_1^* A = \begin{bmatrix} A_{11} & \vec{a}_1^\top \\ \pm\|\vec{b}_1\|\,\vec{e}_1 & U_1^* A_2 \end{bmatrix}, \qquad (7.3)$$
we multiply $Q_1$ on the right of $Q_1^* A$. This time, the matrix $Q_1$ leaves the first column of $Q_1^* A$ unchanged, so that we have
$$Q_1^* A Q_1 = \begin{bmatrix} A_{11} & \vec{a}_1^\top U_1 \\ \pm\|\vec{b}_1\|\,\vec{e}_1 & U_1^* A_2 U_1 \end{bmatrix}. \qquad (7.4)$$
Let $\tilde{A}_2 = U_1^* A_2 U_1$. We can then take $\vec{b}_2 = \tilde{A}_2(2{:}\mathrm{end}, 1)$, the part of the first column of $\tilde{A}_2$ below its first entry, and repeat the above process. Here the unitary matrix $Q_2$ should take the form of
$$Q_2 = \begin{bmatrix} I_2 & 0 \\ 0 & U_2 \end{bmatrix}, \qquad (7.5)$$
where $U_2 = I - 2\vec{u}_2 \vec{u}_2^*$ is the Householder transformation matrix defined by a unit vector $\vec{u}_2$. Multiplying by $Q_2^*$ on the left leaves the first two rows of $Q_1^* A Q_1$ unchanged, and multiplying by $Q_2$ on the right leaves its first two columns unchanged. This process is called Householder reduction.

Example 7.4
Consider the following symmetric matrix $A$,
$$A = \begin{bmatrix} 34 & 47 & 5 & 18 & 26 \\ 47 & 10 & 13 & 26 & 34 \\ 5 & 13 & 26 & 39 & 47 \\ 18 & 26 & 39 & 42 & 5 \\ 26 & 34 & 47 & 5 & 18 \end{bmatrix}.$$
One step of Householder reduction (using a transformation that aims at creating zeros below $A(2,1)$) leads to the matrix
$$Q_1^* A Q_1 = \begin{bmatrix} 34 & -56.8683 & 0 & 0 & 0 \\ -56.8683 & 63.585 & -42.2045 & -23.7277 & -17.8221 \\ 0 & -42.2045 & 20.561 & 26.5925 & 30.0411 \\ 0 & -23.7277 & 26.5925 & 23.1555 & -18.7528 \\ 0 & -17.8221 & 30.0411 & -18.7528 & -11.3015 \end{bmatrix}.$$
After two steps we have
$$Q_2^* Q_1^* A Q_1 Q_2 = \begin{bmatrix} 34 & -56.8683 & 0 & 0 & 0 \\ -56.8683 & 63.585 & 51.5932 & 0 & 0 \\ 0 & 51.5932 & 48.3358 & -3.7237 & 4.6293 \\ 0 & 0 & -3.7237 & 6.0401 & -32.2763 \\ 0 & 0 & 4.6293 & -32.2763 & -21.961 \end{bmatrix}.$$

7.2.2 Implementation and Computational Cost

Remark 7.5
In this section, since we are dealing with real square matrices, each Householder transformation matrix and the resulting $Q_k^* \cdots Q_1^* A Q_1 \cdots Q_k$ are real. This way, the conjugate transpose is equivalent to the transpose here.

To set all the entries below the first subdiagonal of a matrix to zero, i.e., to reach the Hessenberg form, the Householder reduction has to be applied for $n - 2$ steps. The algorithm is formulated below.
Algorithm 7.6: Householder Reduction to Hessenberg Form
Input: A matrix $A \in \mathbb{R}^{n \times n}$
Output: A Hessenberg matrix $A \in \mathbb{R}^{n \times n}$ and a sequence of vectors $\vec{u}_k$, $k = 1, \ldots, n-2$, that defines the sequence of unitary similarity transformations
1: for $k = 1, \ldots, n-2$ do
2:   $\vec{b} = A(k+1{:}n,\ k)$
3:   $\vec{v} = \vec{b} + \mathrm{sign}(\vec{b}(1))\,\|\vec{b}\|\,\vec{e}_1$
4:   $\vec{u}_k = \vec{v} / \|\vec{v}\|$
5:   $A(k+1,\ k) = -\mathrm{sign}(\vec{b}(1))\,\|\vec{b}\|$
6:   $A(k+2{:}n,\ k) = 0$
7:   $A(k+1{:}n,\ k+1{:}n) = A(k+1{:}n,\ k+1{:}n) - (2\vec{u}_k)\left( \vec{u}_k^\top A(k+1{:}n,\ k+1{:}n) \right)$
8:   $A(1{:}n,\ k+1{:}n) = A(1{:}n,\ k+1{:}n) - \left( A(1{:}n,\ k+1{:}n)\,\vec{u}_k \right)\left( 2\vec{u}_k^\top \right)$
9: end for

Remark 7.7
As in the case of applying Householder reflections for computing the QR factorisation, the matrices $Q_k$, $k = 1, \ldots, n-2$, are not formed explicitly and can be reconstructed from $\vec{u}_k$, $k = 1, \ldots, n-2$, if necessary.

At the $k$-th iteration of the above algorithm, the work required in computing the unit vector $\vec{u}_k$ is proportional to $n - k$ (Steps 2-4). Similarly, the work required in applying the Householder reflection to $\vec{b}$ is about $n - k$ flops (Steps 5 and 6).

The dominating cost lies in the last two lines inside the for loop. In Step 7, the operations $A(k+1{:}n,\ k+1{:}n) - \cdots$ and $(2\vec{u}_k)(\cdots)$ each require $(n-k)^2$ flops, whereas $\left( \vec{u}_k^\top A(k+1{:}n,\ k+1{:}n) \right)$ requires $2(n-k)^2$ flops (multiplication and addition). Thus, the work of Step 7 is about $4(n-k)^2$ flops.

Step 8 needs more work, as the operations $\cdots (2\vec{u}_k^\top)$ and $A(1{:}n,\ k+1{:}n) - \cdots$ each require $n(n-k)$ flops, whereas $\left( A(1{:}n,\ k+1{:}n)\,\vec{u}_k \right)$ requires $2n(n-k)$ flops. Thus, the work of Step 8 is about $4n(n-k)$ flops.

This way, the total work of applying the Householder reduction to transform a matrix to the Hessenberg form is about
$$W = \sum_{k=1}^{n-2} \left[ 4n(n-k) + 4(n-k)^2 + O(n-k) \right] = 4n \sum_{k=1}^{n-2} (n-k) + 4 \sum_{k=1}^{n-2} (n-k)^2 + O\left( \sum_{k=1}^{n-2} (n-k) \right) = 2n^3 + \frac{4}{3}n^3 + O(n^2) = \frac{10}{3}n^3 + O(n^2). \qquad (7.6)$$
As expected, the dominant term in the expression for the computational work is proportional to $n^3$.
We say that the computational complexity of the transformation to the Hessenberg form is cubic in the size of the square matrix, $n$.

7.2.3 The Symmetric Case: Reduction to Tridiagonal Form

If the matrix is symmetric, the above algorithm produces a tridiagonal matrix.

Theorem 7.8
The Householder reduction of a symmetric matrix produces a symmetric tridiagonal matrix.

Proof. Since $A$ is symmetric, $Q^\top A Q$ is also symmetric. A symmetric Hessenberg matrix $T$ has zero entries below the first subdiagonal (by the definition of a Hessenberg matrix) and zero entries above the first superdiagonal (by symmetry), and thus is tridiagonal.

By using the symmetry, the cost of applying the left and right Householder reflections (Steps 5-8 of Algorithm 7.6) can be further reduced. The resulting algorithm is formulated below.

Algorithm 7.9: Householder Reduction to Tridiagonal Form
Input: A symmetric matrix $A \in \mathbb{R}^{n \times n}$
Output: A tridiagonal matrix $A \in \mathbb{R}^{n \times n}$ and a sequence of vectors $\vec{u}_k$, $k = 1, \ldots, n-2$, that defines the sequence of unitary similarity transformations
1: for $k = 1, \ldots, n-2$ do
2:   $\vec{b} = A(k+1{:}n,\ k)$
3:   $\vec{v} = \vec{b} + \mathrm{sign}(\vec{b}(1))\,\|\vec{b}\|\,\vec{e}_1$
4:   $\vec{u}_k = \vec{v} / \|\vec{v}\|$
5:   $A(k+1,\ k) = -\mathrm{sign}(\vec{b}(1))\,\|\vec{b}\|$
6:   $A(k,\ k+1) = A(k+1,\ k)$
7:   $A(k+2{:}n,\ k) = 0$
8:   $A(k,\ k+2{:}n) = 0$
9:   $\vec{t} = A(k+1{:}n,\ k+1{:}n)\,\vec{u}_k$
10:  $\sigma = \vec{u}_k^* \vec{t}$
11:  $\vec{p} = 2(\vec{t} - \sigma\vec{u}_k)$
12:  $A(k+1{:}n,\ k+1{:}n) = A(k+1{:}n,\ k+1{:}n) - \vec{p}\,\vec{u}_k^* - \vec{u}_k\,\vec{p}^*$
13: end for

At iteration $k$, the matrix $Q_{k-1}^* \cdots Q_1^* A Q_1 \cdots Q_{k-1}$ is symmetric and is tridiagonal in its first $k-1$ rows and columns. This way, the left and right multiplication with $Q_k$ effectively creates zeros below $A(k+1, k)$ and to the right of $A(k, k+1)$, and then multiplies $U_k$ on the left and right of the submatrix $A(k+1{:}n,\ k+1{:}n)$.
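The reduction can be tried out with a direct NumPy transcription of Algorithm 7.6. This is a minimal sketch rather than a tuned implementation: the function name `hessenberg_reduce` is an assumption, a zero column is simply skipped, and the symmetric savings of Algorithm 7.9 are deliberately not used. Applied to the symmetric matrix of Example 7.4, it should return a numerically tridiagonal matrix, in line with Theorem 7.8.

```python
import numpy as np


def hessenberg_reduce(A):
    """Sketch of Algorithm 7.6: returns the Hessenberg matrix and the vectors u_k."""
    H = np.array(A, dtype=float)
    n = H.shape[0]
    us = []
    for k in range(n - 2):                       # 0-based version of k = 1, ..., n-2
        b = H[k + 1:, k].copy()                  # Step 2
        norm_b = np.linalg.norm(b)
        if norm_b == 0.0:                        # column already reduced: U_k = I
            us.append(np.zeros(n - k - 1))
            continue
        s = 1.0 if b[0] >= 0 else -1.0           # sign(b(1)), with sign(0) taken as +1
        v = b.copy()
        v[0] += s * norm_b                       # Step 3
        u = v / np.linalg.norm(v)                # Step 4
        H[k + 1, k] = -s * norm_b                # Step 5
        H[k + 2:, k] = 0.0                       # Step 6
        # Step 7: left multiplication by U_k on rows k+1..n
        H[k + 1:, k + 1:] -= 2.0 * np.outer(u, u @ H[k + 1:, k + 1:])
        # Step 8: right multiplication by U_k on columns k+1..n
        H[:, k + 1:] -= 2.0 * np.outer(H[:, k + 1:] @ u, u)
        us.append(u)
    return H, us


# Symmetric matrix from Example 7.4.
A = np.array([[34, 47, 5, 18, 26],
              [47, 10, 13, 26, 34],
              [5, 13, 26, 39, 47],
              [18, 26, 39, 42, 5],
              [26, 34, 47, 5, 18]], dtype=float)

H, us = hessenberg_reduce(A)
print(np.round(H, 4))
```

Because only similarity transformations are applied, the eigenvalues of `H` agree with those of `A` up to rounding. The entries above the first superdiagonal are not set to zero explicitly in this sketch, so they come out as rounding-level values rather than exact zeros.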
The key to reducing the computational cost is to reformulate the following operation:
$$U_k^* A(k+1{:}n,\ k+1{:}n)\, U_k = A(k+1{:}n,\ k+1{:}n) + 4\vec{u}_k \left( \vec{u}_k^* A(k+1{:}n,\ k+1{:}n)\, \vec{u}_k \right) \vec{u}_k^* - 2\vec{u}_k \left( \vec{u}_k^* A(k+1{:}n,\ k+1{:}n) \right) - 2\left( A(k+1{:}n,\ k+1{:}n)\, \vec{u}_k \right) \vec{u}_k^*$$
by introducing
$$\vec{t} = A(k+1{:}n,\ k+1{:}n)\, \vec{u}_k, \qquad (7.7)$$
$$\sigma = \vec{u}_k^* \vec{t}, \qquad (7.8)$$
$$\vec{p} = 2(\vec{t} - \sigma\vec{u}_k). \qquad (7.9)$$
This way, using the symmetry of $A(k+1{:}n,\ k+1{:}n)$, we can rewrite $U_k^* A(k+1{:}n,\ k+1{:}n)\, U_k$ as a rank-2 update in the form of
$$U_k^* A(k+1{:}n,\ k+1{:}n)\, U_k = A(k+1{:}n,\ k+1{:}n) - \vec{p}\,\vec{u}_k^* - \vec{u}_k\,\vec{p}^*.$$
Indeed, $\vec{p}\,\vec{u}_k^* + \vec{u}_k\,\vec{p}^* = 2\vec{t}\,\vec{u}_k^* + 2\vec{u}_k\,\vec{t}^* - 4\sigma\,\vec{u}_k\vec{u}_k^*$, which matches the three correction terms above. Since we only need to store and operate on half of the entries of a symmetric matrix, the work of the above operation is about $2(n-k)^2$ flops, together with the $2(n-k)^2$ flops required for computing $\vec{t}$. The dominating work in each iteration is about $4(n-k)^2$ flops, which brings the total work estimate to $\sim \frac{4}{3}n^3$.

Remark 7.10
Algorithm 7.9 is provided as background information for interested readers. The key message here is that the symmetry can reduce the total work load to $\sim \frac{4}{3}n^3$ by 1) avoiding unnecessary operations on zeros and 2) only operating on either the lower or the upper triangular part of the matrix.

7.2.4 QR Factorisation of Hessenberg Matrices

Hessenberg and tridiagonal matrices provide substantial computational advantages in computing matrix factorisations such as LU and QR compared with applying such factorisations to general square matrices. Here we give examples on the QR factorisation to demonstrate the reduction in operation counts obtained by using Hessenberg and tridiagonal matrices.

Recall that the QR factorisation transforms a matrix $A \in \mathbb{R}^{n \times n}$ into the product of an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ and an upper triangular matrix $R \in \mathbb{R}^{n \times n}$. The Householder reflection finds a sequence of $Q_1, Q_2, \ldots$ and hence the $Q = Q_1 Q_2 \cdots$ to achieve this.

Given a Hessenberg matrix $H \in \mathbb{R}^{n \times n}$, we can partition $H$ as
$$H = \begin{bmatrix} H_{11} & H_{12} & H_{13} & \times & \cdots & \times \\ H_{21} & H_{22} & H_{23} & \times & \times & \times \\ & H_{32} & \times & \times & \times & \times \\ & & \times & \times & \times & \times \\ & & & \times & \times & \times \\ & & & & \times & \times \end{bmatrix} = \begin{bmatrix} \vec{h}_1 & \vec{a}_1^\top \\ \vec{0}_1 & H_2 \end{bmatrix},$$
(7.10)

where $\vec h_1 = H(1{:}2, 1) \in \mathbb{R}^2$, $\vec a_1^\top = H(1, 2{:}\mathrm{end}) \in \mathbb{R}^{n-1}$, $\vec 0_1 \in \mathbb{R}^{n-2}$ and $H_2 = H(2{:}\mathrm{end}, 2{:}\mathrm{end}) \in \mathbb{R}^{(n-1)\times(n-1)}$. Note that $H_2$ is also a Hessenberg matrix.

Applying the first step of the Householder procedure, we aim to find $Q_1$ that creates zeros below the first row of H(:,1). We need to have
$$Q_1 H(:,1) = Q_1 \begin{bmatrix} H_{11} \\ H_{21} \\ \vec 0_1 \end{bmatrix} = \begin{bmatrix} \pm\|\vec h_1\| \\ 0 \\ \vec 0_1 \end{bmatrix}.$$
Fortunately, the first column of $H$ is already zero below the second row. Thus we only need to apply a 2-dimensional Householder reflection to $\vec h_1 \in \mathbb{R}^2$. That is, we want to find a 2-by-2 orthogonal matrix $U_1$ such that
$$U_1 \vec h_1 = U_1 \begin{bmatrix} H_{11} \\ H_{21} \end{bmatrix} = \begin{bmatrix} \pm\|\vec h_1\| \\ 0 \end{bmatrix}.$$
Using the procedure introduced for Householder reflections, we have
$$\vec t = \vec h_1 - U_1 \vec h_1 = \vec h_1 + \mathrm{sign}(\vec h_1(1))\, \|\vec h_1\|\, \vec e_1, \qquad (7.11)$$
$$\vec s = \vec t/\|\vec t\|, \qquad (7.12)$$
$$U_1 = I_2 - 2\vec s\, \vec s^{\,*}. \qquad (7.13)$$
All the above operations are carried out in a 2-dimensional space. Then the matrix $Q_1$ takes the form of
$$Q_1 = \begin{bmatrix} U_1 & 0 \\ 0 & I \end{bmatrix},$$
where $I$ is the identity matrix of dimension $n-2$. This way, the first full Householder transformation can be written as
$$Q_1 H = \begin{bmatrix} r_{11} & \vec r_1^\top \\ \vec 0 & \tilde H_2 \end{bmatrix}. \qquad (7.14)$$
Note that only the first two rows of the matrix $H$ are modified by $Q_1$. The resulting $\tilde H_2$ is also a Hessenberg matrix. In fact, $\tilde H_2$ in (7.14) and $H_2$ in (7.10) only differ in the first row. We can then repeat this operation for $n-1$ steps, as in the QR factorisation. At each step $k$, the dimension of the Hessenberg matrix to be transformed is $n-k+1$, thus the amount of work required in applying $Q_k$ is $\sim 7(n-k+1) + O(1)$ flops. Overall, applying $n-1$ steps of Householder transformations to an $n$-by-$n$ Hessenberg matrix requires $\sim \frac{7}{2}n^2$ flops. We say that the computational complexity of the QR factorisation of a Hessenberg matrix is quadratic in the size of the matrix.

If the matrix $H$ is tridiagonal, the 2-by-2 reflection in $Q_k$ only needs to be applied to three columns in each iteration, as shown below:
$$Q_1 H = Q_1 \begin{bmatrix} \times & \times & & & \\ \times & \times & \times & & \\ & \times & \times & \times & \\ & & \ddots & \ddots & \ddots \\ & & & \times & \times \end{bmatrix} = \begin{bmatrix} r_{11} & \vec r_1^\top \\ \vec 0 & \tilde H_2 \end{bmatrix}.$$
(7.15)

Therefore, the work of applying $n-1$ steps of Householder transformations to an $n$-by-$n$ tridiagonal matrix is linearly proportional to the size of the matrix, $n$. We say that the computational complexity of the QR factorisation of a tridiagonal matrix is linear in the size of the matrix.

7.3 QR algorithm without shifts

The QR algorithm, which iteratively carries out the QR factorisation at its core, is one of the most celebrated algorithms in scientific computing. Here we show its simplest form and look into several fundamental aspects of this algorithm.

Algorithm 7.11: QR Algorithm Without Shifts
Input: Matrix $A \in \mathbb{R}^{n\times n}$.
Output: A unitary matrix $Q^{(k)}$ and a matrix $A^{(k)}$
1: $A^{(0)} = A$
2: $Q^{(0)} = I$
3: for $k = 1, 2, \ldots$ do
4:   $U^{(k)} R^{(k)} = A^{(k-1)}$   ▷ Apply the QR factorisation to $A^{(k-1)}$
5:   $A^{(k)} = R^{(k)} U^{(k)}$   ▷ Recombine factors in reverse order
6:   $Q^{(k)} = Q^{(k-1)} U^{(k)}$
7: end for

At the core, all we do is compute the QR factorisation, multiply $R$ and $U$ in the reverse order $RU$, and repeat. Using the identity $R^{(k)} = (U^{(k)})^* A^{(k-1)}$, it can be shown that this algorithm applies a sequence of unitary similarity transformations to the input matrix $A$, in the form of
$$A^{(k)} = (U^{(k)})^* A^{(k-1)} U^{(k)} = \underbrace{(U^{(k)})^* (U^{(k-1)})^* \cdots (U^{(1)})^*}_{(Q^{(k)})^*}\; A\; \underbrace{U^{(1)} \cdots U^{(k-1)} U^{(k)}}_{Q^{(k)}}. \qquad (7.16)$$
Under certain assumptions, this simple algorithm converges to the Schur factorisation. That is, $A^{(k)}$ converges to an upper triangular matrix if $A$ is arbitrary, and to a diagonal matrix if $A$ is symmetric.

7.3.1 Connection with Simultaneous Iteration

One way to understand the QR algorithm is to relate it to the power iteration. Here we consider applying the power iteration to several vectors simultaneously. This is also often referred to as block power iteration. Given a set of orthonormal initial vectors $\{\vec p_1, \ldots,$
$\vec p_s\}$, we apply the power iteration to this set of vectors, collected as the columns of a matrix $P$ (such that P(:,j) $= \vec p_j$), and normalise the new set of vectors $A P^{(k-1)}$ in each iteration using the QR factorisation. This leads to the following algorithm.

Algorithm 7.12: Simultaneous Iteration
Input: Matrix $A \in \mathbb{R}^{n\times n}$ and a set of orthonormal initial vectors $P^{(0)}$.
Output: A matrix $P^{(k)}$
1: $P^{(0)} = I$
2: for $k = 1, 2, \ldots$ do
3:   $Z^{(k)} = A P^{(k-1)}$   ▷ Apply the matrix $A$
4:   $P^{(k)} T^{(k)} = Z^{(k)}$   ▷ QR factorisation
5: end for

As a result of $P^{(k)} = A P^{(k-1)} (T^{(k)})^{-1}$, we have
$$A^k P^{(0)} = P^{(k)} \underbrace{T^{(k)} T^{(k-1)} \cdots T^{(1)}}_{\underline{T}^{(k)}}.$$
Using the following property of triangular matrices, we can show that the matrix $\underline{T}^{(k)} = T^{(k)} T^{(k-1)} \cdots T^{(1)}$ is upper triangular. Therefore the simultaneous iteration effectively computes (in exact arithmetic) the QR factorisation of $A^k P^{(0)}$.

Remark 7.13: Properties of Triangular Matrices
The product of two upper triangular matrices is upper triangular, and the inverse of an upper triangular matrix is upper triangular.

Theorem 7.14 Given the initial matrix $P^{(0)} = I$, the simultaneous iteration is equivalent to the QR algorithm without shifts.

Proof. This can be shown by induction. Throughout the proof, we assume that the upper triangular matrices of the QR factorisations used by both the QR algorithm and the simultaneous iteration have positive diagonal entries. We carry out the first step of the QR algorithm without shifts and of the simultaneous iteration. This leads to

QR algorithm:
$$A^{(0)} = A, \quad Q^{(0)} = I, \quad U^{(1)} R^{(1)} = A^{(0)} = A, \qquad (7.17)$$
$$Q^{(1)} = Q^{(0)} U^{(1)} = U^{(1)}, \qquad (7.18)$$
$$A^{(1)} = R^{(1)} U^{(1)} = (Q^{(1)})^* A\, Q^{(1)}, \qquad (7.19)$$
Simultaneous iteration:
$$P^{(0)} = I, \quad Z^{(1)} = A P^{(0)} = A, \qquad (7.20)$$
$$P^{(1)} T^{(1)} = Z^{(1)} = A. \qquad (7.21)$$
After the first iteration, we can verify that $Q^{(1)} = P^{(1)}$ and $R^{(1)} = T^{(1)}$, and thus these two algorithms are equivalent after the first iteration.
In the second iteration, these two algorithms are carried forward as follows:

QR algorithm:
$$U^{(2)} R^{(2)} = A^{(1)} = (U^{(1)})^* A\, U^{(1)}, \qquad (7.22)$$
$$Q^{(2)} = Q^{(1)} U^{(2)} = U^{(1)} U^{(2)}, \qquad (7.23)$$
$$A^{(2)} = R^{(2)} U^{(2)} = (Q^{(2)})^* A\, Q^{(2)}, \qquad (7.24)$$
Simultaneous iteration:
$$Z^{(2)} = A P^{(1)} = A U^{(1)}, \qquad (7.25)$$
$$P^{(2)} T^{(2)} = Z^{(2)} = A U^{(1)}. \qquad (7.26)$$
Since $A^{(1)} = (Q^{(1)})^* A\, Q^{(1)}$, we have $A = Q^{(1)} A^{(1)} (Q^{(1)})^*$, and hence Equation (7.26) can be written as
$$P^{(2)} T^{(2)} = Q^{(1)} A^{(1)}.$$
Multiplying both sides of the above equation by $(Q^{(1)})^*$ leads to
$$\left((Q^{(1)})^* P^{(2)}\right) T^{(2)} = A^{(1)}.$$
From the QR algorithm, we have that $U^{(2)} R^{(2)} = A^{(1)}$. This leads to $T^{(2)} = R^{(2)}$ and $(Q^{(1)})^* P^{(2)} = U^{(2)}$, and hence $P^{(2)} = Q^{(1)} U^{(2)} = Q^{(2)}$. Thus, these two algorithms are equivalent after two iterations.

Suppose $P^{(k-1)} = Q^{(k-1)}$ and $T^{(k-1)} = R^{(k-1)}$ hold. At the $k$-th iteration, these two algorithms satisfy the following:

QR algorithm:
$$U^{(k)} R^{(k)} = A^{(k-1)} = (Q^{(k-1)})^* A\, Q^{(k-1)}, \qquad (7.27)$$
$$Q^{(k)} = Q^{(k-1)} U^{(k)}, \qquad (7.28)$$
$$A^{(k)} = (Q^{(k)})^* A\, Q^{(k)}, \qquad (7.29)$$
Simultaneous iteration:
$$Z^{(k)} = A P^{(k-1)} = A Q^{(k-1)}, \qquad (7.30)$$
$$P^{(k)} T^{(k)} = Z^{(k)} = A Q^{(k-1)}. \qquad (7.31)$$
Since $A^{(k-1)} = (Q^{(k-1)})^* A\, Q^{(k-1)}$, we have $A = Q^{(k-1)} A^{(k-1)} (Q^{(k-1)})^*$, and hence Equation (7.31) can be written as
$$P^{(k)} T^{(k)} = Q^{(k-1)} A^{(k-1)}.$$
Multiplying both sides of the above equation by $(Q^{(k-1)})^*$ leads to
$$\left((Q^{(k-1)})^* P^{(k)}\right) T^{(k)} = A^{(k-1)}.$$
Thus we can show that $T^{(k)} = R^{(k)}$ and $(Q^{(k-1)})^* P^{(k)} = U^{(k)}$. The latter leads to $P^{(k)} = Q^{(k-1)} U^{(k)} = Q^{(k)}$.

The above proof employs a property of the QR factorisation:

Theorem 7.15 For any nonsingular matrix $A$, there exists a unique pair of a unitary matrix $Q$ and an upper triangular matrix $R$ with positive diagonal entries such that $A = QR$.

Remark The product of two upper triangular matrices with positive diagonal entries is also an upper triangular matrix with positive diagonal entries. The inverse of an upper triangular matrix with positive diagonal entries is also an upper triangular matrix with positive diagonal entries.
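The equivalence stated in Theorem 7.14 can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `qr_positive`, `qr_algorithm` and `simultaneous_iteration` are hypothetical. Because `numpy.linalg.qr` does not guarantee a positive diagonal in the triangular factor, the helper rescales the factors to enforce the normalisation assumed in the proof.

```python
import numpy as np


def qr_positive(M):
    """QR factorisation with the diagonal of R made positive (M nonsingular)."""
    Q, R = np.linalg.qr(M)
    d = np.sign(np.diag(R))
    d[d == 0] = 1.0
    return Q * d, d[:, None] * R          # (Q D)(D R) = Q R since D^2 = I


def qr_algorithm(A, steps):
    """Unshifted QR algorithm; returns A^(k), Q^(k) and R^(k) ... R^(1)."""
    Ak = np.array(A, dtype=float)
    n = Ak.shape[0]
    Q = np.eye(n)
    T_prod = np.eye(n)
    for step in range(steps):
        U, R = qr_positive(Ak)
        Ak = R @ U                        # recombine factors in reverse order
        Q = Q @ U
        T_prod = R @ T_prod               # accumulate R^(k) ... R^(1)
    return Ak, Q, T_prod


def simultaneous_iteration(A, steps):
    """Simultaneous iteration started from P^(0) = I; returns P^(k)."""
    A = np.array(A, dtype=float)
    P = np.eye(A.shape[0])
    for step in range(steps):
        P, T = qr_positive(A @ P)
    return P
```

For a well-conditioned test matrix and a small number of steps, `qr_algorithm` and `simultaneous_iteration` should return the same orthogonal factor up to rounding, and the product of that factor with the accumulated triangular matrix should reproduce the corresponding power of $A$.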
Remark 7.16 At this point, we are able to show that the sequence of unitary similarity transformations in the QR algorithm,
$$A^{(k)} = (Q^{(k)})^* A\, Q^{(k)}, \qquad (7.32)$$
$$Q^{(k)} = U^{(1)} \cdots U^{(k-1)} U^{(k)}, \qquad (7.33)$$
can be defined by the QR factorisation of $A^k$ in the form of
$$A^k = Q^{(k)} \underline{T}^{(k)}, \qquad (7.34)$$
$$\underline{T}^{(k)} = R^{(k)} R^{(k-1)} \cdots R^{(1)}. \qquad (7.35)$$
This relation is the key to understanding the QR algorithm and to analysing its convergence.

7.3.2 Convergence to Schur Form

The remaining question is why the sequence of transformations $A^{(k)} = (Q^{(k)})^* A\, Q^{(k)}$ is able to construct a Schur form. This is not very surprising: if the sequence of orthogonal matrices $Q^{(k)} = U^{(1)} \cdots U^{(k-1)} U^{(k)}$ converges, then $Q^{(k+1)} = Q^{(k)} U^{(k+1)}$ should be arbitrarily close to $Q^{(k)}$ for sufficiently large $k$. This way, $U^{(k)}$ approaches $I$ for sufficiently large $k$. Recall that in each iteration of the QR algorithm we have the QR factorisation $U^{(k)} R^{(k)} = A^{(k-1)}$, so $A^{(k-1)}$ is upper triangular if $U^{(k)} = I$. We formalise this intuition below.

Theorem 7.17 Let $A \in \mathbb{R}^{n\times n}$ be a real matrix with distinct eigenvalues that are all greater than zero, $\lambda_1 > \lambda_2 > \cdots > \lambda_n > 0$. Suppose $A$ has the eigendecomposition $A = V \Lambda V^{-1}$ and the matrix $V$ has the QR factorisation $V = QR$, where $R$ is upper triangular with positive diagonal entries. Then $Q^{(k)}$ converges to the orthogonal factor $Q$ of the QR factorisation of $V$ as
$$\|Q^{(k)} D - Q\| = O(\sigma^k),$$
for some diagonal matrix $D$ such that $D_{ii} = \pm 1$, where $\sigma < 1$ is the constant
$$\sigma = \max\left\{ \left|\frac{\lambda_2}{\lambda_1}\right|, \cdots, \left|\frac{\lambda_n}{\lambda_{n-1}}\right| \right\}.$$

Proof. Given the eigendecomposition $A = V \Lambda V^{-1}$, we have that $A^k = V \Lambda^k V^{-1}$. After $k$ steps of simultaneous iteration, $A^k$ has the QR factorisation $A^k = Q^{(k)} \underline{T}^{(k)}$. Thus the following relation holds:
$$V \Lambda^k V^{-1} = Q^{(k)} \underline{T}^{(k)}.$$
Considering the LU factorisation of $V^{-1}$ and substituting $V^{-1} = LU$ into the above equation leads to
$$V \Lambda^k L U = Q^{(k)} \underline{T}^{(k)},$$
and then, by multiplying both sides of the equation by $U^{-1} \Lambda^{-k}$ on the right, we have
$$V \Lambda^k L \Lambda^{-k} = Q^{(k)} \underline{T}^{(k)} U^{-1} \Lambda^{-k}.$$
(7.36)

Without loss of generality, we can assume that the diagonal entries of the matrix $L$ take the values $\pm 1$ and that the diagonal entries of the matrix $U$ are positive. The entries of $\Lambda^k L \Lambda^{-k}$ are
$$\left(\Lambda^k L \Lambda^{-k}\right)_{ij} = \begin{cases} \pm 1, & i = j, \\ 0, & i < j, \\ L_{ij} \left(\dfrac{\lambda_i}{\lambda_j}\right)^k, & i > j. \end{cases}$$
Thus, $\Lambda^k L \Lambda^{-k}$ converges to a diagonal matrix $D$ with $D_{ii} = L_{ii}$, as $\left(\frac{\lambda_i}{\lambda_j}\right)^k \to 0$ for $i > j$. Since the eigenvalues are ordered, the ratio $\frac{\lambda_i}{\lambda_j}$, $i > j$, is bounded from above by the largest ratio between consecutive eigenvalues, $\frac{\lambda_{i+1}}{\lambda_i}$. This convergence is of order $O(\sigma^k)$, where $\sigma$ is the largest ratio $\left|\frac{\lambda_i}{\lambda_j}\right|$, $i > j$, between a pair of distinct eigenvalues.

Since the left hand side of Equation (7.36) converges to $VD$ as $k \to \infty$ and $D^2 = I$, it can be expressed as
$$V = \left(Q^{(k)} D\right) \left(D\, \underline{T}^{(k)} U^{-1} \Lambda^{-k} D\right), \quad k \to \infty,$$
where $\underline{T}^{(k)} U^{-1} \Lambda^{-k}$ is upper triangular with positive diagonal entries and $D\, \underline{T}^{(k)} U^{-1} \Lambda^{-k} D$ is also upper triangular with positive diagonal entries. Thus, this determines a unique QR factorisation of $V$ as $k \to \infty$. Therefore $Q^{(k)} D$ converges to the orthogonal factor of the QR factorisation of the eigenvector matrix $V$.

The assumption that all eigenvalues of $A$ must be positive can be removed by using the absolute values of the eigenvalues, instead of the eigenvalues themselves, in constructing $\Lambda^{-k}$. We also do not have to assume that the eigenvalues are non-repeating, as we can specify orthogonal eigenvectors (basis vectors of the eigenspace) for an eigenvalue with geometric multiplicity larger than one.

7.3.3 The Role of Hessenberg Form

As discussed earlier, transforming a matrix to the Hessenberg form allows for a significant reduction in the cost of computing the QR factorisation: $O(n^2)$ for general matrices and $O(n)$ for symmetric matrices. We can use this fact to reduce the operation count in each iteration of the QR algorithm, provided that the Hessenberg form is retained in each iteration. That is, if $A^{(0)}$ is a Hessenberg matrix, then each $A^{(k)}$ is a Hessenberg matrix.

Given a Hessenberg matrix $H \in \mathbb{R}^{n\times n}$ and its QR factorisation $H = QR$, we want to verify that $RQ$ retains the Hessenberg form.
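QR

Recall that we can partition the matrix $H$ as
$$H = \begin{bmatrix} H_{11} & H_{12} & H_{13} & \times & \cdots & \times \\ H_{21} & H_{22} & H_{23} & \times & \cdots & \times \\ & H_{32} & \times & \times & \cdots & \times \\ & & \times & \times & \cdots & \times \\ & & & \ddots & \ddots & \vdots \\ & & & & \times & \times \end{bmatrix} = \begin{bmatrix} \vec h_1 & \vec a_1^\top \\ \vec 0_1 & H_2 \end{bmatrix},$$
where $\vec h_1 = H(1{:}2, 1) \in \mathbb{R}^2$, $\vec a_1^\top = H(1, 2{:}\mathrm{end}) \in \mathbb{R}^{n-1}$, $\vec 0_1 \in \mathbb{R}^{n-2}$ and $H_2 = H(2{:}\mathrm{end}, 2{:}\mathrm{end}) \in \mathbb{R}^{(n-1)\times(n-1)}$. Note that $H_2$ is also a Hessenberg matrix. To create zeros below H(1,1), we need to find a Householder matrix $Q_1$ such that
$$Q_1 H(:,1) = Q_1 \begin{bmatrix} H_{11} \\ H_{21} \\ \vec 0_1 \end{bmatrix} = \begin{bmatrix} \pm\|\vec h_1\| \\ 0 \\ \vec 0_1 \end{bmatrix}.$$
Effectively, we only need to apply a 2-dimensional Householder reflection matrix $U_1$ to $\vec h_1 \in \mathbb{R}^2$ such that
$$U_1 \vec h_1 = U_1 \begin{bmatrix} H_{11} \\ H_{21} \end{bmatrix} = \begin{bmatrix} \pm\|\vec h_1\| \\ 0 \end{bmatrix}.$$
Then the matrix $Q_1$ takes the form of
$$Q_1 = \begin{bmatrix} U_1 & 0 \\ 0 & I \end{bmatrix}.$$
Only the top two rows of the matrix $H$ are modified by $Q_1 H$.

Every iteration of the QR factorisation picks the $k$-th column of the matrix and aims to create zeros below the $(k,k)$ entry of the matrix $H_{k-1} = Q_{k-1} \cdots Q_1 H$ (the transformed matrix from the previous iteration), as shown schematically below (with six rows and $k = 3$, so that rows $k$ and $k+1$ are the third and fourth rows):
$$H_{k-1} = \begin{bmatrix} \times & \times & \times & \times & \times & \times \\ & \times & \times & \times & \times & \times \\ & & \times & \times & \times & \times \\ & & \times & \times & \times & \times \\ & & & \times & \times & \times \\ & & & & \times & \times \end{bmatrix}.$$
This can be achieved by constructing a Householder matrix $U_k$ with respect to the vector $H_{k-1}(k{:}k+1, k)$ (since all the entries of $H_{k-1}$ below $(k+1, k)$ in column $k$ are zero). This leads to a Householder matrix that can be applied to the whole matrix,
$$Q_k = \begin{bmatrix} I_1 & & \\ & U_k & \\ & & I_2 \end{bmatrix}, \qquad U_k = \begin{bmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{bmatrix},$$
where $I_1$ is a $(k-1)$-dimensional identity matrix and $I_2$ is an $(n-k-1)$-dimensional identity matrix. The multiplication $Q_k H_{k-1}$ only modifies rows $k$ and $k+1$ and sets the $(k+1, k)$ entry to zero; in the same schematic form,
$$Q_k H_{k-1} = \begin{bmatrix} \times & \times & \times & \times & \times & \times \\ & \times & \times & \times & \times & \times \\ & & \times & \times & \times & \times \\ & & 0 & \times & \times & \times \\ & & & \times & \times & \times \\ & & & & \times & \times \end{bmatrix}.$$
This way, we have the QR factorisation of the matrix $H$ defined as
$$R = \underbrace{Q_{n-1} \cdots Q_1}_{Q^*} H.$$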
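Note that we leave $Q_n$ out of the standard Householder reflection process, as it only flips the sign of the bottom right entry of the matrix $H_{n-1}$.

RQ

Using this identity, we can express the matrix $RQ$ as
$$RQ = R\, Q_1 Q_2 \cdots Q_{n-1}.$$
Note that we drop the $(\cdot)^*$ here, as each Householder reflection matrix $Q_j$ is symmetric. Denote $R_k = R\, Q_1 Q_2 \cdots Q_k$ and set $R_0 = R$. In each multiplication $R_k Q_{k+1}$, only the two columns $R_k(:, k+1{:}k+2)$ are modified by the matrix $U_{k+1}$. This is summarised in the following equations.

In the first step, the entries of $R_0$ below row 2 in columns 1:2 are zero, as $R$ is upper triangular. The resulting matrix $R_1$ has a Hessenberg form with an upper triangular submatrix $R_1(2{:}n, 2{:}n)$, as shown in Equation (7.37) (sketched with five rows):
$$R_1 = R_0 Q_1 = \begin{bmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ & & \times & \times & \times \\ & & & \times & \times \\ & & & & \times \end{bmatrix} \begin{bmatrix} U_1 & \\ & I \end{bmatrix} = \begin{bmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ & & \times & \times & \times \\ & & & \times & \times \\ & & & & \times \end{bmatrix}. \qquad (7.37)$$
If the matrix $R_{k-1}$ has a Hessenberg form and the submatrix $R_{k-1}(k{:}n, k{:}n)$ is upper triangular, multiplying with $Q_k$ will produce a Hessenberg matrix $R_k$ with an upper triangular submatrix $R_k(k+1{:}n, k+1{:}n)$: only the two columns $R_{k-1}(:, k{:}k+1)$ are modified by the matrix $U_k$ in this step, and the entries $R_k(k+2{:}n, k{:}k+1)$ remain zero, so $R_k$ has a Hessenberg form. Schematically (with six rows and $k = 3$, so that columns $k$ and $k+1$ are the third and fourth columns),
$$R_k = R_{k-1} Q_k = \begin{bmatrix} \times & \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times & \times \\ & \times & \times & \times & \times & \times \\ & & 0 & \times & \times & \times \\ & & & & \times & \times \\ & & & & & \times \end{bmatrix} \begin{bmatrix} I_1 & & \\ & U_k & \\ & & I_2 \end{bmatrix} = \begin{bmatrix} \times & \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times & \times \\ & \times & \times & \times & \times & \times \\ & & \times & \times & \times & \times \\ & & & & \times & \times \\ & & & & & \times \end{bmatrix}. \qquad (7.38)$$
From this process we can also conclude that computing $QR = H$ and combining the factors in the reverse order $RQ$ have the same total work. In $QR = H$, the number of flops is $O(n-k)$ in iteration $k$, and hence a total of $O(n^2)$ flops is required for computing $QR = H$. Therefore, each step of the QR algorithm applied to a Hessenberg matrix requires $O(n^2)$ operations.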
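One step of the QR algorithm on a Hessenberg matrix, as described above, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the names `reflector_2x2` and `hessenberg_qr_step` are hypothetical, the 2-by-2 reflections are stored explicitly, and 0-based indexing is used, so the reflection acting on rows $k$ and $k+1$ of the notes corresponds to index `k - 1` in the code.

```python
import numpy as np


def reflector_2x2(x):
    """2-by-2 Householder matrix U with U @ x = (-sign(x[0]) * ||x||, 0)."""
    norm_x = np.hypot(x[0], x[1])
    if norm_x == 0.0:
        return np.eye(2)                  # nothing to eliminate
    s = 1.0 if x[0] >= 0.0 else -1.0
    v = np.array([x[0] + s * norm_x, x[1]])
    u = v / np.linalg.norm(v)
    return np.eye(2) - 2.0 * np.outer(u, u)


def hessenberg_qr_step(H):
    """Return R and RQ for a Hessenberg matrix H = QR."""
    R = np.array(H, dtype=float)
    n = R.shape[0]
    reflectors = []
    # QR: each reflection only touches two rows.
    for k in range(n - 1):
        U = reflector_2x2(R[k:k + 2, k])
        R[k:k + 2, k:] = U @ R[k:k + 2, k:]
        R[k + 1, k] = 0.0                 # remove rounding residue
        reflectors.append(U)
    # RQ: each (symmetric) reflection only touches two columns.
    A_next = R.copy()
    for k, U in enumerate(reflectors):
        A_next[:k + 2, k:k + 2] = A_next[:k + 2, k:k + 2] @ U
    return R, A_next
```

Each of the two loops performs $O(n-k)$ work in iteration $k$, so a full step costs $O(n^2)$ flops, and the returned matrix `A_next` keeps exact zeros below the first subdiagonal.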
7.4 Shifted QR algorithm

The QR algorithm without shifts is able to iteratively reduce a matrix to its Schur factorisation. Using its equivalence with the simultaneous iteration, we can show its convergence property: the two algorithms are equally slow. Like the Rayleigh quotient iteration, this algorithm can be modified to incorporate shifted inverse iteration and eigenvalue estimates. The new algorithm is outlined as follows:

Algorithm 7.18: Shifted QR Algorithm
Input: Matrix $A \in \mathbb{R}^{n\times n}$.
Output: A unitary matrix $Q^{(k)}$ and a matrix $A^{(k)}$
1: $A^{(0)} = (Q^{(0)})^* A\, Q^{(0)}$   ▷ Transform $A$ to Hessenberg form
2: for $k = 1, 2, \ldots$ do
3:   Pick a shift $\mu^{(k)}$   ▷ E.g., $\mu^{(k)} = A^{(k-1)}(n, n)$
4:   $U^{(k)} R^{(k)} = A^{(k-1)} - \mu^{(k)} I$   ▷ QR factorisation of $A^{(k-1)} - \mu^{(k)} I$
5:   $A^{(k)} = R^{(k)} U^{(k)} + \mu^{(k)} I$   ▷ Recombine factors in reverse order
6:   if any off-diagonal entry $A^{(k)}(j+1, j)$ is sufficiently close to 0 then
7:     Set $A^{(k)}(j+1, j) = 0$, partition $A^{(k)}$ as $A^{(k)} = \begin{bmatrix} A_1 & A_3 \\ 0 & A_2 \end{bmatrix}$, and apply the same QR algorithm to $A_1$ and $A_2$ separately.
8:   end if
9:   $Q^{(k)} = Q^{(k-1)} U^{(k)}$
10: end for

Here Line 3 picks the shift value, Lines 4 and 5 perform one step of inverse iteration, and Lines 6-8 perform an operation called deflation. These steps will be explained in the rest of this section. To keep the concept simple, we assume that the matrix $A \in \mathbb{R}^{n\times n}$ is symmetric (and tridiagonal) and invertible in the rest of this section. We will also only focus on the eigenvalues. The material in this section is based on [Trefethen and Bau III, 1997].

7.4.1 Connection with Inverse Iteration

To understand this algorithm, we will first find its connection with the power iteration applied to the inverse matrix $A^{-1}$, that is, inverse iteration without shift.

Recall from the last section that, as the result of the QR algorithm without shifts, we have $A^k = Q^{(k)} \underline{T}^{(k)}$, where $\underline{T}^{(k)} = R^{(k)} R^{(k-1)} \cdots R^{(1)}$. Inverting the above equation and taking the transpose, we have
$$\left(A^{-k}\right)^\top = \left( \left(\underline{T}^{(k)}\right)^{-1} \left(Q^{(k)}\right)^\top \right)^\top.$$
Using the fact that $A$ is symmetric, this leads to
$$A^{-k} = Q^{(k)} \left(\underline{T}^{(k)}\right)^{-\top}, \qquad (7.39)$$
where the term $\left(\underline{T}^{(k)}\right)^{-\top}$ is lower triangular. Consider the permutation matrix $P$ that reverses the row or column order, that is, the matrix with ones on the anti-diagonal and zeros elsewhere:
$$P_{ij} = \begin{cases} 1, & i + j = n + 1, \\ 0, & \text{otherwise}. \end{cases}$$

Remark 7.19 Multiplying a matrix by $P$ on the right reverses the order of its columns, and multiplying a matrix by $P$ on the left reverses the order of its rows. This takes the form of
$$A = \begin{bmatrix} \vec a_1 & \vec a_2 & \cdots & \vec a_{n-1} & \vec a_n \end{bmatrix}, \qquad AP = \begin{bmatrix} \vec a_n & \vec a_{n-1} & \cdots & \vec a_2 & \vec a_1 \end{bmatrix},$$
and
$$B = \begin{bmatrix} \vec b_1^\top \\ \vec b_2^\top \\ \vdots \\ \vec b_{n-1}^\top \\ \vec b_n^\top \end{bmatrix}, \qquad PB = \begin{bmatrix} \vec b_n^\top \\ \vec b_{n-1}^\top \\ \vdots \\ \vec b_2^\top \\ \vec b_1^\top \end{bmatrix}.$$
We also have that $P$ is symmetric and $P^2 = I$, so $P$ is orthogonal.

Multiplying both sides of Equation (7.39) by the permutation matrix $P$ on the right and using $P^2 = I$, we have
$$A^{-k} P = \left(Q^{(k)} P\right) \left(P \left(\underline{T}^{(k)}\right)^{-\top} P\right). \qquad (7.40)$$
The first factor $Q^{(k)} P$ is orthogonal, and the second factor $P \left(\underline{T}^{(k)}\right)^{-\top} P$ is upper triangular (obtained by reversing the column and row orders of a lower triangular matrix). Thus, Equation (7.40) can be interpreted as the QR factorisation of $A^{-k} P$. The QR algorithm without shifts therefore also effectively carries out the simultaneous iteration on $A^{-1}$ with the initial matrix $P$. This can be expressed as
$$A^{-k} P = \underbrace{\begin{bmatrix} \vec q_n^{(k)} & \vec q_{n-1}^{(k)} & \cdots & \vec q_2^{(k)} & \vec q_1^{(k)} \end{bmatrix}}_{Q^{(k)} P}\; \underbrace{\left(P \left(\underline{T}^{(k)}\right)^{-\top} P\right)}_{\text{upper triangular}}.$$
The last column of $Q^{(k)}$ is the result of applying the inverse iteration to $\vec e_n$.

7.4.2 Connection with Shifted Inverse Iteration

The significance of the inverse iteration is that it can be shifted to amplify the difference between eigenvalues. Since the QR algorithm is both a simultaneous iteration, $A^k = Q^{(k)} \underline{T}^{(k)}$, and a simultaneous inverse iteration, $A^{-k} P = \left(Q^{(k)} P\right)\left(P (\underline{T}^{(k)})^{-\top} P\right)$, we are able to incorporate a shift into the QR algorithm by simply carrying out the QR factorisation on the shifted matrix $A - \mu I$. Let $\mu^{(k)}$ denote the shift used in the $k$-th step. One step of the shifted QR algorithm proceeds as follows:
$$U^{(k)} R^{(k)} = A^{(k-1)} - \mu^{(k)} I, \qquad (7.41)$$
$$A^{(k)} = R^{(k)} U^{(k)} + \mu^{(k)} I.$$
(7.42)

This implies
$$A^{(k)} = \left(U^{(k)}\right)^\top A^{(k-1)} U^{(k)},$$
and by induction
$$A^{(k)} = \left(Q^{(k)}\right)^\top A\, Q^{(k)}, \qquad (7.43)$$
$$Q^{(k)} = U^{(1)} U^{(2)} \cdots U^{(k)}. \qquad (7.44)$$
Note that here each pair of $U^{(k)}$ and $R^{(k)}$ is different from that of the QR algorithm without shifts. Using a proof similar to that of Theorem 7.14, we can show that the shifted QR algorithm also has the following factorisation:
$$(A - \mu^{(k)} I)(A - \mu^{(k-1)} I) \cdots (A - \mu^{(1)} I) = Q^{(k)} \underline{T}^{(k)}, \qquad (7.45)$$
$$\underline{T}^{(k)} = R^{(k)} R^{(k-1)} \cdots R^{(1)}. \qquad (7.46)$$
Using the connection between the QR algorithm and simultaneous shifted inverse iteration, we can show that
$$\prod_{j=1}^{k} \left(A - \mu^{(j)} I\right)^{-1} P = \underbrace{\begin{bmatrix} \vec q_n^{(k)} & \vec q_{n-1}^{(k)} & \cdots & \vec q_2^{(k)} & \vec q_1^{(k)} \end{bmatrix}}_{Q^{(k)} P}\; \underbrace{\left(P \left(\underline{T}^{(k)}\right)^{-\top} P\right)}_{\text{upper triangular}}.$$
$Q^{(k)}$ is the orthogonalisation of $\prod_{j=k}^{1} (A - \mu^{(j)} I)$, while $Q^{(k)} P$ is the orthogonalisation of $\prod_{j=1}^{k} (A - \mu^{(j)} I)^{-1} P$. That is, the last column of $Q^{(k)}$ is the result of applying inverse iteration (using the shifts $\mu^{(k)}$ to $\mu^{(1)}$) to the vector $\vec e_n$. Generally speaking, the last column of $Q^{(k)}$ converges quickly to an eigenvector.

7.4.3 Connection with Rayleigh Quotient Iteration

To complete the loop, we need to pick a shift value that achieves fast convergence in the last column of $Q^{(k)}$. A natural choice for the shift in step $k$ is the Rayleigh quotient of the current eigenvector estimate $\vec q_n^{(k-1)}$, the last column of $Q^{(k-1)}$:
$$\mu^{(k)} = \frac{(\vec q_n^{(k-1)})^* A\, \vec q_n^{(k-1)}}{(\vec q_n^{(k-1)})^* \vec q_n^{(k-1)}} = (\vec q_n^{(k-1)})^* A\, \vec q_n^{(k-1)},$$
as $Q^{(k-1)}$ is orthogonal. Furthermore, we have $\vec q_n^{(k-1)} = Q^{(k-1)} \vec e_n$ and, since $A = Q^{(k-1)} A^{(k-1)} \left(Q^{(k-1)}\right)^\top$, we have
$$(\vec q_n^{(k-1)})^* A\, \vec q_n^{(k-1)} = \vec e_n^\top A^{(k-1)} \vec e_n = A^{(k-1)}(n, n).$$
Thus the $(n, n)$ entry of the matrix $A^{(k-1)}$ gives an eigenvalue estimate for the last column of $Q^{(k-1)}$ without any additional work. This is usually referred to as the Rayleigh quotient shift.

7.4.4 Wilkinson Shift

The Rayleigh quotient shift does not guarantee convergence. It may stall for certain types of matrices. For example, consider the matrix
$$A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$
Applying the QR algorithm without shifts to this matrix does not converge, as
$$A = QR = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \text{and} \quad RQ = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = A.$$
The Rayleigh quotient is $A(2, 2) = 0$, and hence the Rayleigh quotient shift does not change the matrix either. The problem is that the matrix $A$ has the two eigenvalues $1$ and $-1$, and the eigenvalue estimate $0$ lies exactly between them, so it has an equal tendency towards both eigenvalues.

One particular method that can break this symmetry is called the Wilkinson shift. Instead of using the lower-rightmost entry of $A$, it uses the lower-rightmost 2-by-2 submatrix of $A$, denoted by $B$ = A(n-1:n, n-1:n). Suppose $B$ takes the form of
$$B = \begin{bmatrix} a_1 & b_1 \\ b_2 & a_2 \end{bmatrix}.$$
The Wilkinson shift is the eigenvalue of $B$ that is closer to $a_2$. If there is a tie, it picks one of the two eigenvalues arbitrarily. A numerically stable formula for the Wilkinson shift is
$$\mu = a_2 - \frac{\mathrm{sign}(\delta)\, b_1 b_2}{|\delta| + \sqrt{\delta^2 + b_1 b_2}}, \quad \text{where} \quad \delta = \frac{a_1 - a_2}{2},$$
and where $\mathrm{sign}(\delta)$ is set arbitrarily to either $1$ or $-1$ if $\delta = 0$. The Wilkinson shift provides the same rate of convergence as the Rayleigh quotient shift, cubic for symmetric matrices and quadratic for general matrices. For symmetric tridiagonal matrices, its convergence is guaranteed (in exact arithmetic).

7.4.5 Deflation

In Lines 6-8 of Algorithm 7.18, if any off-diagonal entry $A^{(k)}(j+1, j)$ is sufficiently close to 0, then we can set $A^{(k)}(j+1, j) = 0$ and partition the matrix as follows:
$$A^{(k)} = \begin{bmatrix} A_1 & A_3 \\ 0 & A_2 \end{bmatrix}.$$
This technique is called deflation. It divides the problem into sub-problems and tackles them individually. Here we briefly explain the concept for general matrices. Since $\det(A^{(k)} - \lambda I) = \det(A_1 - \lambda I) \det(A_2 - \lambda I)$, the eigenvalues of $A^{(k)}$ are those of $A_1$ together with those of $A_2$, and finding the eigenvalues (or computing the Schur form) of $A^{(k)}$ simply becomes computing the Schur forms of $A_1$ and $A_2$ separately. Suppose we have computed the Schur factorisations of $A_1$ and $A_2$ in the form of
$$A_1 = U_1 T_1 U_1^*, \qquad (7.47)$$
$$A_2 = U_2 T_2 U_2^*, \qquad (7.48)$$
respectively. We can construct an $n$-by-$n$ unitary matrix
$$U^{(k+1)} = \begin{bmatrix} U_1 & \\ & U_2 \end{bmatrix},$$
so that
$$(U^{(k+1)})^* A^{(k)} U^{(k+1)} = \begin{bmatrix} U_1^* & \\ & U_2^* \end{bmatrix} \begin{bmatrix} A_1 & A_3 \\ 0 & A_2 \end{bmatrix} \begin{bmatrix} U_1 & \\ & U_2 \end{bmatrix} = \begin{bmatrix} T_1 & \tilde A_3 \\ 0 & T_2 \end{bmatrix},$$
where $\tilde A_3 = U_1^* A_3 U_2$.
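The shift, recombination and deflation steps can be combined into a short eigenvalue routine for symmetric tridiagonal matrices. The following is a minimal sketch under stated assumptions, not the notes' algorithm verbatim: the names `wilkinson_shift` and `shifted_qr_eigenvalues`, the deflation tolerance and the iteration cap are hypothetical choices, only eigenvalues are returned, and a dense `numpy.linalg.qr` is used in place of the $O(n)$ tridiagonal QR factorisation.

```python
import numpy as np


def wilkinson_shift(B):
    """Eigenvalue of the 2-by-2 block B that is closer to B[1, 1]."""
    a1, b1, b2, a2 = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    delta = 0.5 * (a1 - a2)
    s = 1.0 if delta >= 0.0 else -1.0     # sign(delta), choosing +1 if delta = 0
    denom = abs(delta) + np.sqrt(max(delta ** 2 + b1 * b2, 0.0))
    if denom == 0.0:                      # B is a multiple of the identity
        return a2
    return a2 - s * b1 * b2 / denom


def shifted_qr_eigenvalues(A, tol=1e-12, max_iter=10_000):
    """Eigenvalues of a symmetric tridiagonal matrix by shifted QR with deflation."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return [A[0, 0]]
    for step in range(max_iter):
        # Deflation: split the problem where a subdiagonal entry is negligible.
        for j in range(n - 1):
            if abs(A[j + 1, j]) <= tol * (abs(A[j, j]) + abs(A[j + 1, j + 1])):
                return (shifted_qr_eigenvalues(A[:j + 1, :j + 1], tol, max_iter)
                        + shifted_qr_eigenvalues(A[j + 1:, j + 1:], tol, max_iter))
        mu = wilkinson_shift(A[n - 2:, n - 2:])
        U, R = np.linalg.qr(A - mu * np.eye(n))   # QR of the shifted matrix
        A = R @ U + mu * np.eye(n)                # recombine in reverse order
    raise RuntimeError("shifted QR iteration did not converge")
```

For the matrix with zero diagonal and unit off-diagonal entries discussed in Section 7.4.4, the Wilkinson shift evaluates to $-1$, so the first shifted step already produces a negligible subdiagonal entry and the routine deflates into two 1-by-1 blocks.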
Chapter 8
Singular Value Decomposition

8.1 Singular Value Decomposition

The singular value decomposition of a matrix is often referred to as the SVD. The SVD factorises a matrix $A \in \mathbb{R}^{m\times n}$ into the product of three matrices, $A = U \Sigma V^\top$, where $U$ and $V$ are orthogonal and $\Sigma$ is diagonal. Here $A$ can be any matrix, e.g., non-symmetric or rectangular.

8.1.1 Understanding SVD

A matrix $A \in \mathbb{R}^{m\times n}$ is a linear transformation taking a vector $\vec x \in \mathbb{R}^n$ in its row space (or preimage), row($A$), to a vector $\vec y = A\vec x$ in its column space (or range), col($A$). The SVD is motivated by the following geometric fact: the image of the unit sphere under any $m$-by-$n$ matrix is a hyper-ellipse.

Remark 8.1 The hyper-ellipse is a generalisation of an ellipse. In the space $\mathbb{R}^m$, a hyper-ellipse can be viewed as the surface obtained by stretching a unit sphere in $\mathbb{R}^m$ by some factors $\sigma_1, \sigma_2, \ldots, \sigma_m$ along some orthogonal directions $\vec u_1, \vec u_2, \ldots, \vec u_m$. Here each of the $\vec u_i$, $i = 1, \ldots, m$, is a unit vector. The vectors $\{\sigma_i \vec u_i\}$ are the principal semiaxes of the hyper-ellipse.

Figure 8.1.1: Geometrical interpretation of a linear transformation.

Figure 8.1.1 shows a unit sphere and the hyper-ellipse that is the image of the unit sphere transformed by a matrix $A \in \mathbb{R}^{m\times n}$. Assuming that $m > n$ and that the matrix $A$ has rank $r \leq \min(m, n)$, the three key components of the SVD can be defined as follows:

• The singular values of the matrix $A$ are the lengths of the principal semiaxes, $\sigma_1, \sigma_2, \ldots, \sigma_r$. We often assume that the singular values are non-negative and ordered as $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0$.
• The left singular vectors of $A$ are orthogonal unit vectors $\vec u_1, \vec u_2, \ldots, \vec u_r$ that are in the column space of $A$ and oriented in the directions of the principal semiaxes.
• The right singular vectors $\vec v_1, \vec v_2, \ldots, \vec v_r$ are orthogonal unit vectors in the row space of the matrix $A$ such that
$$A \vec v_i = \vec u_i \sigma_i, \quad i = 1, \ldots, r.$$
(8.1)

The relationship between the right singular vectors, left singular vectors, and singular values can be understood as follows: the first right singular vector is a unit vector $\vec v$ such that the 2-norm of the vector $A\vec v$ is maximised. This way, we have
$$\vec v_1 = \underset{\|\vec v\| = 1}{\mathrm{argmax}}\; \|A \vec v\|.$$
The corresponding first singular value is defined as $\sigma_1 = \|A \vec v_1\|$ and the first left singular vector is $\vec u_1 = \frac{A \vec v_1}{\sigma_1}$. Then, the second right singular vector is defined as the unit vector $\vec v$ that is orthogonal to $\vec v_1$ and maximises the 2-norm of the vector $A\vec v$. We have
$$\vec v_2 = \underset{\vec v^\top \vec v_1 = 0,\ \|\vec v\| = 1}{\mathrm{argmax}}\; \|A \vec v\|.$$
The corresponding second singular value is $\sigma_2 = \|A \vec v_2\|$ and the second left singular vector is $\vec u_2 = \frac{A \vec v_2}{\sigma_2}$. Repeating this process, we can define all the singular values and singular vectors.

In summary, transforming a right singular vector $\vec v_i$ using the matrix $A$ leads to the left singular vector $\vec u_i$ multiplied by $\sigma_i$. Thus, the right singular vectors and left singular vectors characterise the principal directions (in the row space and the column space) of the linear transformation defined by $A$. The singular values characterise the "stretching" effect of this linear transformation.

Remark 8.2 The right and left singular vectors also satisfy the following duality:
$$A \vec v_i = \vec u_i \sigma_i, \qquad A^\top \vec u_i = \vec v_i \sigma_i,$$
for $i = 1, \ldots, r$.

8.1.2 Full SVD and Reduced SVD

Assuming $m \geq n$, the collection of the equations (8.1) for all $i = 1, \ldots, r$ can be expressed as the matrix equation
$$A \begin{bmatrix} \vec v_1 & \vec v_2 & \cdots & \vec v_r \end{bmatrix} = \begin{bmatrix} \vec u_1 & \vec u_2 & \cdots & \vec u_r \end{bmatrix} \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_r \end{bmatrix}, \qquad (8.2)$$
or
$$A \hat V = \hat U \hat\Sigma, \qquad (8.3)$$
in matrix form. Here $\hat V \in \mathbb{R}^{n\times r}$ and $\hat U \in \mathbb{R}^{m\times r}$ are matrices with orthonormal columns, and $\hat\Sigma \in \mathbb{R}^{r\times r}$ is a diagonal matrix.

The columns of the matrices $\hat V \in \mathbb{R}^{n\times r}$ and $\hat U \in \mathbb{R}^{m\times r}$ are orthonormal vectors; however, they do not form complete bases of $\mathbb{R}^n$ and $\mathbb{R}^m$ unless $m = n = r$.
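The relations (8.1) to (8.3) and the duality of Remark 8.2 can be checked numerically. The sketch below assumes NumPy and uses `numpy.linalg.svd` with `full_matrices=False`, which returns $\min(m, n)$ singular triplets; for the full-rank random example used here this coincides with $r$. The helper name `reduced_svd` and the sampling check of the maximisation property are illustrative choices only.

```python
import numpy as np


def reduced_svd(A):
    """Return U_hat, the singular values and V_hat of the reduced SVD of A."""
    U_hat, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return U_hat, sigma, Vt.T


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))           # a full-rank example with m > n
U_hat, sigma, V_hat = reduced_svd(A)

# A V_hat = U_hat Sigma_hat, i.e. A v_i = sigma_i u_i for every column i.
residual = np.linalg.norm(A @ V_hat - U_hat * sigma)

# Duality: A^T u_i = sigma_i v_i.
dual_residual = np.linalg.norm(A.T @ U_hat - V_hat * sigma)

# sigma_1 bounds the stretch ||A v|| over (sampled) unit vectors v.
samples = rng.standard_normal((3, 1000))
samples /= np.linalg.norm(samples, axis=0)
largest_sampled_stretch = np.linalg.norm(A @ samples, axis=0).max()
```

Both residuals should be at the level of rounding error, and no sampled unit vector should be stretched by more than `sigma[0]`.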
By adding $m - r$ unit vectors that are orthogonal to the columns of $\hat U$ and adding $n - r$ unit vectors that are orthogonal to the columns of $\hat V$, we can extend the matrix $\hat U$ to an orthogonal matrix $U \in \mathbb{R}^{m\times m}$ and the matrix $\hat V$ to an orthogonal matrix $V \in \mathbb{R}^{n\times n}$. If $\hat U$ and $\hat V$ are replaced by $U$ and $V$ in Equation (8.3), then $\hat\Sigma$ has to change too. We can add an $(m-r) \times r$ block of zeros under the matrix $\hat\Sigma$ and an $m \times (n-r)$ block of zeros on the right of $\hat\Sigma$ to form a new matrix $\Sigma$:
$$\Sigma = \begin{bmatrix} \hat\Sigma & 0 \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{m\times n}.$$
This way, we have
$$A V = U \Sigma, \qquad (8.4)$$
where both $U$ and $V$ are orthogonal matrices. This is equivalent to Equation (8.3), as the additional columns of $U$ are multiplied by zeros and the additional columns of $V$ are in the null space of $A$. Multiplying both sides of Equation (8.4) by $V^\top$ on the right, we obtain the full SVD.

Definition 8.3 For a matrix $A \in \mathbb{R}^{m\times n}$, where $m > n$, the full singular value decomposition is defined by an orthogonal matrix $U \in \mathbb{R}^{m\times m}$, an orthogonal matrix $V \in \mathbb{R}^{n\times n}$, and a diagonal matrix $\Sigma \in \mathbb{R}^{m\times n}$ with non-negative diagonal entries in the form of
$$A = U \Sigma V^\top. \qquad (8.5)$$

Definition 8.4 By eliminating those columns of $U$ and $V$ that are multiplied by zeros in $\Sigma$ in the full SVD, we can also define the reduced singular value decomposition as
$$A = \hat U \hat\Sigma \hat V^\top = \sum_{i=1}^{r} \sigma_i\, \vec u_i\, \vec v_i^\top. \qquad (8.6)$$

The full SVD is often useful in deriving properties of a matrix, whereas the reduced SVD is often very valuable for computational tasks. The full SVD and the reduced SVD can be summarised by the following figure:

Remark 8.5 Considering the full SVD factorisation $A = U \Sigma V^\top$, the linear transformation defined by $A$ can be decomposed into several steps (as shown in Figure 8.1.2):
1. Given a unit sphere in the row space of the matrix $A$.
2. Multiplication with $V^\top$. This is a rotation (possibly combined with a reflection), since $V$ is an orthogonal matrix.
3. Multiplication with $\Sigma$.
The diagonal matrix $\Sigma$ stretches the new unit sphere along its canonical basis vectors (grey lines) with the singular values $\sigma_1, \sigma_2, \ldots$.
4. Multiplication with $U$. This is another rotation (possibly combined with a reflection), since $U$ is also an orthogonal matrix.

Thus, the SVD connects the four fundamental subspaces of a linear transformation:
1. $\vec v_1, \vec v_2, \ldots, \vec v_r$: an orthonormal basis for the row space of $A$, row($A$)
2. $\vec u_1, \vec u_2, \ldots, \vec u_r$: an orthonormal basis for the column space of $A$, col($A$)
3. $\vec v_{r+1}, \ldots, \vec v_n$: an orthonormal basis for the null space of $A$, null($A$)
4. $\vec u_{r+1}, \ldots, \vec u_m$: an orthonormal basis for the left null space of $A$, null($A^\top$)

Figure 8.1.2: Geometrical interpretation of the SVD.

Remark 8.6: The $m < n$ case
For a matrix $A \in \mathbb{R}^{m\times n}$, where $m < n$, both the reduced SVD and the full SVD can also be defined. A quick way of doing so is to apply the above process to the matrix $A^\top$.

8.1.3 Properties of SVD

It is important to know that the SVD exists for any general matrix $A \in \mathbb{R}^{m\times n}$.

Theorem 8.7 Every matrix $A \in \mathbb{R}^{m\times n}$ has a singular value decomposition.

Proof. This can be shown using induction; we omit the proof here.

As stated in Remark 8.5, the SVD can characterise all four fundamental subspaces of a matrix. Here we use the full SVD of a matrix to explore some important properties of a matrix.

Theorem 8.8 The rank of a matrix $A \in \mathbb{R}^{m\times n}$ is equal to the number of its nonzero singular values.

Proof. Consider the full SVD $A = U \Sigma V^\top$. Suppose that there are $r$ nonzero singular values, and hence rank($\Sigma$) $= r$, as the rank of a diagonal matrix is equal to the number of its nonzero entries. Since $U$ and $V$ are full rank, we have rank($A$) $=$ rank($\Sigma$) $= r$.

Theorem 8.9 The Frobenius norm of a matrix $A \in \mathbb{R}^{m\times n}$ is equal to the square root of the sum of squares of its nonzero singular values, i.e.,
$$\|A\|_F = \sqrt{\sum_{i=1}^{r} \sigma_i^2}.$$

Proof. Consider the full SVD $A = U \Sigma V^\top$.
Since the Frobenius norm is preserved under multiplication with orthogonal matrices, we have $\|A\|_F = \|\Sigma\|_F$. Given that $\|\Sigma\|_F = \sqrt{\sum_{i=1}^{r} \sigma_i^2}$, we have $\|A\|_F = \sqrt{\sum_{i=1}^{r} \sigma_i^2}$.

Theorem 8.10 The 2-norm of a matrix $A \in \mathbb{R}^{m\times n}$ is equal to the largest singular value of the matrix $A$, i.e., $\|A\|_2 = \sigma_1$.

Proof. Consider the full SVD $A = U \Sigma V^\top$. Since the 2-norm is preserved under multiplication with orthogonal matrices, we have $\|A\|_2 = \|\Sigma\|_2$. Given that $\|\Sigma\|_2 = \sigma_1$, we have $\|A\|_2 = \sigma_1$.

8.1.4 Compare SVD to Eigendecomposition

The theme of diagonalising a matrix by expressing it in terms of a new basis is not new: it has already been discussed for the eigendecomposition. A nondefective square matrix can be transformed to a diagonal matrix of eigenvalues using a similarity transformation defined by its eigenvectors. For a general nondefective square matrix $A \in \mathbb{R}^{n\times n}$, its eigendecomposition takes the form of
$$A = W \Lambda W^{-1},$$
where $W$ is the matrix of $n$ linearly independent eigenvectors and $\Lambda$ is the diagonal matrix consisting of the eigenvalues of $A$.

The SVD is fundamentally different from the eigendecomposition in several aspects:
1. The SVD uses two bases, $U$ and $V$, whereas the eigendecomposition only uses one.
2. The matrix $W$ in the eigendecomposition may not be orthogonal, but the matrices $U$ and $V$ in the SVD are always orthogonal.
3. The SVD does not require the matrix $A$ to be square, as it works for any matrix.

In applications, the eigendecomposition is usually more relevant to matrix functions, e.g., $A^k$ and $\exp(tA)$. The SVD is usually more relevant to the matrix itself and its inverse.

Real and symmetric matrices have a special eigendecomposition. We know (by Theorem 6.24) that if $A \in \mathbb{R}^{n\times n}$ is symmetric and real-valued, it has orthogonal eigenvectors and the eigendecomposition
$$A = Q \Lambda Q^\top,$$
where $Q$ is the matrix of $n$ orthonormal eigenvectors and $\Lambda$ is the diagonal matrix consisting of the eigenvalues of $A$.
In this case, the singular values of $A$ are just the absolute values of the eigenvalues of $A$. Using the eigendecomposition of $A$, we can express the SVD as
\[ A = Q |\Lambda| \, \mathrm{sign}(\Lambda) Q^\top = Q |\Lambda| \, (Q \, \mathrm{sign}(\Lambda))^\top. \]
The left singular vectors are the same as the eigenvectors, and the right singular vectors are the eigenvectors flipped by the sign of the eigenvalues: if an eigenvalue is negative, we set the singular value to be the absolute value of the eigenvalue, and multiply the corresponding eigenvector(s) by $-1$ to obtain the right singular vectors.

8.2 Computing SVD

8.2.1 Connection with Eigenvalue Solvers

Computing orthonormal bases $\{\vec{u}_i\}_{i=1}^{r}$ and $\{\vec{v}_i\}_{i=1}^{r}$ for the column space and the row space of a matrix $A \in \mathbb{R}^{m \times n}$ is easy; for example, the Gram-Schmidt process can be used for this purpose. However, in general, there is no reason to expect the matrix $A$ to transform an arbitrary choice of basis $\{\vec{v}_i\}_{i=1}^{r}$ into another orthogonal basis. For a general rank-$r$ matrix $A$ with $m$ rows and $n$ columns, the SVD aims at finding an orthonormal basis $\{\vec{v}_i\}_{i=1}^{r}$ for the row space of $A$ that gets transformed into an orthonormal basis $\{\vec{u}_i\}_{i=1}^{r}$ for the column space of $A$, stretched by some $\{\sigma_i\}_{i=1}^{r}$, i.e.,
\[ A \vec{v}_i = \vec{u}_i \sigma_i, \quad \sigma_i > 0, \quad i = 1, \ldots, r. \]
The key step towards finding the orthogonal matrices $U$ and $V$ is to use the full SVD $A = U \Sigma V^\top$. Rather than solving for $U$, $V$ and $\Sigma$ simultaneously, we can take the following steps to obtain the SVD of a matrix $A$ (assuming $m > n$):

1. Multiply both sides by $A^\top = V \Sigma^\top U^\top$ on the left to get
\[ A^\top A = V \Sigma^\top U^\top U \Sigma V^\top = V \Sigma^2 V^\top = \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \cdots & \vec{v}_n \end{bmatrix} \begin{bmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_n^2 \end{bmatrix} \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \cdots & \vec{v}_n \end{bmatrix}^\top, \]
where $\Sigma^2$ is shorthand for the $n \times n$ diagonal matrix $\Sigma^\top \Sigma$. This problem can be solved by the eigendecomposition of the symmetric $n \times n$ matrix $A^\top A$, where $\{\vec{v}_i\}_{i=1}^{n}$ are the eigenvectors and $\{\sigma_i^2\}_{i=1}^{n}$ are the eigenvalues.
2. Compute the eigendecomposition $A^\top A = V \Lambda V^\top$. Then the columns of $V$ are the right singular vectors and $\Sigma = \sqrt{\Lambda}$ contains the singular values.
3. Solve the linear system $U \Sigma = A V$ to obtain the left singular vectors $U$. In the absence of numerical error, this is equivalent to computing the eigendecomposition $A A^\top = U \Sigma^2 U^\top$. Note that we have at most $\min\{m, n\}$ nonzero eigenvalues.

Remark 8.11
The above method is widely used in many areas for computing the SVD of a matrix, for example, in principal component analysis. However, a major shortfall of this method is that it is not numerically stable for computing singular values $\sigma_i \ll \|A\|$.

Suppose we have an input matrix $A$. The floating point representation of $A$ has an error on the order of $\epsilon_{\mathrm{machine}} \|A\|$. A numerically stable algorithm requires that the error of an estimated singular value $\tilde{\sigma}_i$ is on the order of $\epsilon_{\mathrm{machine}} \|A\|$. That is,
\[ |\tilde{\sigma}_i - \sigma_i| = O(\epsilon_{\mathrm{machine}} \|A\|). \]
In the above process, the error in estimating the eigenvalues of $A^\top A$ (the singular values squared) using a numerically stable eigenvalue solver is about
\[ |\tilde{\sigma}_i^2 - \sigma_i^2| = O(\epsilon_{\mathrm{machine}} \|A^\top A\|). \]
The error of computing the square root to find $\tilde{\sigma}_i$ is on the order of $|\tilde{\sigma}_i^2 - \sigma_i^2| / \sigma_i$. Thus, the error of an estimated singular value $\tilde{\sigma}_i$ using the above process is
\[ |\tilde{\sigma}_i - \sigma_i| = O\!\left( \frac{\epsilon_{\mathrm{machine}} \|A^\top A\|}{\sigma_i} \right) = O\!\left( \frac{\epsilon_{\mathrm{machine}} \|A\|^2}{\sigma_i} \right). \]
An intuitive way to understand this is the following: the product $A^\top A$ amplifies the numerical error quadratically in the eigenvalue estimation step, and then the absolute error in computing a singular value (by taking the square root of an eigenvalue) is on the order of the error of the estimated eigenvalue divided by $\sigma_i$. This way, the above method is usually fine for computing dominant singular values, i.e., $\sigma_i \gg 0$ (comparable in size to $\|A\|$). However, for those singular values with $\sigma_i \ll \|A\|$, the resulting singular value estimate will be dominated by the error.

8.2.2 A Different Connection with Eigenvalue Solvers

An alternative way of computing the SVD of $A \in \mathbb{R}^{m \times n}$ using an eigendecomposition is to consider the following $(n+m)$-by-$(n+m)$ matrix
\[ S = \begin{bmatrix} 0 & A^\top \\ A & 0 \end{bmatrix}. \]
An eigenvector and eigenvalue of the matrix $S$ can be expressed as
\[ \begin{bmatrix} 0 & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \vec{v} \\ \vec{u} \end{bmatrix} = \lambda \begin{bmatrix} \vec{v} \\ \vec{u} \end{bmatrix}, \]
where $\vec{v} \in \mathbb{R}^n$ and $\vec{u} \in \mathbb{R}^m$. This equation leads to
\[ \begin{cases} A^\top \vec{u} = \lambda \vec{v}, \\ A \vec{v} = \lambda \vec{u}, \end{cases} \]
which implies $A^\top A \vec{v} = \lambda^2 \vec{v}$ and $A A^\top \vec{u} = \lambda^2 \vec{u}$. Thus, if the matrix $S$ has an eigenvalue $\lambda \geq 0$, then the corresponding eigenvector $\begin{bmatrix} \vec{v} \\ \vec{u} \end{bmatrix}$ defines a pair of right and left singular vectors, given that both $\vec{v}$ and $\vec{u}$ are unit vectors. The eigenvalue $\lambda \geq 0$ defines the corresponding singular value.

We note that if $\lambda$ is an eigenvalue of $S$, then $-\lambda$ is also an eigenvalue, associated with the eigenvector $\begin{bmatrix} \vec{v} \\ -\vec{u} \end{bmatrix}$. This can be easily verified by
\[ \begin{bmatrix} 0 & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \vec{v} \\ -\vec{u} \end{bmatrix} = -\lambda \begin{bmatrix} \vec{v} \\ -\vec{u} \end{bmatrix}. \]
Thus, both the singular values of a matrix $A$ and their negatives are eigenvalues of $S$.

Now we can express the eigendecomposition of the matrix $S$ by the SVD $A = U \Sigma V^\top$, and vice versa. We consider the case where $A$ is a square matrix (i.e., $m = n$); in fact, computing the SVD of a general matrix with $m \neq n$ can be effectively reduced to computing the SVD of a square matrix, as will be shown in a later part of this section. This way, we have
\[ \begin{bmatrix} 0 & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} V & V \\ U & -U \end{bmatrix} = \begin{bmatrix} V & V \\ U & -U \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix}. \]
Since singular vectors are unit vectors, we can normalise an eigenvector $\begin{bmatrix} \vec{v} \\ \vec{u} \end{bmatrix}$ or $\begin{bmatrix} \vec{v} \\ -\vec{u} \end{bmatrix}$ by scaling it by a factor of $1/\sqrt{2}$. Thus, using the orthogonal matrix
\[ Q = \frac{1}{\sqrt{2}} \begin{bmatrix} V & V \\ U & -U \end{bmatrix}, \]
we can express the eigendecomposition of $S$ in the form
\[ S = Q \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} Q^\top. \]
Therefore, the SVD can be obtained by computing the eigendecomposition of the matrix $S$. In contrast to the method using the eigendecomposition of $A^\top A$, the new method is numerically stable as it does not involve the square root of eigenvalues.

Remark 8.12
In practice, the matrix $S$ is never formed explicitly. Factorisations of $S$, such as the QR factorisation and the eigendecomposition, can be obtained by using the matrix $A$ and the symmetry.
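The difference between the two eigenvalue routes can be checked numerically. The following is a minimal NumPy sketch rather than anything prescribed by these notes: the test matrix, its size, the chosen singular values and the helper name `random_orthogonal` are illustrative assumptions. The matrix $S$ is formed explicitly here purely for demonstration, which is precisely what Remark 8.12 says is avoided in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Hypothetical helper: orthogonal factor of a random Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

n = 3
sigma_true = np.array([1.0, 1e-3, 1e-6])   # assumed singular values, so ||A||_2 = 1
A = random_orthogonal(n) @ np.diag(sigma_true) @ random_orthogonal(n).T

# Route 1: eigenvalues of A^T A are sigma_i^2, so a square root is needed.
lam = np.linalg.eigvalsh(A.T @ A)                   # returned in ascending order
sigma_gram = np.sqrt(np.maximum(lam, 0.0))[::-1]    # clip tiny negatives, sort descending

# Route 2: the n largest eigenvalues of S = [[0, A^T], [A, 0]] are the sigma_i.
S = np.block([[np.zeros((n, n)), A.T], [A, np.zeros((n, n))]])
sigma_S = np.linalg.eigvalsh(S)[::-1][:n]           # n largest, descending

rel_err_gram = np.abs(sigma_gram - sigma_true) / sigma_true
rel_err_S = np.abs(sigma_S - sigma_true) / sigma_true
print("relative error via A^T A:", rel_err_gram)
print("relative error via S    :", rel_err_S)
```

With these assumed values one would expect the smallest singular value to lose several digits along the $A^\top A$ route, in line with the bound in Remark 8.11, while the $S$ route should stay close to machine precision relative to $\|A\|$.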
8.2.3 Bidiagonalisation

As with eigenvalue solvers, algorithms for computing the SVD often have two phases. In Phase 1, the matrix $A$ is reduced to a bidiagonal form, in order to save floating point operations in computing the eigendecomposition of $S$ or $A^\top A$. In Phase 2, eigenvalue solvers such as the shifted QR algorithm can be used to diagonalise $S$ or $A^\top A$, and hence $A$, to find the singular values. This process is shown in the following:
\[
\underbrace{\begin{bmatrix} \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times \end{bmatrix}}_{A}
\xrightarrow[U_0^\top A V_0]{\text{Phase 1}}
\underbrace{\begin{bmatrix} \times&\times&&\\ &\times&\times&\\ &&\times&\times\\ &&&\times\\ &&&\\ &&&\\ &&& \end{bmatrix}}_{\text{Bidiagonal } B}
\xrightarrow[U^\top B V]{\text{Phase 2}}
\underbrace{\begin{bmatrix} \times&&&\\ &\times&&\\ &&\times&\\ &&&\times\\ &&&\\ &&&\\ &&& \end{bmatrix}}_{\text{Diagonal } \Sigma}
\]
We will focus on Phase 1 of this process and omit the details of Phase 2.

Remark 8.13
Suppose $A \in \mathbb{R}^{m \times n}$ and $m > n$. In the bidiagonalisation step, both $U_0 \in \mathbb{R}^{m \times m}$ and $V_0 \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and the last $m - n$ rows of $B$ are zero, i.e., $B = \begin{bmatrix} \hat{B} \\ 0 \end{bmatrix}$. Consider the nonzero block of the matrix $B$, denoted by $\hat{B}$, and its SVD, $\hat{B} = U_B \hat{\Sigma} V_B^\top$, where $U_B, \hat{\Sigma}, V_B \in \mathbb{R}^{n \times n}$. Constructing the orthogonal matrix
\[ Q_B = \begin{bmatrix} U_B & 0 \\ 0 & I \end{bmatrix} \]
and the zero-padded matrix
\[ \Sigma = \begin{bmatrix} \hat{\Sigma} \\ 0 \end{bmatrix}, \]
we can define the SVD of the matrix $B$ as $B = Q_B \Sigma V_B^\top$. Since $B = U_0^\top A V_0$, we have $A = U_0 B V_0^\top$, and thus
\[ A = U_0 Q_B \Sigma V_B^\top V_0^\top = \underbrace{(U_0 Q_B)}_{U} \, \Sigma \, \underbrace{(V_0 V_B)^\top}_{V^\top}. \]
This way, computing the SVD of the original matrix $A$ can be effectively reduced to computing the SVD of an $n$-by-$n$ matrix $\hat{B}$.

Golub-Kahan Bidiagonalisation

The goal of bidiagonalisation is to multiply the matrix $A$ by a sequence of unitary/orthogonal matrices on the left, and another sequence of unitary/orthogonal matrices on the right, to obtain a bidiagonal matrix that has zeros below its diagonal and zeros above its first superdiagonal.

This process is significantly different from the reduction of a matrix to the tridiagonal form.
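In the reduction to the tridiagonal form, the input matrix should be square and the same sequence of unitary/orthogonal matrices is applied on both sides of the matrix. In the bidiagonalisation, the input matrix does not need to be square, and two different sequences of unitary/orthogonal matrices are applied on the left and on the right of the matrix; the numbers of matrices applied in the two sequences are not necessarily the same.

The simplest method for accomplishing this is the Golub-Kahan bidiagonalisation. It applies Householder reflections alternately on the left and on the right of a matrix. The left Householder reflection aims to introduce zeros below the diagonal, whereas the right Householder reflection aims to introduce zeros to the right of the first superdiagonal. This way, zeros introduced by the left Householder reflection will not be modified by the right Householder reflection, and previously introduced zeros will not be modified by later Householder reflections. This process can be demonstrated by the following example.

Example 8.14
Consider a matrix $A \in \mathbb{R}^{7 \times 4}$. Applying Householder reflections alternately on the left and on the right of $A$ produces a bidiagonal form. This Golub-Kahan bidiagonalisation can be shown as:
\[
\underbrace{\begin{bmatrix} \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times\\ \times&\times&\times&\times \end{bmatrix}}_{A}
\xrightarrow{U_1^\top (\cdot)}
\underbrace{\begin{bmatrix} \times&\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times \end{bmatrix}}_{U_1^\top A}
\xrightarrow{(\cdot) V_1}
\underbrace{\begin{bmatrix} \times&\times&&\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times\\ &\times&\times&\times \end{bmatrix}}_{U_1^\top A V_1}
\]
\[
\xrightarrow{U_2^\top (\cdot)}
\underbrace{\begin{bmatrix} \times&\times&&\\ &\times&\times&\times\\ &&\times&\times\\ &&\times&\times\\ &&\times&\times\\ &&\times&\times\\ &&\times&\times \end{bmatrix}}_{U_2^\top U_1^\top A V_1}
\xrightarrow{(\cdot) V_2}
\underbrace{\begin{bmatrix} \times&\times&&\\ &\times&\times&\\ &&\times&\times\\ &&\times&\times\\ &&\times&\times\\ &&\times&\times\\ &&\times&\times \end{bmatrix}}_{U_2^\top U_1^\top A V_1 V_2}
\]
\[
\xrightarrow{U_3^\top (\cdot)}
\underbrace{\begin{bmatrix} \times&\times&&\\ &\times&\times&\\ &&\times&\times\\ &&&\times\\ &&&\times\\ &&&\times\\ &&&\times \end{bmatrix}}_{U_3^\top U_2^\top U_1^\top A V_1 V_2}
\xrightarrow{U_4^\top (\cdot)}
\underbrace{\begin{bmatrix} \times&\times&&\\ &\times&\times&\\ &&\times&\times\\ &&&\times\\ &&&\\ &&&\\ &&& \end{bmatrix}}_{U_4^\top U_3^\top U_2^\top U_1^\top A V_1 V_2}
\]
The four left multiplications introduce zeros below the diagonal, and the two right multiplications introduce zeros above the first superdiagonal. For a matrix $A \in \mathbb{R}^{m \times n}$, $n$ Householder reflections have to be applied on the left and $n - 2$ Householder reflections have to be applied on the right.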
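The alternating pattern of Example 8.14 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production routine: the function names `householder_vector` and `golub_kahan_bidiagonalise` are hypothetical, the input is assumed to satisfy $m \geq n$, and the orthogonal factors are accumulated explicitly for clarity even though this costs extra work.

```python
import numpy as np

def householder_vector(x):
    """Unit vector v such that (I - 2 v v^T) x is a multiple of e_1."""
    v = np.array(x, dtype=float)          # copy, so the caller's data is untouched
    norm_x = np.linalg.norm(v)
    if norm_x == 0.0:
        return v                          # zero vector: the reflection is the identity
    v[0] += np.copysign(norm_x, v[0])     # sign choice avoids cancellation
    return v / np.linalg.norm(v)

def golub_kahan_bidiagonalise(A):
    """Return U, B, V with A = U @ B @ V.T and B upper bidiagonal (assumes m >= n)."""
    B = np.array(A, dtype=float)
    m, n = B.shape
    U, V = np.eye(m), np.eye(n)
    for k in range(n):
        if k < m - 1:                     # left reflection: zeros below B[k, k]
            v = householder_vector(B[k:, k])
            B[k:, :] -= 2.0 * np.outer(v, v @ B[k:, :])
            U[:, k:] -= 2.0 * np.outer(U[:, k:] @ v, v)
        if k < n - 2:                     # right reflection: zeros right of B[k, k+1]
            v = householder_vector(B[k, k + 1:])
            B[:, k + 1:] -= 2.0 * np.outer(B[:, k + 1:] @ v, v)
            V[:, k + 1:] -= 2.0 * np.outer(V[:, k + 1:] @ v, v)
    return U, B, V

rng = np.random.default_rng(1)
A = rng.standard_normal((7, 4))           # same shape as in Example 8.14
U, B, V = golub_kahan_bidiagonalise(A)
print(np.round(B, 3))                     # only the diagonal and first superdiagonal remain
```

For the $7 \times 4$ input the loop applies four left reflections and two right reflections, matching the counts stated in the example.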
The total work of the Golub-Kahan bidiagonalisation is about double the work of the QR factorisation: the left Householder reflections have the same work as computing the QR factorisation of $A$, and the right Householder reflections have the same work as computing the QR factorisation of $A^\top$ except the first row. Thus the total work of the Golub-Kahan bidiagonalisation is $\sim 4mn^2 - \frac{4}{3}n^3$ flops.

Lawson-Hanson-Chan Bidiagonalisation

For the case where $m \gg n$, the total work of the Golub-Kahan bidiagonalisation is unnecessarily high. If we know that the matrix in bidiagonal form has zeros below its $n$-th row, then the right Householder reflections in the bidiagonalisation process should avoid modifying those entries. This can be accomplished by first applying a QR factorisation to the input matrix, and then applying the Golub-Kahan bidiagonalisation to the upper triangular matrix to reduce it to the bidiagonal form. This procedure is called the Lawson-Hanson-Chan (LHC) bidiagonalisation.

In the LHC bidiagonalisation, the work of the QR step is $\sim 2mn^2 - \frac{2}{3}n^3$ flops, and the work of the subsequent bidiagonalisation of the $n \times n$ upper triangular matrix is about $4n^3 - \frac{4}{3}n^3 = \frac{8}{3}n^3$ flops. Thus, the total work of the LHC bidiagonalisation is $\sim 2mn^2 + 2n^3$ flops. This requires fewer operations than the Golub-Kahan bidiagonalisation if $m > \frac{5}{3}n$, since $4mn^2 - \frac{4}{3}n^3 > 2mn^2 + 2n^3$ exactly when $2mn^2 > \frac{10}{3}n^3$.

From Bidiagonal Form of $A$ to Tridiagonal Form of $A^\top A$ and $S$

We have seen that in Phase 1 of an eigenvalue solver, a symmetric matrix can be reduced to a tridiagonal matrix. In computing the SVD, reducing a matrix to a bidiagonal form is the analogue of Phase 1 of eigenvalue solvers. In fact, reducing a matrix $A$ to a bidiagonal form is equivalent to reducing the matrices $S$ and $A^\top A$ to a tridiagonal form.

As shown in Remark 8.13, computing the SVD of a general matrix $A \in \mathbb{R}^{m \times n}$ with $m > n$ can be effectively reduced to computing the SVD of a square bidiagonal matrix $B \in \mathbb{R}^{n \times n}$.
This way, computing the SVD using the eigendecomposition of $A^\top A$ is reduced to finding the eigendecomposition of $B^\top B$. It is easy to verify that the matrix $B^\top B$ is a symmetric tridiagonal matrix.

For computing the SVD using the eigendecomposition of the matrix $S$, we are effectively solving the eigendecomposition of the matrix
\[ S_B = \begin{bmatrix} 0 & B^\top \\ B & 0 \end{bmatrix}. \]
This matrix $S_B$ can be brought to a tridiagonal form by swapping rows and columns using an orthogonal similarity transformation defined by some permutation matrix. Modified shifted QR algorithms (which can adapt to the structure of $S_B$) have been developed to solve the eigendecomposition of $S_B$. We leave it at this.

8.3 Low Rank Matrix Approximation using SVD

Recall the reduced singular value decomposition of a rank-$r$ matrix $A \in \mathbb{R}^{m \times n}$,
\[ A = \hat{U} \hat{\Sigma} \hat{V}^\top = \sum_{i=1}^{r} \sigma_i \vec{u}_i \vec{v}_i^\top. \]
This decomposition into a sum of rank-one matrices, $\sigma_i \vec{u}_i \vec{v}_i^\top$, has a celebrated property: the $k$-th partial sum captures as much of the energy of the matrix $A$ as possible. Here the "energy" is defined by either the 2-norm or the Frobenius norm.

Definition 8.15
Given the SVD of the matrix $A \in \mathbb{R}^{m \times n}$, the truncated singular value decomposition is defined by retaining only the first $k$ singular values and the first $k$ left and right singular vectors. Let $A = U \Sigma V^\top$; then the truncated SVD takes the form
\[ A \approx A_k := \underbrace{U_k}_{U(:,1:k)} \, \underbrace{\Sigma_k}_{\Sigma(1:k,1:k)} \, \underbrace{V_k^\top}_{V(:,1:k)^\top} = \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^\top, \qquad (8.7) \]
for $k < r$. The matrix $A_k = \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^\top$ is a rank-$k$ approximation to $A$.

Theorem 8.16
Given a matrix $A$ and its SVD, the rank-$k$ approximation $A_k$, where $k < r$, defined by the truncated SVD provides the best approximation to $A$ in either the 2-norm or the Frobenius norm. That is,
\[ \|A - A_k\|_2 \leq \|A - B\|_2 \quad \text{for all } B \in \mathbb{R}^{m \times n} \text{ with rank } k, \]
and
\[ \|A - A_k\|_F \leq \|A - B\|_F \quad \text{for all } B \in \mathbb{R}^{m \times n} \text{ with rank } k. \]

Proof. This can be shown by contradiction. We omit the proof here.
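Definition 8.15 translates almost directly into array slicing. The sketch below is a minimal NumPy illustration on an assumed random test matrix; the function name `truncated_svd` is hypothetical. It forms $A_k$ and compares the truncation error in both norms with the discarded singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k truncation A_k together with all singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD, s sorted descending
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]                # sum of the first k rank-one terms
    return A_k, s

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 5))      # assumed test matrix
k = 2
A_k, s = truncated_svd(A, k)

err_2 = np.linalg.norm(A - A_k, 2)        # spectral norm of the truncation error
err_F = np.linalg.norm(A - A_k, "fro")    # Frobenius norm of the truncation error
print(err_2, s[k])                            # s[k] is sigma_{k+1} with 0-based indexing
print(err_F, np.sqrt(np.sum(s[k:] ** 2)))     # root of the sum of discarded sigma_i^2
```

Each printed pair should agree up to rounding, since $A - A_k$ is itself a sum of the discarded rank-one terms.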
Example 8.17
A natural application of this theorem is that we can compress a data set or a picture using the truncated SVD. A matrix $A \in \mathbb{R}^{m \times n}$ requires $mn$ floating-point numbers of memory to store, whereas its truncated SVD only requires $mk + nk + k = (m + n + 1)k$ floating-point numbers. Following Theorems 8.9 and 8.10, the compression error in terms of the Frobenius norm and the 2-norm is given by the residual singular values after the truncation, i.e.,
\[ \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}, \qquad \|A - A_k\|_2 = \sigma_{k+1}. \]
Considering the following grey scale picture (on the left) that consists of $900 \times 703$ pixels, we can treat it as a matrix, and hence the truncated SVD can be applied to compress this image. The picture on the right shows the compressed image created by the truncated SVD with $k = 30$.

8.4 Pseudo Inverse and Least Square Problems using SVD

Recall the linear least-squares (LS) problem:

Definition 8.18: Least-Squares Problem
Let $A \in \mathbb{R}^{m \times n}$ with $m > n$. Find $\vec{x}$ that minimises $f(\vec{x}) = \|\vec{b} - A\vec{x}\|_2^2$.

Example 8.19: Polynomial Least Square Fitting
Suppose we have $m$ distinct points $s_1, s_2, \ldots, s_m \in \mathbb{R}$ and data $b_1, b_2, \ldots, b_m \in \mathbb{R}$ observed at these points. We aim to find a polynomial of degree $n - 1$,
\[ p(s) = x_1 + x_2 s + \cdots + x_n s^{n-1} = \sum_{i=1}^{n} x_i s^{i-1}, \]
defined by coefficients $\{x_i\}_{i=1}^{n}$, that best fits the data in the least square sense. The relationship of the data $\{s_i\}_{i=1}^{m}$, $\{b_i\}_{i=1}^{m}$ to the coefficients $\{x_i\}_{i=1}^{n}$ can be expressed by the Vandermonde system as:
\[
\underbrace{\begin{bmatrix} 1 & s_1 & s_1^2 & \cdots & s_1^{n-1} \\ 1 & s_2 & s_2^2 & \cdots & s_2^{n-1} \\ 1 & s_3 & s_3^2 & \cdots & s_3^{n-1} \\ \vdots & & & & \vdots \\ 1 & s_{m-1} & s_{m-1}^2 & \cdots & s_{m-1}^{n-1} \\ 1 & s_m & s_m^2 & \cdots & s_m^{n-1} \end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}}_{\vec{x}}
=
\underbrace{\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{m-1} \\ b_m \end{bmatrix}}_{\vec{b}}
\]
To determine the coefficients $\{x_i\}_{i=1}^{n}$ from data, we can solve the least square system $A\vec{x} = \vec{b}$. The following figure presents an example of this process. We have 51 data points, which are values of the function $\sin(10s)$ observed at the discrete points $0, 0.02, 0.04, \ldots, 1$, represented by crosses. We construct a polynomial of degree 11 to fit this data set.

Pseudoinverse

One way to solve the least square problem is to solve the normal equation
\[ A^\top A \vec{x} = A^\top \vec{b}. \qquad (8.8) \]
This leads to the definition of the pseudoinverse of a matrix.

Definition 8.20
For a matrix $A \in \mathbb{R}^{m \times n}$ with full column rank, the matrix $(A^\top A)^{-1} A^\top$ is called the pseudoinverse of $A$, denoted by $A^+$,
\[ A^+ = (A^\top A)^{-1} A^\top \in \mathbb{R}^{n \times m}. \]

Using the pseudoinverse, the solution of the normal equation can be expressed as $\vec{x} = A^+ \vec{b}$. Defining the projector $P = A A^+$, which is an orthogonal projector onto $\mathrm{range}(A)$, the solution $\vec{x}$ minimising the least square problem satisfies
\[ A \vec{x} = P \vec{b}, \]
where the right hand side is the data projected onto the range of $A$.

Theorem 8.21
Given the pseudoinverse of a matrix $A$, denoted by $A^+$, the matrix $P = A A^+$ is an orthogonal projector onto $\mathrm{range}(A)$.

QR

Solving Equation (8.8) is computationally fast but can be numerically unstable. The practical method for solving the least square problem uses the reduced QR factorisation $A = \hat{Q} \hat{R}$. This way, the projection onto the range of $A$ is defined by $P = \hat{Q} \hat{Q}^\top$. Then the equation $A \vec{x} = P \vec{b}$ can be expressed as
\[ \hat{Q} \hat{R} \vec{x} = \hat{Q} \hat{Q}^\top \vec{b}, \]
and left-multiplication by $\hat{Q}^\top$ leads to
\[ \hat{R} \vec{x} = \hat{Q}^\top \vec{b}. \qquad (8.9) \]

Remark 8.22
Multiplying by $\hat{R}^{-1}$ leads to an alternative definition of the pseudoinverse in the form
\[ A^+ = \hat{R}^{-1} \hat{Q}^\top. \qquad (8.10) \]

SVD

Alternatively, the SVD provides a geometrically intuitive way to understand and solve the least square problem. This is particularly useful for rank-deficient systems and the case $m < n$ (e.g., X-ray imaging). Suppose the matrix $A \in \mathbb{R}^{m \times n}$ has a rank-$r$ reduced SVD
\[ A = \hat{U} \hat{\Sigma} \hat{V}^\top = \sum_{i=1}^{r} \sigma_i \vec{u}_i \vec{v}_i^\top. \]
Recall that the columns of $\hat{V}$ span the row space of $A$, the columns of $\hat{U}$ span the column space (range) of $A$, and $\hat{\Sigma}$ represents the stretching effect of the linear transformation.
The left singular vectors define an orthogonal projector $P = \hat{U} \hat{U}^\top$. The data $\vec{b}$ can be projected onto the range of $A$, spanned by the columns of $\hat{U}$. The projected data, $P\vec{b}$, can be expressed as a linear combination of the columns of $\hat{U}$; the associated coefficients are given by the vector $\hat{U}^\top \vec{b} \in \mathbb{R}^r$. Then the equation $A\vec{x} = P\vec{b}$ can be expressed as
\[ \hat{U} \hat{\Sigma} \hat{V}^\top \vec{x} = \hat{U} \hat{U}^\top \vec{b}, \]
and left-multiplication by $\hat{U}^\top$ leads to
\[ \hat{\Sigma} \hat{V}^\top \vec{x} = \hat{U}^\top \vec{b}. \qquad (8.11) \]
Solving this equation we obtain the least square solution
\[ \vec{x} = \hat{V} \hat{\Sigma}^{-1} \hat{U}^\top \vec{b}. \qquad (8.12) \]
This way, we know that the least square solution $\vec{x}$ is a linear combination of the columns of $\hat{V}$ (the associated coefficients are given by the vector $\hat{V}^\top \vec{x} \in \mathbb{R}^r$) and hence it is in the row space of $A$.

The least square system can be understood as follows: projecting the data onto the range of the matrix $A$ (defining $\vec{q} = \hat{U}^\top \vec{b} \in \mathbb{R}^r$), we seek a solution $\vec{x}$ to the least square problem in the row space of $A$. Expressing the solution $\vec{x}$ as a linear combination of the columns of $\hat{V}$, $\vec{x} = \hat{V} \vec{p}$, where $\vec{p} = \hat{V}^\top \vec{x} \in \mathbb{R}^r$, the least square problem reduces to an $r$-dimensional linear system
\[ \hat{\Sigma} \vec{p} = \vec{q}. \]
Recalling the geometric interpretation of the SVD (Figure 8.1.2), solving the least square problem effectively inverts the stretching effect of a linear transform within the rank-$r$ row space and column space of $A$. This will be the key to understanding X-ray imaging in the next section.

Remark 8.23
The SVD also defines the pseudoinverse of $A$ in the form
\[ A^+ = \hat{V} \hat{\Sigma}^{-1} \hat{U}^\top. \qquad (8.13) \]

Algorithm 8.24: Least Squares via SVD
Given a matrix $A \in \mathbb{R}^{m \times n}$ and the data $\vec{b} \in \mathbb{R}^m$, the solution $\vec{x}$ of the least square problem $f(\vec{x}) = \|\vec{b} - A\vec{x}\|_2^2$ can be obtained as follows:
1. Compute the reduced SVD, $A = \hat{U} \hat{\Sigma} \hat{V}^\top$.
2. Compute the vector $\vec{q} = \hat{U}^\top \vec{b} \in \mathbb{R}^r$.
3. Solve the linear system $\hat{\Sigma} \vec{p} = \vec{q}$.
4. Set $\vec{x} = \hat{V} \vec{p}$.
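The four steps of Algorithm 8.24 can be sketched with NumPy as follows. This is a minimal illustration rather than a definitive implementation: the function name `lstsq_svd` and the relative threshold `rtol`, used to decide which singular values count as nonzero (and hence the numerical rank $r$), are assumptions that the algorithm itself does not specify, and the test problem is random.

```python
import numpy as np

def lstsq_svd(A, b, rtol=1e-12):
    """Least-squares solution via the reduced SVD (steps 1 to 4 of the algorithm).

    rtol is an assumed relative threshold for treating singular values
    as nonzero; it determines the numerical rank r.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # step 1: reduced SVD
    r = int(np.sum(s > rtol * s[0]))                   # numerical rank
    q = U[:, :r].T @ b                                 # step 2: q = U^T b
    p = q / s[:r]                                      # step 3: diagonal system Sigma p = q
    return Vt[:r, :].T @ p                             # step 4: x = V p

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 4))     # assumed overdetermined test problem
b = rng.standard_normal(20)
x = lstsq_svd(A, b)

# At a least-squares solution the residual is orthogonal to range(A).
print(np.linalg.norm(A.T @ (A @ x - b)))
# Cross-check against NumPy's own least-squares solver.
print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))
```

Because only the first $r$ singular triplets are used, the same function also handles rank-deficient matrices, in which case it returns the solution lying in the row space of $A$.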
8.5 X-Ray Imaging using SVD

In this section, we use an industrial process imaging problem as the example to demonstrate X-ray imaging. The setup of the problem is shown in Figure 8.5.1. The true object consists of three circular inclusions, each of uniform density, inside an annulus. Ten X-ray sources are positioned on one side of a circle, and each source sends a fan of 100 X-rays that are measured by detectors on the opposite side of the object. Here, the 10 sources are distributed evenly so that they form a total illumination angle of 90 degrees, resulting in a limited-angle X-ray problem. The goal is to reconstruct the density of the object (as an image) from measured X-ray signals.

Figure 8.5.1: Left: discretised domain, true object, sources (red dots), and detectors corresponding to one source (black dots). The fan transmitted by one source is illustrated in grey. The density of the object is 0.006 in the outer ring and 0.004 in the three inclusions; the background density is zero. Right: the noise-free measurements (black line) and the noisy measurements (red dots) for one source.

8.5.1 Mathematical Model

When an X-ray travels through a physical object along a straight line $l(s)$, where $s$ is the spatial coordinate, interaction between radiation and matter lowers the intensity of the ray. Suppose that an X-ray has initial intensity $I_0$ at the radiation source. The intensity $I_1$ measured at the detector is smaller than $I_0$, as the intensity of the X-ray decreases proportionally to the relative intensity loss of the matter along the line $l$. We can represent the relative intensity loss of the matter by an attenuation coefficient function $f(s)$, whose value gives the relative intensity loss of the X-ray within a small distance $ds$,
\[ \frac{dI}{I} = -f(s) \, ds. \]
Density of material is often correlated with the relative intensity loss.
Material with a higher density (e.g., metal) often has a higher attenuation coefficient than material with a lower density (e.g., wood). Thus, recovering the unknown attenuation coefficient function $f(s)$ from X-ray signals is used as a surrogate for reconstructing the actual material density.

Integration from the initial state to the final state along a line $l(s)$ gives
\[ \int_{l(s)} \frac{I'(s)}{I(s)} \, ds = -\int_{l(s)} f(s) \, ds, \]
where the left hand side gives $\log(I_1) - \log(I_0) = \log(I_1 / I_0)$. Thus we have
\[ \log(I_0) - \log(I_1) = \int_{l(s)} f(s) \, ds. \]
Now the left hand side of the above equation is known from measurements ($I_0$ from the equipment setup and $I_1$ from the detector), whereas the right hand side consists of integrals of the unknown function $f(s)$ over straight lines.

8.5.2 Computational Model

Figure 8.5.2: Left: discretised object and an X-ray travelling through it. Right: four pixels from the left side picture and the distances (in these pixels) travelled by the X-ray corresponding to the measurement $d_7$. Distance $a_{i,j}$ corresponds to the element on the $i$-th row and $j$-th column of matrix $F$.

Computationally we can represent the continuous function $f(s)$ by $n$ pixels (or voxels in 3D), as shown in Figure 8.5.2. Now each component of $\vec{x} = [x_1, x_2, \ldots, x_n]^\top$ represents the value of the unknown attenuation coefficient function $f(s)$ in the corresponding pixel. Assuming we have a measurement $d_i$ of the line integral of $f(s)$ over the line $l_i(s)$, we can approximate
\[ d_i = \int_{l_i(s)} f(s) \, ds \approx \sum_{j=1}^{n} a_{i,j} x_j, \]
where $a_{i,j}$ is the distance that the line $l_i(s)$ "travels" in the $j$-th pixel corresponding to $x_j$. If we have $m$ measurements ($m$ X-rays travelling through the object), then we have a linear equation
\[ \vec{d} = F \vec{x}, \]
where $F_{ij} = a_{i,j}$ and $\vec{d} = [d_1, d_2, \ldots, d_m]^\top$.

8.5.3 Image Reconstruction

We move from the problem of computing the observables $\vec{d}$ for a given attenuation coefficient function to the image reconstruction.
Suppose the measurement process can be expressed as
\[ \vec{d} = F \vec{x} + \vec{e}, \]
where $\vec{e}$ represents possible measurement noise of the instrument (as all real world measurements are noisy) and other sources of error in the modelling process.

Remark 8.25
The error in the measurement process is not negligible.

The process of determining $\vec{d}$ given a known $\vec{x}$ is called the forward problem. In contrast, image reconstruction is an inverse problem where we aim to recover $\vec{x}$ from measured data $\vec{d}$. In many cases, especially in industrial imaging, the X-rays travel through the physical object only from a restricted angle of view and we often have $m < n$. This way, the reconstruction process is very sensitive to measurement error.

To understand the reconstruction process and the role of measurement error, we generate noise-free data by $\vec{d}_t = F \vec{x}_t$ and its noise-corrupted version $\vec{d}_n$ for a given "true" test image $\vec{x}_t$, as shown in Figure 8.5.1. Furthermore, the reduced SVD of $F$, $F = \hat{U} \hat{\Sigma} \hat{V}^\top$, will also be used.

Inverse Crime

Figure 8.5.3: Left: reconstruction from noise-free data. Right: reconstruction from noisy data.

Given a measured data set $\vec{d}_n$, a natural thing to try is to recover $\vec{x}$ by using the pseudoinverse of $F$ (as discussed in the previous section), as $F \in \mathbb{R}^{m \times n}$ may not be invertible. This way, we have the reconstructed image
\[ \vec{x}^+ = F^+ \vec{d}_n = \hat{V} \hat{\Sigma}^{-1} \hat{U}^\top \vec{d}_n. \]
To demonstrate the impact of measurement error, we consider the following experiments:
1. Reconstruct $\vec{x}$ using the noise-free data; this is not realistic in practice.
2. Reconstruct $\vec{x}$ using the noisy data; this is the realistic case.

Figure 8.5.3 shows the reconstructed image for both experiments. Experiment 1 is often referred to as the inverse crime, or too-good-to-be-true reconstruction. This is a reconstruction given perfect knowledge about the measurement process and provided with noise-free data.
In practice, a small error in the data can lead to a rather large error in the reconstruction, as shown in experiment 2. Thus, we aim to find reconstructions that are robust to error.

Reconstruction using truncated SVD

Consider the noisy data generated from a true image $\vec{x}_t$, $\vec{d}_n = F \vec{x}_t + \vec{e}$. The reconstruction using the pseudoinverse can be expressed as
\[ \vec{x}^+ = F^+ \vec{d}_n = \underbrace{\hat{V} \hat{\Sigma}^{-1} \hat{U}^\top}_{F^+} \underbrace{\hat{U} \hat{\Sigma} \hat{V}^\top}_{F} \vec{x}_t + \underbrace{\hat{V} \hat{\Sigma}^{-1} \hat{U}^\top}_{F^+} \vec{e}. \]
Thus, we have
\[ \vec{x}^+ = \hat{V} \hat{V}^\top \vec{x}_t + F^+ \vec{e}. \]
The reconstructed image $\vec{x}^+$ consists of the true image $\vec{x}_t$ projected onto the row space of $F$ and the pseudoinverse applied to the noise, $F^+ \vec{e}$. This way the reconstruction error $\|\vec{x}^+ - \vec{x}_t\|$ can be bounded as
\[ \|\vec{x}^+ - \vec{x}_t\| = \|(\hat{V} \hat{V}^\top - I) \vec{x}_t + F^+ \vec{e}\| \leq \|(I - \hat{V} \hat{V}^\top) \vec{x}_t\| + \|F^+\| \|\vec{e}\|, \qquad (8.14) \]
by the triangle inequality. We know that the 2-norm of the pseudoinverse, $\|F^+\|$, is given by $1/\sigma_r$. Then the error bound can be expressed as
\[ \|\vec{x}^+ - \vec{x}_t\| \leq \|(I - \hat{V} \hat{V}^\top) \vec{x}_t\| + \frac{1}{\sigma_r} \|\vec{e}\|. \qquad (8.15) \]
Thus, the reconstruction error is governed by the smallest nonzero singular value. The singular values of the example used here are shown in Figure 8.5.4.

Figure 8.5.4: Singular values of $F$.

To control the reconstruction error, one can use the truncated SVD to define an approximate pseudoinverse of $F$. Given the truncated SVD
\[ F \approx \sum_{i=1}^{k} \sigma_i \vec{u}_i \vec{v}_i^\top, \]
for $k < r$, the rank-$k$ approximate pseudoinverse can be defined as
\[ F_k^+ = V_k \Sigma_k^{-1} U_k^\top = \sum_{i=1}^{k} \frac{1}{\sigma_i} \vec{v}_i \vec{u}_i^\top. \]
This way, the corresponding reconstruction error bound of the reconstructed image
\[ \vec{x}_k^+ = V_k \Sigma_k^{-1} U_k^\top \vec{d}_n \]
takes the form
\[ \|\vec{x}_k^+ - \vec{x}_t\| \leq \underbrace{\|(I - V_k V_k^\top) \vec{x}_t\|}_{\text{Representation error}} + \frac{1}{\sigma_k} \|\vec{e}\|. \qquad (8.16) \]
Figure 8.5.5 shows the reconstructed image using $k = 50$, $k = 500$, and $k = 940$.

Figure 8.5.5: Left: $k = 50$. Middle: $k = 500$. Right: $k = 940$.

The left reconstructed image in Figure 8.5.5 has a rather large representation error as we truncated the SVD too aggressively.
The right reconstructed image in Figure 8.5.5 is not robust to noise as the last singular value $\sigma_k$ in the truncated SVD is too small. The middle reconstructed image in Figure 8.5.5 seems to achieve a suitable balance.

L-curve

We aim to find a $k$ such that the reconstruction is robust with respect to the noise $\vec{e}$, while having a minimal representation error. If the true image $\vec{x}_t$ and the noise $\vec{e}$ were known, we could precisely compute the reconstruction error to pick the best $k$ that reconstructs the optimal image. However, both $\vec{x}_t$ and $\vec{e}$ are unknown, so we have to derive heuristics for choosing the best $k$.

We can measure the representation error by how well the reconstructed image fits the noisy data, in the form of $\|F \vec{x}_k^+ - \vec{d}_n\|$, and measure the robustness of the reconstruction by $\|\vec{x}_k^+\|$, which is bounded as
\[ \|\vec{x}_k^+\| \leq \|V_k V_k^\top \vec{x}_t\| + \frac{1}{\sigma_k} \|\vec{e}\|. \]
The smaller the former is, the better the reconstructed image explains the data, while the smaller the latter is, the more robust the reconstruction is.

For a suitably chosen $k$, the norm $\|F \vec{x}_k^+ - \vec{d}_n\|$ should be close to the norm of the measurement noise $\|\vec{e}\|$. For a rather small $k$, we expect that the norm $\|F \vec{x}_k^+ - \vec{d}_n\|$ can be rather large. If we increase $k$, then the norm $\|F \vec{x}_k^+ - \vec{d}_n\|$ should decrease until it reaches the order of the measurement noise $\|\vec{e}\|$. However, at the same time, the robustness of the reconstruction decreases if $k$ is too large. Thus, we expect the norm of $\vec{x}_k^+$ to increase drastically if $k$ is chosen such that $\sigma_k$ is too small.

We often plot the norm $\|\vec{x}_k^+\|$ (on the horizontal axis) versus the norm $\|F \vec{x}_k^+ - \vec{d}_n\|$ (on the vertical axis) for different $k$ values. This leads to the so-called L-curve. Figure 8.5.6 shows the L-curve computed for the example used here. The corner (represented by the black dot) represents a reasonable $k$ value that balances the norm $\|F \vec{x}_k^+ - \vec{d}_n\|$ (fit to the data) and the norm $\|\vec{x}_k^+\|$ (reconstruction robustness).

Figure 8.5.6: Top left: the L-curve. Top right: the reconstructed image using $k = 840$, which is the corner on the L-curve. Bottom left: reconstruction robustness $\|\vec{x}_k^+\|$ versus the rank $k$. Bottom right: fit to the data $\|F \vec{x}_k^+ - \vec{d}_n\|$ versus the rank $k$.

Chapter 9 Krylov Subspace Methods for Eigenvalues

In this chapter we will consider solving eigenvalue problems for very large matrices $A \in \mathbb{R}^{n \times n}$. For example, in the X-ray imaging case, a 3-D image discretised into 100 intervals in each dimension (not even a very fine resolution image) has a million voxels to reconstruct, and we may use the same order of number of X-rays in the reconstruction. This leads to an eigenvalue problem with a million-dimensional matrix. In this scenario, it is no longer computationally feasible to directly apply eigenvalue solvers such as the QR algorithm that operate on the full matrix (with operation counts $O(n^3)$ in Phase 1 and $O(n^2)$ in each iteration of Phase 2).

Instead of solving the original eigenvalue problem in $\mathbb{R}^n$, we seek to project the original problem onto a lower dimensional subspace, the Krylov subspace, and then solve a reduced dimensional eigenvalue problem. In this chapter, we will discuss two algorithms for computing eigenvalues using the Krylov subspace, the Arnoldi method and the Lanczos method, which are designed for general square matrices and symmetric matrices, respectively.

9.1 The Arnoldi Method for Eigenvalue Problems

Objective

We recall that the CG method and the GMRES method for solving linear systems $A \vec{x} = \vec{b}$ minimise the residual $A \vec{x} - \vec{b}$ projected onto the Krylov subspace generated by the matrix $A$ and the vector $\vec{b}$:
\[ \mathcal{K}_{k+1}(\vec{b}, A) = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^k\vec{b}\}. \]
Given a general square matrix $A \in \mathbb{R}^{n \times n}$, the goal of the Arnoldi method is to construct a matrix $Q_{k+1}$ whose columns form an orthonormal basis of the Krylov subspace $\mathcal{K}_{k+1}(\vec{b}, A)$ for some $k > 0$, such that the projection of $A$ onto $\mathcal{K}_{k+1}(\vec{b}, A)$ with respect to the basis of columns of $Q_{k+1}$,
$$H_{k+1} = Q_{k+1}^{*} A Q_{k+1}, \qquad H_{k+1} \in \mathbb{R}^{(k+1) \times (k+1)},$$
is a Hessenberg matrix. Under certain technical conditions, the eigenvalues of the Hessenberg matrix $H_{k+1}$ (the so-called Arnoldi eigenvalue estimates) can be good approximations of the eigenvalues of $A$.

In the rest of this section, we present the Arnoldi procedure for constructing such a matrix $Q_{k+1}$ and some of its important properties for solving eigenvalue problems and linear systems.

Arnoldi Procedure

Recall that a complete reduction of $A \in \mathbb{R}^{n \times n}$ to Hessenberg form by a unitary similarity transformation can be written as $H = Q^{*} A Q$, or $AQ = QH$. In Phase 1 of the eigenvalue solvers we learned in Chapters 6 and 7, the matrix $Q$ is constructed by a sequence of $n - 2$ Householder reflections. For large $n$, it is not feasible to apply this process, which requires $O(n^3)$ operations. Instead, we focus only on the first $k + 1$ columns of $AQ = QH$.

Furthermore, recall that for computing the QR factorisation $A = QR$ we have discussed two methods: Householder reflections and (modified) Gram-Schmidt. While the former is more numerically stable, (modified) Gram-Schmidt has the advantage that it can be stopped part-way, leaving one with a reduced QR factorisation. Using the Arnoldi procedure to construct the first $k + 1$ columns of $AQ = QH$ is analogous.

Arnoldi generates an orthonormal basis for the Krylov space $\mathcal{K}_{k+1}(\vec{b}, A)$ by setting $\vec{q}_0 = \vec{b}/\|\vec{b}\|$ and applying modified Gram-Schmidt to orthogonalise the vectors $\{\vec{q}_0, A\vec{q}_0, A\vec{q}_1, \ldots, A\vec{q}_{k-1}\}$.

In every iteration, the Arnoldi method computes a vector $A\vec{q}_k$ and orthogonalises this vector against the previous vectors $\{\vec{q}_0, \vec{q}_1, \ldots, \vec{q}_k\}$ using the modified Gram-Schmidt process to generate a new vector $\vec{q}_{k+1}$. This amounts to subtracting from $A\vec{q}_k$ its components in the directions of the previous $\vec{q}_j$:
$$\vec{v}_{k+1} = A\vec{q}_k - h_{0,k}\vec{q}_0 - h_{1,k}\vec{q}_1 - \cdots - h_{k,k}\vec{q}_k,$$
where the projection coefficients are $h_{j,k} = \vec{q}_j^{*} A\vec{q}_k$ for $j = 0, \ldots, k$. The new orthonormal vector $\vec{q}_{k+1}$ is then obtained by normalising $\vec{v}_{k+1}$:
$$\vec{q}_{k+1} = \vec{v}_{k+1}/h_{k+1,k}, \qquad h_{k+1,k} = \|\vec{v}_{k+1}\|.$$
So the basis vectors $\{\vec{q}_0, \vec{q}_1, \ldots, \vec{q}_k, \vec{q}_{k+1}\}$ satisfy
$$h_{k+1,k}\vec{q}_{k+1} = A\vec{q}_k - h_{0,k}\vec{q}_0 - h_{1,k}\vec{q}_1 - \cdots - h_{k,k}\vec{q}_k,$$
or
$$A\vec{q}_k = h_{0,k}\vec{q}_0 + h_{1,k}\vec{q}_1 + \cdots + h_{k,k}\vec{q}_k + h_{k+1,k}\vec{q}_{k+1}.$$
This procedure to generate an orthonormal basis of the Krylov space is called the Arnoldi procedure. It can easily be shown that the resulting set of Arnoldi vectors, $\{\vec{q}_0, \vec{q}_1, \ldots, \vec{q}_k\}$, is a basis for $\mathcal{K}_{k+1}(\vec{b}, A) = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^k\vec{b}\}$.

Theorem 9.1
Let $\{\vec{q}_0, \ldots, \vec{q}_k\}$ be the vectors generated by the Arnoldi procedure. Then
$$\mathrm{span}\{\vec{q}_0, \ldots, \vec{q}_k\} = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^k\vec{b}\}.$$

Proof. This can be shown by induction. The case $k = 0$ is trivial. For $k \geq 0$, suppose that
$$\mathrm{span}\{\vec{q}_0, \ldots, \vec{q}_k\} = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^k\vec{b}\}$$
holds. Then $\vec{q}_k \in \mathcal{K}_{k+1}$, so $A\vec{q}_k \in \mathcal{K}_{k+2}$. Given the relationship between $A\vec{q}_k$ and the Arnoldi vectors,
$$A\vec{q}_k = h_{0,k}\vec{q}_0 + h_{1,k}\vec{q}_1 + \cdots + h_{k,k}\vec{q}_k + h_{k+1,k}\vec{q}_{k+1},$$
the vector $\vec{q}_{k+1}$ is a linear combination of $A\vec{q}_k$ and $\{\vec{q}_0, \ldots, \vec{q}_k\}$, and hence lies in $\mathcal{K}_{k+2}$. Since the $k + 2$ vectors $\{\vec{q}_0, \ldots, \vec{q}_{k+1}\}$ are orthonormal and $\mathcal{K}_{k+2}$ has dimension at most $k + 2$, we have
$$\mathrm{span}\{\vec{q}_0, \ldots, \vec{q}_k, \vec{q}_{k+1}\} = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^k\vec{b}, A^{k+1}\vec{b}\}.$$

The Arnoldi procedure is given by:

Algorithm 9.2: Arnoldi Procedure for an Orthonormal Basis of $\mathcal{K}_{k+1}(\vec{b}_0, A)$
Input: matrix $A \in \mathbb{R}^{n \times n}$; vector $\vec{b}_0$
Output: vectors $\vec{q}_0, \ldots, \vec{q}_k$ that form an orthonormal basis of $\mathcal{K}_{k+1}(\vec{b}_0, A)$
1: $\vec{q}_0 = \vec{b}_0/\|\vec{b}_0\|$
2: for $i = 0 : (k - 1)$ do
3:     $\vec{v} = A\vec{q}_i$
4:     for $j = 0 : i$ do
5:         $h_{j,i} = \vec{q}_j^{*}\vec{v}$
6:         $\vec{v} = \vec{v} - h_{j,i}\vec{q}_j$
7:     end for
8:     $h_{i+1,i} = \|\vec{v}\|$
9:     if $h_{i+1,i} < \mathrm{tol}$ then
10:        Stop
11:    end if
12:    $\vec{q}_{i+1} = \vec{v}/h_{i+1,i}$
13: end for

We can express the vectors and coefficients computed during the Arnoldi procedure in matrix form:
$$A \underbrace{\begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_k \end{bmatrix}}_{n\text{-by-}(k+1)} = \underbrace{\begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_k & \vec{q}_{k+1} \end{bmatrix}}_{n\text{-by-}(k+2)} \underbrace{\begin{bmatrix} h_{0,0} & h_{0,1} & \cdots & h_{0,k-1} & h_{0,k} \\ h_{1,0} & h_{1,1} & \cdots & h_{1,k-1} & h_{1,k} \\ & h_{2,1} & \ddots & \vdots & \vdots \\ & & \ddots & h_{k-1,k-1} & h_{k-1,k} \\ & & & h_{k,k-1} & h_{k,k} \\ & & & & h_{k+1,k} \end{bmatrix}}_{(k+2)\text{-by-}(k+1)},$$
or $AQ_{k+1} = Q_{k+2}\tilde{H}_{k+1}$.

Projection onto Krylov Subspaces

We can partition the matrix $\tilde{H}_{k+1}$ as
$$\tilde{H}_{k+1} = \begin{bmatrix} H_{k+1} \\ h_{k+1,k}\,\vec{e}_{k+1}^{\top} \end{bmatrix},$$
where $H_{k+1}$ is a $(k+1)$-by-$(k+1)$ square Hessenberg matrix consisting of the first $k + 1$ rows of $\tilde{H}_{k+1}$, and $\vec{e}_{k+1}$ is the last column of the $(k+1)$-by-$(k+1)$ identity matrix. Note that the product
$$Q_{k+1}^{*} Q_{k+2} = \begin{bmatrix} I & \vec{0} \end{bmatrix}$$
is a $(k+1)$-by-$(k+2)$ identity matrix, i.e., a matrix with 1 on its main diagonal and zero elsewhere. Then, we have
$$Q_{k+1}^{*} A Q_{k+1} = Q_{k+1}^{*} Q_{k+2} \tilde{H}_{k+1} = H_{k+1}.$$
The matrix $H_{k+1}$ can be interpreted as the representation, in the basis of columns of $Q_{k+1}$, of the matrix $A$ projected onto the Krylov subspace $\mathcal{K}_{k+1}$.

Since the Hessenberg matrix $H_{k+1}$ is a projection of $A$, one might imagine that the eigenvalues of $H_{k+1}$ are related to the eigenvalues of $A$. In fact, under certain conditions, the eigenvalues of $H_{k+1}$ (the so-called Arnoldi eigenvalue estimates) can be very accurate approximations of the eigenvalues of $A$. This will be shown in later sections.

Breakdown

Note also that, when $k + 1 = n$, the process terminates with $h_{n,n-1} = \|\vec{v}_n\| = 0$, because there cannot be more than $n$ orthogonal vectors in $\mathbb{R}^n$. At this point, we obtain $AQ = QH$, where $Q \in \mathbb{R}^{n \times n}$ is orthogonal and $H \in \mathbb{R}^{n \times n}$ is a square Hessenberg matrix with zeros below the first subdiagonal:
$$H = \begin{bmatrix} h_{0,0} & h_{0,1} & \cdots & h_{0,n-2} & h_{0,n-1} \\ h_{1,0} & h_{1,1} & \cdots & h_{1,n-2} & h_{1,n-1} \\ & h_{2,1} & \ddots & \vdots & \vdots \\ & & \ddots & h_{n-2,n-2} & h_{n-2,n-1} \\ & & & h_{n-1,n-2} & h_{n-1,n-1} \end{bmatrix}.$$

In practice, the Arnoldi procedure is terminated if the value $h_{k+1,k} = \|\vec{v}_{k+1}\|$ is close to zero, say, below a certain threshold (Lines 9-11 in Algorithm 9.2). This is called a breakdown of the Arnoldi procedure. Very often, and hopefully, the breakdown occurs before $k + 1 = n$. A breakdown means that exact eigenvalues of $A$ (up to some numerical error) can be obtained from the matrix $H_{k+1}$, and that the exact solution of the linear system $A\vec{x} = \vec{b}$ can be obtained.

Remark 9.3
Once a breakdown occurs, we have $h_{k+1,k} = 0$, and then
$$\tilde{H}_{k+1} = \begin{bmatrix} H_{k+1} \\ \vec{0}^{\top} \end{bmatrix}.$$
It then follows that
$$AQ_{k+1} = Q_{k+2}\tilde{H}_{k+1} = \begin{bmatrix} Q_{k+1} \mid \vec{q}_{k+1} \end{bmatrix} \begin{bmatrix} H_{k+1} \\ \vec{0}^{\top} \end{bmatrix} = Q_{k+1}H_{k+1}.$$

Remark 9.4
Suppose that $Q_{k+1}$ consists of the first $k + 1$ columns of a unitary matrix $Q$ that reduces the matrix $A$ to Hessenberg form, i.e., $AQ = QH$ or $Q^{*}AQ = H$. In this case the matrix $H$ for the full Hessenberg reduction has the structure
$$H = \begin{bmatrix} H_{k+1} & H_{12} \\ 0 & H_{22} \end{bmatrix},$$
where $H_{12}$ is a potentially full $(k+1) \times (n-k-1)$ matrix and $H_{22}$ is an $(n-k-1) \times (n-k-1)$ upper Hessenberg matrix. Thus, $H$ is block upper triangular, and the union of the eigenvalues of $H_{k+1}$ and the eigenvalues of $H_{22}$ gives the eigenvalues of $A$.

Remark 9.5
It is easy to verify that, after a breakdown, if $H_{k+1}$ has an eigenvalue $\lambda$ with an eigenvector $\vec{v}$, then $\lambda$ is an eigenvalue of $A$ and $A$ has a corresponding eigenvector $Q_{k+1}\vec{v}$.

Proof. Let $\lambda$ be an eigenvalue of $H_{k+1}$ with corresponding eigenvector $\vec{v}$, i.e., $H_{k+1}\vec{v} = \lambda\vec{v}$. Let $\vec{y} = Q_{k+1}\vec{v}$; then
$$A\vec{y} = AQ_{k+1}\vec{v} = Q_{k+1}H_{k+1}\vec{v},$$
by Remark 9.3. Since $H_{k+1}\vec{v} = \lambda\vec{v}$, we have
$$A\vec{y} = \lambda Q_{k+1}\vec{v} = \lambda\vec{y}.$$
Since $\vec{v} \neq 0$, and since the columns of $Q_{k+1}$ are linearly independent, it follows that $\vec{y} \neq 0$, and hence $\lambda$ is an eigenvalue of $A$ with eigenvector $\vec{y}$.

Theorem 9.6
Once a breakdown occurs at an iteration $k$, the Krylov subspace
$$\mathcal{K}_{k+1}(\vec{b}, A) = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^k\vec{b}\}$$
is an invariant subspace of $A$, i.e., $A\mathcal{K}_{k+1} \subseteq \mathcal{K}_{k+1}$.

Proof.
Let $\vec{y}$ be an arbitrary vector in $A\mathcal{K}_{k+1}$; then there exists a vector $\vec{z} \in \mathcal{K}_{k+1}$ such that $\vec{y} = A\vec{z}$. Since $\mathcal{K}_{k+1} = \mathrm{span}\{\vec{q}_0, \ldots, \vec{q}_k\}$, we can express $\vec{z}$ as a linear combination of $\{\vec{q}_0, \ldots, \vec{q}_k\}$ in the form $\vec{z} = Q_{k+1}\vec{w}$ for some $\vec{w} \in \mathbb{R}^{k+1}$. By Remark 9.3, it follows that
$$\vec{y} = AQ_{k+1}\vec{w} = Q_{k+1}H_{k+1}\vec{w}.$$
This implies that $\vec{y} \in \mathrm{span}\{\vec{q}_0, \ldots, \vec{q}_k\}$. Since $\vec{y}$ is arbitrary, it follows that $A\mathcal{K}_{k+1} \subseteq \mathcal{K}_{k+1}$.

Theorem 9.7
Once a breakdown occurs at an iteration $k$, the Krylov subspaces of $A$ generated by $\vec{b}$,
$$\mathcal{K}_{k+1}(\vec{b}, A) = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^k\vec{b}\},$$
have the following property:
$$\mathcal{K}_{k+1} = \mathcal{K}_{k+2} = \mathcal{K}_{k+3} = \cdots.$$

Proof. First, we have $\mathcal{K}_{k+1} \subseteq \mathcal{K}_{k+2}$ by the definition of the Krylov subspace. The Krylov subspace $\mathcal{K}_{k+2}$ is the sum of $\mathrm{span}\{\vec{q}_0\}$ and the subspace $A\mathcal{K}_{k+1}$, i.e.,
$$\mathcal{K}_{k+2} = \mathrm{span}\{\vec{q}_0\} + A\mathcal{K}_{k+1}.$$
After the breakdown, we have $A\mathcal{K}_{k+1} \subseteq \mathcal{K}_{k+1}$ by Theorem 9.6. Since $\mathrm{span}\{\vec{q}_0\} \subseteq \mathcal{K}_{k+1}$ by definition, we have $\mathcal{K}_{k+2} \subseteq \mathcal{K}_{k+1}$. Thus, $\mathcal{K}_{k+1} = \mathcal{K}_{k+2}$. Repeating the same argument for $\mathcal{K}_{k+2}, \mathcal{K}_{k+3}, \ldots$ proves the theorem by induction.

Theorem 9.8
Suppose that the matrix $A$ is nonsingular. Once a breakdown occurs, the solution to the linear system $A\vec{x} = \vec{b}$ lies in $\mathcal{K}_{k+1}$.

Proof. If $A$ is nonsingular, then by Remark 9.4, zero cannot be an eigenvalue of $H_{k+1}$. Therefore, $H_{k+1}$ is an invertible matrix and, by Remark 9.3,
$$AQ_{k+1}H_{k+1}^{-1} = Q_{k+1}.$$
Since $\vec{b} = Q_{k+1}\left(\vec{e}_1\|\vec{b}\|\right)$, it follows that
$$AQ_{k+1}H_{k+1}^{-1}\left(\vec{e}_1\|\vec{b}\|\right) = Q_{k+1}\left(\vec{e}_1\|\vec{b}\|\right) = \vec{b}.$$
Multiplying both sides on the left by $A^{-1}$ we obtain
$$Q_{k+1}H_{k+1}^{-1}\left(\vec{e}_1\|\vec{b}\|\right) = A^{-1}\vec{b} = \vec{x}.$$
Thus, we have $\vec{x} \in \mathcal{K}_{k+1}$.

9.2 Lanczos Method for Eigenvalue Problems

The Lanczos method is the Arnoldi method specialised to the case where the matrix $A$ is symmetric. If $A = A^{T}$, then the Hessenberg matrix obtained by the Arnoldi process (for the case $k + 1 = n$) satisfies
$$H^{T} = (Q^{T}AQ)^{T} = Q^{T}A^{T}Q = Q^{T}AQ = H,$$
so $H$ is symmetric. A symmetric Hessenberg matrix is tridiagonal. Therefore, the Arnoldi update formula simplifies from
$$h_{k+1,k}\vec{q}_{k+1} = A\vec{q}_k - h_{0,k}\vec{q}_0 - h_{1,k}\vec{q}_1 - \cdots - h_{k,k}\vec{q}_k$$
to a three-term recursion relation
$$h_{k+1,k}\vec{q}_{k+1} = A\vec{q}_k - h_{k-1,k}\vec{q}_{k-1} - h_{k,k}\vec{q}_k,$$
with
$$A \underbrace{\begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_k \end{bmatrix}}_{n\text{-by-}(k+1)} = \underbrace{\begin{bmatrix} \vec{q}_0 & \vec{q}_1 & \cdots & \vec{q}_k & \vec{q}_{k+1} \end{bmatrix}}_{n\text{-by-}(k+2)} \underbrace{\begin{bmatrix} h_{0,0} & h_{0,1} & & & \\ h_{1,0} & h_{1,1} & h_{1,2} & & \\ & h_{2,1} & h_{2,2} & \ddots & \\ & & \ddots & \ddots & h_{k-1,k} \\ & & & h_{k,k-1} & h_{k,k} \\ & & & & h_{k+1,k} \end{bmatrix}}_{(k+2)\text{-by-}(k+1)}.$$
Taking the symmetry into account further, and using the notation $\tilde{T}$ instead of $\tilde{H}$ to denote a tridiagonal matrix, we have
$$\tilde{T}_{k+1} = \begin{bmatrix} \alpha_0 & \beta_0 & & & \\ \beta_0 & \alpha_1 & \beta_1 & & \\ & \beta_1 & \alpha_2 & \ddots & \\ & & \ddots & \ddots & \beta_{k-1} \\ & & & \beta_{k-1} & \alpha_k \\ & & & & \beta_k \end{bmatrix}.$$
In matrix form, we have
$$AQ_{k+1} = Q_{k+2}\tilde{T}_{k+1}.$$
The Arnoldi procedure for computing the orthonormal basis $\{\vec{q}_0, \ldots, \vec{q}_k\}$ of the Krylov space thus simplifies to the recurrence
$$A\vec{q}_k = \beta_{k-1}\vec{q}_{k-1} + \alpha_k\vec{q}_k + \beta_k\vec{q}_{k+1},$$
where $\alpha_k = \vec{q}_k^{*}A\vec{q}_k$, $\beta_{k-1} = \vec{q}_{k-1}^{*}A\vec{q}_k$, and $\beta_k$ is the 2-norm of $A\vec{q}_k - \beta_{k-1}\vec{q}_{k-1} - \alpha_k\vec{q}_k$. Note that $\beta_{k-1}$ was obtained in the previous iteration. This procedure is called the Lanczos procedure. It can be shown that the Lanczos procedure is related to the CG algorithm (just like Arnoldi is used by GMRES). Properties of the Arnoldi procedure, for example, Remark 9.3 to Theorem 9.7, still hold for the Lanczos procedure.

The Lanczos procedure is given by:

Algorithm 9.9: Lanczos Procedure for an Orthonormal Basis of $\mathcal{K}_{k+1}(\vec{b}_0, A)$
Input: a symmetric matrix $A \in \mathbb{R}^{n \times n}$; vector $\vec{b}_0$
Output: vectors $\vec{q}_0, \ldots, \vec{q}_k$ that form an orthonormal basis of $\mathcal{K}_{k+1}(\vec{b}_0, A)$
1: $\beta_{-1} = 0$, $\vec{q}_{-1} = 0$, $\vec{q}_0 = \vec{b}_0/\|\vec{b}_0\|$
2: for $i = 0 : (k - 1)$ do
3:     $\vec{v} = A\vec{q}_i$
4:     $\alpha_i = \vec{q}_i^{*}\vec{v}$
5:     $\vec{v} = \vec{v} - \alpha_i\vec{q}_i - \beta_{i-1}\vec{q}_{i-1}$
6:     $\beta_i = \|\vec{v}\|$
7:     if $\beta_i < \mathrm{tol}$ then
8:         Stop
9:     end if
10:    $\vec{q}_{i+1} = \vec{v}/\beta_i$
11: end for

Remark 9.10
Note that each iteration of the Lanczos procedure only operates with three vectors, as opposed to the $i + 1$ vectors used in the Arnoldi iteration.

9.3 How Arnoldi/Lanczos Locates Eigenvalues

The eigenvalues of a matrix are defined as the roots of its characteristic polynomial.
The Arnoldi/Lanczos procedure implicitly constructs a sequence of polynomials that approximate the characteristic polynomial of a matrix.

The use of the Arnoldi/Lanczos procedure for computing eigenvalues proceeds as follows. For a matrix $A \in \mathbb{R}^{n \times n}$, after $k$ Arnoldi/Lanczos iterations, the eigenvalues and eigenvectors of the resulting Hessenberg matrix $H_{k+1}$ are computed by standard eigenvalue solvers such as the shifted QR algorithm. These are the Arnoldi estimates of the eigenvalues.

For a large matrix, we can often only perform $k \ll n$ iterations of the Arnoldi/Lanczos procedure. In this case, we can obtain estimates of at most $k + 1$ eigenvalues. Some of these eigenvalue estimates converge faster to an eigenvalue of $A$ and some converge more slowly. Typically, estimates of the "extreme" eigenvalues converge faster, that is, eigenvalues near the edge of the spectrum, or eigenvalues that have a big gap to adjacent eigenvalues. Here we want to illustrate the idea behind the Arnoldi/Lanczos procedure and why it tends to find those extreme eigenvalues.

Arnoldi and Polynomial Approximation

Let $\vec{x}$ be a vector in the Krylov subspace $\mathcal{K}_k(\vec{b}, A)$ generated by the matrix $A$ and the vector $\vec{b}$:
$$\mathcal{K}_k(\vec{b}, A) = \mathrm{span}\{\vec{b}, A\vec{b}, A^2\vec{b}, \ldots, A^{k-1}\vec{b}\}.$$
Such an $\vec{x}$ can be expressed as a linear combination of powers of $A$ times $\vec{b}$,
$$\vec{x} = c_0\vec{b} + c_1A\vec{b} + \cdots + c_{k-1}A^{k-1}\vec{b} = \sum_{j=0}^{k-1} c_jA^{j}\vec{b}.$$
This expression can also be written as a polynomial of $A$ multiplied by $\vec{b}$. If $p(z)$ is the polynomial $c_0 + c_1z + \cdots + c_{k-1}z^{k-1}$, then the corresponding matrix polynomial of $A$ is
$$p(A) = c_0I + c_1A + \cdots + c_{k-1}A^{k-1} = \sum_{j=0}^{k-1} c_jA^{j}.$$
This way, we have $\vec{x} = p(A)\vec{b}$. Krylov subspace methods can be analysed in terms of matrix polynomials.

Definition 9.11
A monic polynomial of degree $k$ is a polynomial of the form
$$p^k(z) = c_0 + c_1z + \cdots + c_{k-1}z^{k-1} + z^k.$$
That is, the coefficient associated with degree $k$ is 1.

Remark 9.12
The characteristic polynomial of a matrix $A$, $p_A(\lambda)$, is a monic polynomial when written as $\det(\lambda I - A)$; this differs from $\det(A - \lambda I)$ in Definition A.17 only by the factor $(-1)^n$, which does not change the roots.
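The use of the Arnoldi procedure for eigenvalue estimation described at the start of this section can be sketched in Python as follows. This is a minimal illustration of Algorithm 9.2 for real matrices, not a production implementation: the function name `arnoldi`, its return convention (the square matrix $H = Q^{T}AQ$ together with $Q$), the tolerance, and both test matrices are assumptions made for this example.

```python
import numpy as np

def arnoldi(A, b, m, tol=1e-12):
    """Run up to m Arnoldi steps with modified Gram-Schmidt (real A only).

    Returns (Q, H): the columns of Q are orthonormal Arnoldi vectors and
    H = Q^T A Q is square upper Hessenberg. On breakdown, fewer than m
    columns are returned.
    """
    n = b.shape[0]
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = b / np.linalg.norm(b)
    for i in range(m):
        v = A @ Q[:, i]
        for j in range(i + 1):              # orthogonalise against q_0..q_i
            H[j, i] = Q[:, j] @ v
            v = v - H[j, i] * Q[:, j]
        H[i + 1, i] = np.linalg.norm(v)
        if H[i + 1, i] < tol:               # breakdown: invariant subspace found
            return Q[:, : i + 1], H[: i + 1, : i + 1]
        Q[:, i + 1] = v / H[i + 1, i]
    return Q[:, :m], H[:m, :m]

# Hypothetical symmetric test matrix: a cluster of eigenvalues in [1, 2]
# plus two well-separated outliers at 4 and 6.
eigs = np.concatenate((np.linspace(1.0, 2.0, 18), [4.0, 6.0]))
A = np.diag(eigs)
rng = np.random.default_rng(0)
b = rng.standard_normal(eigs.size)

Q, H = arnoldi(A, b, 8)
ritz = np.sort(np.linalg.eigvals(H).real)   # Arnoldi eigenvalue estimates
print(ritz)

# Breakdown example: b combines only two eigenvectors of diag(1, 2, 3, 4).
Qb, Hb = arnoldi(np.diag([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 1.0, 0.0, 0.0]), 4)
print(Hb.shape, np.sort(np.linalg.eigvals(Hb).real))
```

In the second call the starting vector lies in a two-dimensional invariant subspace, so the procedure stops after two steps and the eigenvalues of the 2-by-2 matrix `Hb` agree with the eigenvalues 1 and 2 of the test matrix up to rounding. In the first call, one would expect the largest estimates in `ritz` to sit close to the outliers well before the estimates inside the cluster settle.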
Theorem 9.13
Consider the characteristic polynomial of a matrix $A$, $p_A(\lambda)$. The Cayley-Hamilton Theorem asserts that the matrix polynomial $p_A(A) = 0$.

Proof. This can be easily verified for the case where the matrix $A$ has an eigendecomposition $A = V\Lambda V^{-1}$: then $p_A(A) = V p_A(\Lambda) V^{-1}$, and $p_A(\Lambda)$ is a diagonal matrix with entries $p_A(\lambda_i) = 0$. We omit the general proof here.

Remark 9.14
The Arnoldi/Lanczos procedure finds a monic polynomial $p^k(\cdot)$ such that
$$\|p^k(A)\vec{b}\| \text{ is minimised.} \qquad (9.1)$$
Once a breakdown occurs, it is not hard to show that the Arnoldi procedure obtains a monic polynomial such that $\|p^k(A)\vec{b}\| = 0$. Here we want to look into this problem before a breakdown.

Theorem 9.15
As long as the Arnoldi procedure does not break down (i.e., the Krylov subspace $\mathcal{K}_k(\vec{b}, A)$ has dimension $k$), the characteristic polynomial of $H_k$ defines the polynomial solving the problem (9.1).

Proof. We first note that if $p^k$ is a monic polynomial, then the vector $p^k(A)\vec{b}$ can be written as
$$p^k(A)\vec{b} = \underbrace{\sum_{j=0}^{k-1} c_jA^{j}\vec{b}}_{= -Q_k\vec{y} \,\in\, \mathcal{K}_k(\vec{b}, A)} + A^k\vec{b} = A^k\vec{b} - Q_k\vec{y},$$
for some $\vec{y} \in \mathbb{R}^k$. Since $Q_k$ has full rank (rank $k$), the problem (9.1) becomes a least-squares problem of finding $\vec{y}$ such that
$$\|A^k\vec{b} - Q_k\vec{y}\| \text{ is minimised.}$$
The solution satisfies the normal equations $Q_k^{*}(A^k\vec{b} - Q_k\vec{y}) = 0$, or equivalently
$$Q_k^{*}p^k(A)\vec{b} = 0.$$
Now the problem boils down to finding the monic polynomial that solves the above equation.

Consider the unitary matrix
$$Q = \begin{bmatrix} Q_k & U \end{bmatrix},$$
where $Q_k = [\vec{q}_0 | \cdots | \vec{q}_{k-1}]$ is the matrix consisting of the Arnoldi vectors, the first column of $U$ is the next Arnoldi vector $\vec{q}_k$, and the other columns of $U$ are orthonormal vectors completing the basis. This way, we have the unitary similarity transformation of the matrix $A$, which takes the form
$$Q^{*}AQ = \begin{bmatrix} Q_k^{*}AQ_k & Q_k^{*}AU \\ U^{*}AQ_k & U^{*}AU \end{bmatrix}.$$
Since $AQ_k = Q_{k+1}\tilde{H}_k$, we have $Q_k^{*}AQ_k = H_k$, and $X_1 = U^{*}AQ_k = U^{*}Q_{k+1}\tilde{H}_k$ is a matrix of dimension $(n-k)$-by-$k$ with all entries but the upper-right one equal to 0. Letting $X_2 = Q_k^{*}AU$ and $X_3 = U^{*}AU$, we have
$$Q^{*}AQ = \begin{bmatrix} H_k & X_2 \\ X_1 & X_3 \end{bmatrix} = H,$$
which is block Hessenberg.
Since $A = QHQ^{*}$, we can show that $p^k(A) = Qp^k(H)Q^{*}$. Thus, we have
$$Q_k^{*}p^k(A)\vec{b} = Q_k^{*}Qp^k(H)Q^{*}\vec{b}.$$
Given $\vec{b} = Q\vec{e}_1\|\vec{b}\|$ and $Q_k^{*}Q = \begin{bmatrix} I_k & 0 \end{bmatrix}$, the above equation can be written as
$$Q_k^{*}p^k(A)\vec{b} = \begin{bmatrix} I_k & 0 \end{bmatrix} p^k(H)\vec{e}_1\|\vec{b}\|,$$
which is, up to the factor $\|\vec{b}\|$, the first $k$ entries of the first column of $p^k(H)$.

Because of the block Hessenberg structure of $H$, the first $k$ entries of the first column of $p^k(H)$ are given by the first column of $p^k(H_k)$. If $p^k(\cdot)$ is the characteristic polynomial of $H_k$, i.e., $p^k(\lambda) = p_{H_k}(\lambda)$, then by the Cayley-Hamilton Theorem the matrix polynomial $p^k(H_k)$ equals 0. This way, the characteristic polynomial of $H_k$ defines a polynomial solving the problem (9.1).

How Arnoldi/Lanczos Locates Eigenvalues

By projecting the matrix $A$ onto the Krylov subspace $\mathcal{K}_k(\vec{b}, A)$ represented by $Q_k$, we obtain a matrix $H_k$. The characteristic polynomial of $H_k$ effectively solves a polynomial approximation problem or, equivalently, a least-squares problem involving the Krylov subspace.

What does the characteristic polynomial of $H_k$ have to do with the eigenvalues of $A$ or, equivalently, the characteristic polynomial of $A$? There is a connection between these. If a polynomial $p^k(\cdot)$ has the property that $p^k(A)$ is small, then effectively the roots of $p^k(\cdot)$ are close to the roots of $p_A(\cdot)$.

Remark 9.16
Suppose $A$ is diagonalisable. We can express the vector $\vec{b}$ as a linear combination of the eigenvectors $\vec{v}_1, \vec{v}_2, \ldots$ with coefficients $a_1, a_2, \ldots$, in the form
$$\vec{b} = \sum_{i=1}^{n} a_i\vec{v}_i.$$
Since
$$p(A)\vec{v}_i = \sum_{j=0}^{k} c_jA^{j}\vec{v}_i = \sum_{j=0}^{k} c_j\lambda_i^{j}\vec{v}_i = p(\lambda_i)\vec{v}_i$$
for a polynomial $p$ of degree $k$ with coefficients $c_j$, the vector $p(A)\vec{b}$ can be written as
$$p(A)\vec{b} = \sum_{i=1}^{n} a_ip(\lambda_i)\vec{v}_i.$$
Thus, minimising $\|p(A)\vec{b}\|$ amounts to making the values $p(\lambda_i)$ small, weighted by the coefficients $a_i$, and the eigenvalue estimates obtained from the Arnoldi procedure depend on how well this can be achieved.

Remark 9.17
Suppose the vector $\vec{b}$ is a linear combination of a limited number of eigenvectors.
In that case, the Arnoldi procedure finds a monic polynomial such that $\|p(A)\vec{b}\| = 0$ as soon as the Krylov subspace stops growing. This subspace is exactly the Krylov subspace at the breakdown or, equivalently, the subspace spanned by all the eigenvectors used for constructing $\vec{b}$.

Example

In general, the shape of the characteristic polynomial is dominated by "extreme" eigenvalues. Here we illustrate this idea using the following example. Let $A$ be the 19-dimensional matrix
$$A = \mathrm{diag}([0.1, 0.5, 0.6, 0.7, \ldots, 1.9, 2.0, 2.5, 3.0]).$$
The spectrum of $A$ consists of a dense collection of eigenvalues in the interval $[0.5, 2.0]$ and some outliers, 0.1, 2.5, and 3.0, as shown below. The crosses are the eigenvalues and the blue line is the characteristic polynomial.

We carry out the Lanczos procedure with a random starting vector $\vec{b}_0$. Figure 9.3.1 plots the monic polynomials obtained in selected iterations of the Lanczos procedure and their roots. We can observe that the outlier eigenvalues are identified first, followed by the eigenvalues on the edge of the interval $[0.5, 2.0]$. The eigenvalues in the middle of the cluster are identified last. In summary, eigenvalue estimates in regions where the characteristic polynomial changes rapidly converge faster than those in regions where the characteristic polynomial is flat.

Figure 9.3.1: Estimated monic polynomials obtained by the Lanczos procedure.

Chapter 10 Other Eigenvalue Solvers

So far, all the eigenvalue solvers we have learned involve some polynomial of a matrix $A$. For example, the power iteration, the inverse iteration, and the more advanced QR algorithms raise the matrix to some power, and Krylov subspace methods implicitly construct a complicated matrix polynomial. There is more to the computation of eigenvalues than matrix polynomials. Here we introduce some alternatives for computing eigenvalues.
10.1 Jacobi Method

One of the oldest ideas for computing eigenvalues of a matrix is the Jacobi method, introduced by Jacobi in 1845. Consider a symmetric matrix of dimension 5 or larger. We know that we have to apply an iterative method to approximate the eigenvalues. We also know that a real symmetric matrix $A$ has an eigendecomposition
$$A = Q\Lambda Q^{\top},$$
where $Q$ is orthogonal and $\Lambda$ is diagonal. The question is: can we create a sequence of orthogonal similarity transformations such that each transformation brings the matrix to a "more diagonal" form? If so, the sequence of transformations will eventually produce a diagonal matrix.

The Jacobi method uses a sequence of 2-by-2 rotation matrices, called Jacobi rotations, which are chosen to eliminate off-diagonal elements while preserving the eigenvalues. Whilst successive rotations undo previously introduced zeros, the off-diagonal elements get smaller until eventually we are left with a diagonal matrix. By accumulating the products of the transformations as we proceed, we obtain the eigenvectors of the matrix.

Consider a 2-by-2 symmetric matrix
$$A = \begin{bmatrix} a & d \\ d & b \end{bmatrix}.$$
We aim to find a rotation matrix $J$ such that
$$J^{\top}AJ = \begin{bmatrix} \neq 0 & 0 \\ 0 & \neq 0 \end{bmatrix}.$$

Definition 10.1
A 2-by-2 rotation matrix is an orthogonal matrix
$$J = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix},$$
for some $\theta$.

It can be shown that for
$$\theta = \frac{1}{2}\tan^{-1}\left(\frac{2d}{b - a}\right),$$
the resulting rotation matrix $J$ diagonalises the 2-by-2 matrix $A$.

For a larger matrix $A \in \mathbb{R}^{n \times n}$ where $n > 4$, we cannot directly diagonalise the matrix. However, we can diagonalise a 2-by-2 submatrix each time using the Jacobi rotation above. Suppose we want to rotate the submatrix formed by rows and columns $p$ and $q$ of $A$, $A(p,q : p,q)$. We first create a 2-by-2 Jacobi rotation matrix based on the angle $\theta$ evaluated using the submatrix $A(p,q : p,q)$. Then, we embed this Jacobi matrix in an $n$-dimensional identity matrix,
$$Q_{p,q,\theta} = \begin{bmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & c & \cdots & s & & \\ & & \vdots & \ddots & \vdots & & \\ & & -s & \cdots & c & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{bmatrix},$$
where all diagonal elements are 1 apart from the two elements $c$ in rows $p$ and $q$, and all off-diagonal elements are zero apart from the elements $s$ and $-s$ in rows and columns $p$ and $q$. Then the orthogonal similarity transformation
$$\tilde{A} = Q_{p,q,\theta}^{\top}AQ_{p,q,\theta}$$
modifies only the $p$-th and $q$-th rows and columns of the matrix $A$.

This orthogonal similarity transformation $\tilde{A} = Q_{p,q,\theta}^{\top}AQ_{p,q,\theta}$ has several important properties:

1. Eigenvalues are preserved as it is a similarity transformation.

2. The Frobenius norm is preserved as it is an orthogonal transformation.

3. By the same argument applied to the 2-by-2 submatrices $\tilde{A}(p,q : p,q)$ and $A(p,q : p,q)$, the sum of squares of their entries is preserved:
$$\tilde{A}_{pp}^2 + \tilde{A}_{qq}^2 + 2\tilde{A}_{pq}^2 = A_{pp}^2 + A_{qq}^2 + 2A_{pq}^2.$$
Thus, since $\tilde{A}_{pq} = \tilde{A}_{qp} = 0$, the $p$-th and $q$-th diagonal elements of $\tilde{A}$ and $A$ satisfy
$$\tilde{A}_{pp}^2 + \tilde{A}_{qq}^2 \geq A_{pp}^2 + A_{qq}^2.$$

The Frobenius norm is preserved by the orthogonal similarity transformation while the sum of squares of the diagonal entries does not decrease, so the transformed matrix is "more diagonal" than the previous one. If we repeatedly apply Jacobi rotations to a matrix, the matrix will eventually be diagonalised.

One benefit of the Jacobi method is that it usually has better accuracy than the QR algorithms. The Jacobi method is also very easy to parallelise, as we only need to modify two rows and two columns in each operation. However, matrix reductions such as tridiagonalisation cannot be used, as the Jacobi rotation can destroy the tridiagonal structure. In general, the Jacobi method is computationally less efficient than the QR algorithms using the tridiagonal reduction.

10.2 Divide-and-Conquer

The divide-and-conquer algorithm, based on a recursive subdivision of a symmetric tridiagonal eigenvalue problem into problems of smaller dimensions, represents the most important advance in eigenvalue computation since the 1960s.
For symmetric matrices, the divide-and-conquer algorithm outperforms the shifted QR algorithm, particularly when both eigenvalues and eigenvectors are desired, and it became the industrial standard in the late 1990s. Here we illustrate the idea behind this powerful method.

Suppose we have an $n$-by-$n$ symmetric tridiagonal matrix
$$T = \begin{bmatrix} a_1 & b_1 & & & \\ b_1 & a_2 & b_2 & & \\ & b_2 & a_3 & \ddots & \\ & & \ddots & \ddots & b_{n-1} \\ & & & b_{n-1} & a_n \end{bmatrix},$$
where all the entries on the subdiagonal and superdiagonal are nonzero, so that the eigenvalue problem cannot be deflated. The matrix $T$ can be split and partitioned into the following matrices:
$$T = \begin{bmatrix} T_1 & \beta\,\vec{e}_k\vec{e}_1^{\top} \\ \beta\,\vec{e}_1\vec{e}_k^{\top} & T_2 \end{bmatrix} = \begin{bmatrix} \hat{T}_1 & \\ & \hat{T}_2 \end{bmatrix} + \beta \begin{bmatrix} \vec{e}_k \\ \vec{e}_1 \end{bmatrix} \begin{bmatrix} \vec{e}_k \\ \vec{e}_1 \end{bmatrix}^{\top}.$$
Here $T_1 = T(1{:}k, 1{:}k)$ and $T_2 = T(k{+}1{:}n, k{+}1{:}n)$ are the upper-left principal submatrix and the lower-right principal submatrix of $T$, respectively, and $\beta = T(k{+}1, k) = T(k, k{+}1)$. The only difference between $\hat{T}_1$ and $T_1$ is that the lower-right entry of $T_1$ is replaced by $T_1(k, k) - \beta$. A similar modification is applied to $T_2$ to obtain $\hat{T}_2$: its upper-left entry is replaced by $T_2(1, 1) - \beta$.

Now we can write the tridiagonal matrix $T$ as the sum of a 2-by-2 block-diagonal matrix with tridiagonal blocks and a rank-one update. Since the eigenvalue problems for $\hat{T}_1$ and $\hat{T}_2$ can be solved separately, we can first find the eigenvalues and eigenvectors of the two matrices of reduced dimension, and then express the eigenvalues of $T$ as a function of the eigenvalues of $\hat{T}_1$ and $\hat{T}_2$ and the rank-one update.

Since the submatrices $\hat{T}_1$ and $\hat{T}_2$ are also symmetric and tridiagonal, we can recursively apply this procedure to divide the problem into eigenvalue problems of small matrices, to which we can apply either analytical formulas or other methods that are computationally efficient for small matrices.

The key step in this recursive process is to identify the eigendecomposition of $T$ given the eigendecompositions of $\hat{T}_1$ and $\hat{T}_2$. Suppose we have computed the eigendecompositions
$$\hat{T}_1 = Q_1D_1Q_1^{\top} \quad \text{and} \quad \hat{T}_2 = Q_2D_2Q_2^{\top}.$$
We can express the matrix $T$ as
$$T = \begin{bmatrix} \hat{T}_1 & \\ & \hat{T}_2 \end{bmatrix} + \beta\vec{y}\vec{y}^{\top},$$
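where $\vec{y} = [\vec{e}_k^{\top} \ \vec{e}_1^{\top}]^{\top}$ is a vector whose entries are all zero except for the value 1 in the $k$-th and $(k+1)$-th entries. Introducing the orthogonal matrix
$$Q = \begin{bmatrix} Q_1 & \\ & Q_2 \end{bmatrix},$$
the matrix $Q^{\top}TQ$ can be written as
$$Q^{\top}TQ = \begin{bmatrix} Q_1^{\top} & \\ & Q_2^{\top} \end{bmatrix} \left( \begin{bmatrix} \hat{T}_1 & \\ & \hat{T}_2 \end{bmatrix} + \beta \begin{bmatrix} \vec{e}_k \\ \vec{e}_1 \end{bmatrix} \begin{bmatrix} \vec{e}_k \\ \vec{e}_1 \end{bmatrix}^{\top} \right) \begin{bmatrix} Q_1 & \\ & Q_2 \end{bmatrix} = \underbrace{\begin{bmatrix} D_1 & \\ & D_2 \end{bmatrix}}_{D} + \beta \underbrace{\begin{bmatrix} \vec{z}_1 \\ \vec{z}_2 \end{bmatrix}}_{\vec{z}} \begin{bmatrix} \vec{z}_1 \\ \vec{z}_2 \end{bmatrix}^{\top},$$
where $\vec{z}_1 = Q_1^{\top}\vec{e}_k$ is the last row of $Q_1$ and $\vec{z}_2 = Q_2^{\top}\vec{e}_1$ is the first row of $Q_2$ (both written as column vectors). Now the problem is reduced to finding the eigenvalues and eigenvectors of
$$D + \beta\vec{z}\vec{z}^{\top},$$
which is a diagonal matrix plus a rank-one update.

Suppose all the entries of the vector $\vec{z}$ are nonzero; otherwise the eigenvalue problem can be deflated. Let $d_j = D(j, j)$ and $z_j = \vec{z}(j)$. The eigenvalues of this matrix are simply the roots of the rational function
$$f(\lambda) = 1 + \beta\sum_{j=1}^{n}\frac{z_j^2}{d_j - \lambda}.$$
With the $d_j$ sorted in increasing order, the roots of this function are contained in the intervals $(d_j, d_{j+1})$, as shown below, with one remaining root lying outside $[d_1, d_n]$.

The roots of this function can be rapidly identified using methods such as Newton's method, as we know exactly the interval in which each eigenvalue lies and the function $f(\lambda)$ is monotone in each interval.

The above assertion can be justified by considering an eigenvalue $\lambda$ and eigenvector $\vec{q}$ of $D + \beta\vec{z}\vec{z}^{\top}$, which satisfy
$$(D + \beta\vec{z}\vec{z}^{\top})\vec{q} = \lambda\vec{q},$$
which leads to
$$(D - \lambda I)\vec{q} + \beta\vec{z}(\vec{z}^{\top}\vec{q}) = 0.$$

Remark 10.2
Here $\vec{z}^{\top}\vec{q}$ cannot be zero. We can show this by contradiction. If $\vec{z}^{\top}\vec{q} = 0$, then we have $(D - \lambda I)\vec{q} = 0$, so the vector $\vec{q}$ is an eigenvector of $D$, which means $\vec{q}$ has only one nonzero element as $D$ is diagonal. But then $\vec{z}^{\top}\vec{q} \neq 0$, as all entries of $\vec{z}$ are nonzero, which is a contradiction.

Remark 10.3
We also note that $\lambda$ cannot be an eigenvalue of $D$. We can show this by contradiction. If $\lambda$ is an eigenvalue of $D$, then $D - \lambda I$ has a zero entry on the diagonal, as the eigenvalues of $D$ are its diagonal entries. This way, the vector $(D - \lambda I)\vec{q}$ has a zero entry. Since $\vec{z}^{\top}\vec{q} \neq 0$ and all entries of $\vec{z}$ are nonzero, all entries of the vector $\beta\vec{z}(\vec{z}^{\top}\vec{q})$ are nonzero.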
Thus, the vector $(D - \lambda I)\vec{q} + \beta\vec{z}(\vec{z}^{\top}\vec{q})$ must have a nonzero entry, which contradicts the assumption that $\lambda$ and $\vec{q}$ are an eigenvalue and eigenvector of $D + \beta\vec{z}\vec{z}^{\top}$.

Using Remark 10.3, we know that $D - \lambda I$ is invertible, so multiplying both sides of the above equation by $(D - \lambda I)^{-1}$ (on the left), we have
$$\vec{q} + \beta(D - \lambda I)^{-1}\vec{z}(\vec{z}^{\top}\vec{q}) = 0.$$
Then multiplying both sides of this equation by $\vec{z}^{\top}$ (on the left) leads to
$$\vec{z}^{\top}\vec{q} + \beta\vec{z}^{\top}(D - \lambda I)^{-1}\vec{z}(\vec{z}^{\top}\vec{q}) = (\vec{z}^{\top}\vec{q})\underbrace{\left(1 + \beta\vec{z}^{\top}(D - \lambda I)^{-1}\vec{z}\right)}_{f(\lambda)} = 0.$$
Since $\vec{z}^{\top}\vec{q} \neq 0$ by Remark 10.2, we have $f(\lambda) = 0$ for all eigenvalues.

Appendix A Appendices

A.1 Notation

A.1.1 Vectors and Matrices

• $\vec{x}$ is a column vector in $\mathbb{R}^n$; $x_i$ is the $i$th component of $\vec{x}$.
• We may also write $\vec{x} = (x_1, \ldots, x_n)^T$, where $\vec{x}^T = (x_1, \ldots, x_n)$ is a row vector.
• $A$ is a matrix in $\mathbb{R}^{m \times n}$. The element of $A$ in row $i$ and column $j$ is referred to by $a_{ij}$.
• The $j$th column of matrix $A$ is referred to by $\vec{a}_j$. So $A = [\vec{a}_1 | \ldots | \vec{a}_n]$.
• Sometimes we use the notation $(A)_{ij}$ for the element of $A$ in position $ij$. For example, we can say $(A^T)_{ij} = a_{ji}$. We can also use this to refer to a row of $A$ (as a row vector): the $i$th row of $A$ can be indicated by $(A)_{i*}$, where the $*$ means all columns $j$.

Something to remember: in these notes, all vectors $\vec{x}$ are column vectors.

A.1.2 Inner Products

We express the standard Euclidean inner product of vectors $\vec{x}, \vec{y} \in \mathbb{R}^n$ in one of the following equivalent ways:
$$\vec{x}^T\vec{y} = \langle\vec{x}, \vec{y}\rangle.$$
Of course, we also have
$$\vec{x}^T\vec{y} = \vec{y}^T\vec{x} = \langle\vec{x}, \vec{y}\rangle = \langle\vec{y}, \vec{x}\rangle.$$
Similarly,
$$\langle\vec{x}, A\vec{y}\rangle = \vec{x}^TA\vec{y} = (A^T\vec{x})^T\vec{y} = \langle A^T\vec{x}, \vec{y}\rangle,$$
since $(AB)^T = B^TA^T$.

A.1.3 Block Matrices

Example A.1: Matrix-Matrix Product in Block Form
Let $E \in \mathbb{R}^{5 \times 7}$ and $F \in \mathbb{R}^{7 \times 6}$. When performing the matrix product $EF$, we can divide $E$ and $F$ in blocks with compatible dimensions, and write the matrix-matrix product in block form.
A.2 Vector Norms

A.2.1 Vector Norms

Definition A.2: Norm on a Vector Space
Let $V$ be a vector space over $\mathbb{R}$. The function $\|\cdot\| : V \to \mathbb{R}$ is a norm on $V$ if $\forall\, \vec{x}, \vec{y} \in V$ and $\forall\, a \in \mathbb{R}$, the following hold:
1. $\|\vec{x}\| \geq 0$, and $\|\vec{x}\| = 0$ iff $\vec{x} = 0$
2. $\|a\vec{x}\| = |a|\|\vec{x}\|$
3. $\|\vec{x} + \vec{y}\| \leq \|\vec{x}\| + \|\vec{y}\|$

Definition A.3: p-Norms on $\mathbb{R}^n$
Let $\vec{x} \in \mathbb{R}^n$. We consider the following vector norms $\|\vec{x}\|_p$, for $p = 1, 2, \infty$:
$$\|\vec{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\vec{x}^T\vec{x}}, \qquad \|\vec{x}\|_1 = \sum_{i=1}^{n} |x_i|, \qquad \|\vec{x}\|_\infty = \max_{1 \leq i \leq n} |x_i|.$$

Theorem A.4: Cauchy-Schwarz Inequality
Let $\vec{x}, \vec{y} \in \mathbb{R}^n$. Then $|\vec{x}^T\vec{y}| \leq \|\vec{x}\|_2\|\vec{y}\|_2$.

A.2.2 A-Norm

The vector 2-norm is induced by the Euclidean inner product:
$$\|\vec{x}\|_2 = \sqrt{\vec{x}^T\vec{x}} = \sqrt{\langle\vec{x}, \vec{x}\rangle}.$$
More generally, if $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, it can be used to define an $A$-inner product, which induces the $A$-norm.

Definition A.5: A-Inner Product
Let $A \in \mathbb{R}^{n \times n}$ be symmetric positive definite, and $\vec{x}, \vec{y} \in \mathbb{R}^n$. Then
$$\langle\vec{x}, \vec{y}\rangle_A = \langle\vec{x}, A\vec{y}\rangle = \vec{x}^TA\vec{y}$$
is called the $A$-inner product of $\vec{x}$ and $\vec{y}$.

Definition A.6: A-Norm on $\mathbb{R}^n$
Let $A \in \mathbb{R}^{n \times n}$ be symmetric positive definite. Then
$$\|\vec{x}\|_A = \sqrt{\langle\vec{x}, \vec{x}\rangle_A} = \sqrt{\vec{x}^TA\vec{x}}$$
is a norm on $\mathbb{R}^n$, called the $A$-norm. Note that we recover the 2-norm for $A = I$.

A.3 Orthogonality

Definition A.7: Orthogonal Vectors
$\vec{x}, \vec{y} \in \mathbb{R}^n$ are orthogonal if $\vec{x}^T\vec{y} = 0$. We may also write $\vec{x}^T\vec{y} = 0$ as $\langle\vec{x}, \vec{y}\rangle = 0$.

Theorem A.8: Pythagorean Law
If $\vec{x}$ and $\vec{y}$ are orthogonal, then $\|\vec{x} + \vec{y}\|_2^2 = \|\vec{x}\|_2^2 + \|\vec{y}\|_2^2$.

Proof.
$$\|\vec{x} + \vec{y}\|_2^2 = \langle\vec{x} + \vec{y}, \vec{x} + \vec{y}\rangle = \langle\vec{x}, \vec{x}\rangle + 2\langle\vec{x}, \vec{y}\rangle + \langle\vec{y}, \vec{y}\rangle = \|\vec{x}\|_2^2 + \|\vec{y}\|_2^2.$$

Definition A.9: Orthogonal Matrices
$A \in \mathbb{R}^{n \times n}$ is called an orthogonal matrix if $A^TA = I$.

This means that the columns of $A$ are of length 1 and mutually orthogonal. (So the term 'orthogonal matrix' is really a misnomer; in a perfect world these matrices would be called 'orthonormal matrices'.) $A^TA = I$ implies that $\det(A)^2 = 1$, so $A^{-1}$ exists and $A^T = A^{-1}$. Also, then, $AA^T = I$, meaning that the rows of an orthogonal matrix are also orthogonal.
A.4 Matrix Rank and Fundamental Subspaces

Definition A.10: Range and Nullspace
Let $A \in \mathbb{R}^{m \times n}$. The range or column space of $A$ is defined as
$$\mathrm{range}(A) = \Big\{\vec{y} \in \mathbb{R}^m \,\Big|\, \vec{y} = A\vec{x} = \sum_{i=1}^{n} \vec{a}_ix_i \text{ for some } \vec{x} \in \mathbb{R}^n\Big\}.$$
The kernel or null space of $A$ is defined as
$$\mathrm{null}(A) = \{\vec{x} \in \mathbb{R}^n \mid A\vec{x} = 0\}.$$

Similarly, the row space of $A$ (the space spanned by the rows of $A$) is, in fact, the column space of $A^T$, i.e., $\mathrm{range}(A^T)$.

The rank $r$ of a matrix $A$ is the dimension of the column space:

Definition A.11: Rank
$$\mathrm{rank}(A) = \dim(\mathrm{range}(A))$$

This is the number of linearly independent columns of $A$. It can be shown that this equals the number of linearly independent rows of $A$, i.e., $r = \dim(\mathrm{range}(A)) = \dim(\mathrm{range}(A^T))$.

Theorem A.12: Dimensions of Fundamental Subspaces
Let $A \in \mathbb{R}^{m \times n}$. Then
1. $\dim(\mathrm{range}(A)) + \dim(\mathrm{null}(A^T)) = m$
2. $\dim(\mathrm{range}(A^T)) + \dim(\mathrm{null}(A)) = n$

Theorem A.13: Orthogonality of Fundamental Subspaces
$\mathrm{range}(A)$ and $\mathrm{null}(A^T)$ are orthogonal subspaces of $\mathbb{R}^m$.

Proof. If $\vec{y}_r \in \mathrm{range}(A)$, then $\vec{y}_r = A\vec{x}$ for some $\vec{x}$. If $\vec{y}_n \in \mathrm{null}(A^T)$, then $A^T\vec{y}_n = 0$. Then $\vec{y}_r^T\vec{y}_n = (A\vec{x})^T\vec{y}_n = \vec{x}^TA^T\vec{y}_n = 0$.

A.5 Matrix Determinants

Definition A.14
The determinant of a matrix $A \in \mathbb{R}^{n \times n}$ is given by
$$\det(A) = \sum_{j=1}^{n} (-1)^{i+j}a_{ij}\det(A_{ij}), \quad \text{for fixed } i,$$
with
$$A_{ij} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1(j-1)} & a_{1(j+1)} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2(j-1)} & a_{2(j+1)} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ a_{(i-1)1} & a_{(i-1)2} & \cdots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \cdots & a_{(i-1)n} \\ a_{(i+1)1} & a_{(i+1)2} & \cdots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \cdots & a_{(i+1)n} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{n(j-1)} & a_{n(j+1)} & \cdots & a_{nn} \end{bmatrix},$$
i.e., the matrix $A_{ij}$ is an $(n-1) \times (n-1)$ matrix obtained by removing row $i$ and column $j$ from the original matrix $A$.

Theorem A.15
If $A \in \mathbb{R}^{n \times n}$ is a triangular matrix, then
$$\det(A) = \prod_{i=1}^{n} a_{ii}.$$

A.6 Eigenvalues

We consider square real matrices $A \in \mathbb{R}^{n \times n}$.

A.6.1 Eigenvalues and Eigenvectors

Definition A.16: Eigenvalues and Eigenvectors
Let $A \in \mathbb{R}^{n \times n}$.
$\lambda$ is called an eigenvalue of $A$ if there is a vector $\vec{x} \neq 0$ such that
$$A\vec{x} = \lambda\vec{x},$$
where $\vec{x}$ is called an eigenvector associated with $\lambda$.

Notes:
• The eigenvalue may equal zero, but the eigenvector is required to be nonzero.
• If $\vec{x}$ is an eigenvector of $A$ with associated eigenvalue $\lambda$, then $a\vec{x}$ for any $a \in \mathbb{R} \setminus \{0\}$ is also an eigenvector of $A$, associated with the same eigenvalue.

Definition A.17: Characteristic Polynomial
Let $A \in \mathbb{R}^{n \times n}$. The degree-$n$ polynomial $p(\lambda) = \det(A - \lambda I)$ is called the characteristic polynomial of $A$.

The characteristic polynomial can be factored as $p(\lambda) = (\lambda_1 - \lambda) \cdots (\lambda_n - \lambda)$, where $\lambda_1, \ldots, \lambda_n$ are the $n$ eigenvalues of $A$, which we order as $|\lambda_1| \leq |\lambda_2| \leq \ldots \leq |\lambda_n|$. Note that some eigenvalues may occur multiple times, and some may be complex (in which case they occur in complex conjugate pairs).

Definition A.18: Algebraic and Geometric Multiplicity
Let $A \in \mathbb{R}^{n \times n}$. The algebraic multiplicity of an eigenvalue $\lambda_i$ of $A$, $\mu_A(\lambda_i)$, is the multiplicity of $\lambda_i$ as a root of $p(\lambda)$. The geometric multiplicity of $\lambda_i$, $\mu_G(\lambda_i)$, is the number of linearly independent eigenvectors associated with $\lambda_i$.

In other words, the geometric multiplicity $\mu_G(\lambda_i) = \dim(E)$, where $E = \{\vec{x} \mid (A - \lambda_iI)\vec{x} = 0\}$ is the eigenspace associated with $\lambda_i$.

Theorem A.19: Relation of Algebraic and Geometric Multiplicities
Let $A \in \mathbb{R}^{n \times n}$. The algebraic and geometric multiplicities of the eigenvalues satisfy the following properties.
1. $\mu_A(\lambda_i) \geq \mu_G(\lambda_i) \geq 1$ for all $i = 1, \ldots, n$
2. $A$ has $n$ linearly independent eigenvectors iff $\mu_A(\lambda_i) = \mu_G(\lambda_i)$ for all $i = 1, \ldots, n$.

If $A$ has $n$ linearly independent eigenvectors, it can be diagonalised.

A.6.2 Similarity Transformations

Definition A.20: Similarity Transformation
Let $A, B \in \mathbb{R}^{n \times n}$ with $B$ nonsingular. Then the transformation from $A$ to $B^{-1}AB$ is called a similarity transformation of $A$. $A$ and $B^{-1}AB$ are called similar.

Theorem A.21: Eigenvalues of Similar Matrices
Let $A, B \in \mathbb{R}^{n \times n}$ with $B$ nonsingular.
Then $A$ and $B^{-1}AB$ have the same eigenvalues (with the same algebraic and geometric multiplicities).

This can be shown as follows: $A\vec{x} = \lambda\vec{x}$, $\vec{x} \neq 0$, is equivalent to $AB\vec{y} = \lambda B\vec{y}$, $\vec{y} \neq 0$, for $\vec{y}$ given by $\vec{y} = B^{-1}\vec{x}$. This in turn is equivalent to $(B^{-1}AB)\vec{y} = \lambda\vec{y}$, $\vec{y} \neq 0$, so any eigenvalue of $A$ is also an eigenvalue of $B^{-1}AB$.

A.6.3 Diagonalisation

Definition A.22: Diagonalisable and Defective Matrices
Let $A \in \mathbb{R}^{n \times n}$. $A$ is called diagonalisable if it has $n$ linearly independent eigenvectors; otherwise, it is called defective.

Suppose $A \in \mathbb{R}^{n \times n}$ has $n$ linearly independent eigenvectors $\vec{x}_i$. Let $X$ be the matrix with the eigenvectors as its columns:
\[
X = [\vec{x}_1 | \ldots | \vec{x}_n].
\]
Then
\[
AX = X\Lambda, \quad \text{with} \quad
\Lambda = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n
\end{bmatrix},
\]
or
\[
X^{-1}AX = \Lambda,
\]
i.e., the similarity transformation with $X$ diagonalises $A$.

If $A$ is defective, it can be transformed into the so-called Jordan form (which, in some sense, is almost diagonal), using its $n$ generalised eigenvectors. We won't need to consider the Jordan form in these notes.

A.6.4 Singular Values of a Square Matrix

Let $A \in \mathbb{R}^{n \times n}$. Let $\lambda_i(A^TA)$ and $\lambda_i(AA^T)$, $i = 1, \ldots, n$, be the eigenvalues of $A^TA$ and $AA^T$, respectively, numbered in order of decreasing magnitude. Note that $A^TA$ and $AA^T$ are symmetric, so their eigenvalues are real, and they are positive semi-definite, so their eigenvalues are nonnegative. It can be shown that $A^TA$ and $AA^T$ have the same eigenvalues.

Definition A.23: Singular Values of a Square Matrix
Let $A \in \mathbb{R}^{n \times n}$. Then
\[
\sigma_i(A) = \sqrt{\lambda_i(A^TA)} = \sqrt{\lambda_i(AA^T)}, \quad i = 1, \ldots, n,
\]
are called the singular values of $A$.

A.7 Symmetric Matrices

We consider square matrices $A \in \mathbb{R}^{n \times n}$.

Definition A.24
$A \in \mathbb{R}^{n \times n}$ is called symmetric if $A = A^T$.

Theorem A.25: Eigenvalues and Eigenvectors of a Symmetric Matrix
Let $A \in \mathbb{R}^{n \times n}$. If $A$ is symmetric, then the eigenvalues of $A$ are real and $A$ has $n$ linearly independent eigenvectors that can be chosen to be mutually orthogonal.
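As a small illustration of Theorem A.25, the sketch below treats the symmetric $2 \times 2$ case, where the eigenvalues follow directly from the characteristic polynomial of Definition A.17. It is written in plain Python purely for illustration (an assumption; the notes themselves do not prescribe a language), and the helper names `eig_sym_2x2` and `dot` are hypothetical, not part of the notes.

```python
import math


def eig_sym_2x2(a, b, d):
    """Return [(lam1, v1), (lam2, v2)] for the symmetric matrix [[a, b], [b, d]]."""
    # Eigenvalues are the roots of p(lam) = lam^2 - (a + d)*lam + (a*d - b*b).
    mean = 0.5 * (a + d)
    # Half the square root of the discriminant (a - d)^2 + 4*b^2, which is
    # never negative, so both roots are real.
    radius = math.hypot(0.5 * (a - d), b)
    lam1, lam2 = mean - radius, mean + radius
    if b == 0:
        # Diagonal matrix: the coordinate axes are eigenvectors.
        e1, e2 = (1.0, 0.0), (0.0, 1.0)
        return [(lam1, e1), (lam2, e2)] if a <= d else [(lam1, e2), (lam2, e1)]
    # For b != 0, (b, lam - a) is a nonzero solution of (A - lam*I) x = 0.
    return [(lam1, (b, lam1 - a)), (lam2, (b, lam2 - a))]


def dot(u, v):
    """Euclidean inner product of two 2-vectors."""
    return u[0] * v[0] + u[1] * v[1]


# Example: A = [[2, 1], [1, 2]].
(lam1, v1), (lam2, v2) = eig_sym_2x2(2.0, 1.0, 2.0)
print(lam1, lam2)   # the two real eigenvalues, in increasing order
print(dot(v1, v2))  # inner product of the two eigenvectors
```

For this example matrix the eigenvalues are 1 and 3, and the inner product of the two computed eigenvectors vanishes, in line with the theorem. The closed form is specific to $n = 2$; it is not meant as a general eigenvalue algorithm.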
Definition A.26
$A \in \mathbb{R}^{n \times n}$ is called symmetric positive definite (SPD) if $A$ is symmetric and $\vec{x}^T A \vec{x} > 0$ for all $\vec{x} \neq 0$.

Theorem A.27: Eigenvalues of an SPD Matrix
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is SPD iff $\lambda_i > 0$ for all $i = 1, \ldots, n$.

Proof.
$\Rightarrow$ Assume $A$ is SPD. Then $\vec{x}^T A \vec{x} > 0$ for all $\vec{x} \neq 0$. Thus, $\vec{x}_i^T A \vec{x}_i = \lambda_i \|\vec{x}_i\|_2^2 > 0$ for any eigenvalue $\lambda_i$ with associated eigenvector $\vec{x}_i$, since $\vec{x}_i \neq 0$. This implies that $\lambda_i > 0$.

$\Leftarrow$ Assume $\lambda_i > 0$ for all $i$. $A$ has $n$ mutually orthogonal eigenvectors $\vec{x}_i$ since it is symmetric, and any $\vec{x} \neq 0$ can be expressed in the basis of the orthogonal eigenvectors. So $\vec{x} = \sum_{i=1}^{n} c_i \vec{x}_i$, where at least one of the $c_i \neq 0$. Thus, for any $\vec{x} \neq 0$,
\begin{align*}
\vec{x}^T A \vec{x}
&= \Big( \sum_{i=1}^{n} c_i \vec{x}_i^T \Big) \Big( \sum_{j=1}^{n} c_j A \vec{x}_j \Big) \\
&= \Big( \sum_{i=1}^{n} c_i \vec{x}_i^T \Big) \Big( \sum_{j=1}^{n} c_j \lambda_j \vec{x}_j \Big) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \lambda_j \vec{x}_i^T \vec{x}_j \\
&= \sum_{i=1}^{n} c_i^2 \lambda_i \vec{x}_i^T \vec{x}_i \quad \text{(due to orthogonality)} \\
&= \sum_{i=1}^{n} c_i^2 \lambda_i \|\vec{x}_i\|_2^2 > 0,
\end{align*}
so $A$ is SPD.

Note that an SPD matrix $A$ is nonsingular (it does not have a zero eigenvalue).

Definition A.28
$A \in \mathbb{R}^{n \times n}$ is called symmetric positive semi-definite (SPSD) if $A$ is symmetric and $\vec{x}^T A \vec{x} \geq 0$ for all $\vec{x} \neq 0$.

Theorem A.29: Eigenvalues of an SPSD Matrix
A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is SPSD iff $\lambda_i \geq 0$ for all $i = 1, \ldots, n$.

A.8 Matrices with Special Structure or Properties

Some matrices have a special structure, which may imply special properties.

A.8.1 Diagonal Matrices

Definition A.30
Let $A \in \mathbb{R}^{n \times n}$. Then
1. $A$ is called a diagonal matrix if $a_{ij} = 0$ for all $i \neq j$. With $\vec{a}$ the diagonal of a diagonal matrix $A$, we also write $A = \mathrm{diag}(\vec{a})$. For any matrix $A$ (also nondiagonal), we indicate its diagonal by $\vec{a} = \mathrm{diag}(A)$.
2. $A$ is called a tridiagonal matrix if $a_{ij} = 0$ for all $i, j$ satisfying $|i - j| > 1$.

A.8.2 Triangular Matrices

Definition A.31
1. $U \in \mathbb{R}^{n \times n}$ is called an upper triangular matrix if $u_{ij} = 0$ for all $i > j$.
2. $L \in \mathbb{R}^{n \times n}$ is called a unit lower triangular matrix if $l_{ij} = 0$ for all $i < j$, and $l_{ii} = 1$ for all $i$.

Note that $\det(U) = \prod_{i=1}^{n} u_{ii}$ and $\det(L) = 1$.
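The determinant formulas for triangular matrices can be checked against the cofactor expansion of Definition A.14 on small examples. The following is a minimal sketch in plain Python (an illustrative choice, not taken from the notes): it expands along the first row, i.e., fixed $i = 1$, and assumes the usual base case $\det(A) = a_{11}$ for a $1 \times 1$ matrix. The names `minor` and `det_cofactor` are hypothetical helpers. The recursion costs on the order of $n!$ operations, so it is meant only as a check, not as a practical algorithm.

```python
from math import prod


def minor(A, i, j):
    """Copy of A (a list of rows) with row i and column j removed (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det_cofactor(A):
    """Determinant by cofactor expansion along the first row."""
    n = len(A)
    if n == 1:
        # Assumed base case of the recursion: det of a 1x1 matrix.
        return A[0][0]
    # With 0-based j and the first row fixed, the sign (-1)^(i+j) becomes (-1)^j.
    return sum((-1) ** j * A[0][j] * det_cofactor(minor(A, 0, j))
               for j in range(n))


# Upper triangular example: the determinant should equal the diagonal product.
U = [[2, 1, 4],
     [0, 3, 5],
     [0, 0, -1]]
# Unit lower triangular example: the determinant should equal 1.
L = [[1, 0, 0],
     [7, 1, 0],
     [-2, 3, 1]]

det_U = det_cofactor(U)
diag_product_U = prod(U[i][i] for i in range(len(U)))
det_L = det_cofactor(L)
print(det_U, diag_product_U, det_L)
```

Here both `det_U` and `diag_product_U` evaluate to $2 \cdot 3 \cdot (-1) = -6$, and `det_L` evaluates to 1.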
A.8.3 Permutation Matrices

Definition A.32
$P \in \mathbb{R}^{n \times n}$ is called a permutation matrix if $P$ can be obtained from the $n \times n$ identity matrix $I$ by exchanging rows.

Note that $P$ has exactly one 1 in each row and column, and is otherwise 0. Note also that permutation matrices are orthogonal, i.e., $PP^T = I$, or $P^{-1} = P^T$, and $\det(P) = \pm 1$, depending on the parity of the permutation.

A.8.4 Projectors

Definition A.33
$P \in \mathbb{R}^{n \times n}$ is called a projector if $P^2 = P$. $I - P$ is also a projector, called the complementary projector to $P$.

Note: $P$ separates $\mathbb{R}^n$ into two subspaces, $S_1 = \mathrm{range}(P)$ and $S_2 = \mathrm{null}(P)$. We have $\vec{x} = P\vec{x} + (I - P)\vec{x}$, where $P\vec{x} \in S_1$, and $(I - P)\vec{x} \in S_2$ since $P(I - P)\vec{x} = (P - P^2)\vec{x} = 0$. $P$ projects $\vec{x}$ onto $S_1$ along $S_2$. For example, $P(\vec{x} + \vec{y}) = P\vec{x}$ if $\vec{y} \in S_2 = \mathrm{null}(P)$.

Definition A.34
$P \in \mathbb{R}^{n \times n}$ is called an orthogonal projector if $P^2 = P$ and $P^T = P$.

If $P$ is an orthogonal projector, $S_1 = \mathrm{range}(P)$ and $S_2 = \mathrm{null}(P)$ are orthogonal:
\[
(P\vec{x})^T \vec{y} = \vec{x}^T P^T \vec{y} = \vec{x}^T P \vec{y} = 0 \quad \text{if } \vec{y} \in \mathrm{null}(P).
\]
So $P$ projects $\vec{x}$ onto $S_1$ along $S_2$, where $S_2$ is orthogonal to $S_1$.

A.9 Big O Notation

A.9.1 Big O as $h \to 0$

Consider scalar functions $f(x)$ and $g(x)$ of a real variable $x$.

Definition A.35
$f(h) = O(g(h))$ as $h \to 0^+$ if
\[
\exists\, c > 0, \ \exists\, h_0 > 0: \quad |f(h)| \leq c\,|g(h)| \quad \forall h \text{ with } 0 \leq h \leq h_0.
\]

Example A.36
Let $f(h) = 3h^2 + 4h^3$. Then, as $h \to 0$,
\begin{align*}
f(h) &= O(h^2) \\
&\neq O(h^3) \\
&= O(h).
\end{align*}
In words: $f(h)$ approaches 0 at least as fast as $h^2$ (up to a multiplicative constant), but not as fast as $h^3$, and, clearly, also at least as fast as $h$. Note that $3h^2$ is the dominant term in $f(h)$ as $h \to 0$.

A.9.2 Big O as $n \to \infty$

Consider scalar functions $f(n)$ and $g(n)$ of an integer variable $n$.

Definition A.37
$f(n) = O(g(n))$ as $n \to \infty$ if
\[
\exists\, c > 0, \ \exists\, N_0 \geq 0: \quad |f(n)| \leq c\,|g(n)| \quad \forall n \geq N_0.
\]

Example A.38
Let $f(n) = 3n^2 + 4n^3$. Then, as $n \to \infty$,
\begin{align*}
f(n) &= O(n^3) \\
&\neq O(n^2) \\
&= O(n^4).
\end{align*}
In words: $f(n)$ approaches $\infty$ not faster than $n^3$ (up to a multiplicative constant), but faster than $n^2$, and, clearly, also not faster than $n^4$. Note that $4n^3$ is the dominant term in $f(n)$ as $n \to \infty$.

A.10 Sparse Matrix Formats

When matrices are sparse, it is often advantageous to store them in computer memory using sparse matrix formats. This can save large amounts of memory space, and it can also make computations faster if one implements methods that eliminate multiplications or additions with 0 (e.g., when computing matrix-vector or matrix-matrix products).

Consider, for example, the following sparse matrix, of which we will only store the nonzero elements and their locations:
\[
A = \begin{bmatrix}
16 & 0 & -18 & 0 \\
0 & 12 & 0 & 0 \\
0 & 0 & 14 & 18 \\
0 & 12 & 11 & 10
\end{bmatrix}. \tag{A.1}
\]
In all that follows, $i$ refers to rows, and $j$ refers to columns.

A.10.1 Simple List Storage

A simple sparse storage format is to store the (i, j, value) triplets in a list, e.g., ordered by row starting from row 1 and from left to right:

val    16  -18   12   14   18   12   11   10
i       1    1    2    3    3    4    4    4
j       1    3    2    3    4    2    3    4

A.10.2 Compressed Sparse Column Format

An alternative with some advantages is the Compressed Sparse Column (CSC) format, which Matlab uses internally. In this format, the val array stores the nonzero values, ordered by column, starting from column 1, and from top to bottom within a column. The i_val array stores the row index for each nonzero value. The j_ptr array saves on storage versus the j array in the simple list storage, as follows: j_ptr has one entry per column, and the entry indicates for each column where it starts in the val and i_val arrays. The j_ptr array has one additional entry at the end, which contains nnz(A) + 1.

val    16   12   12  -18   14   11   18   10
i_val   1    2    4    1    3    4    3    4
j_ptr   1    2    4    7    9

As such, j_ptr(k) indicates where column k starts in the val and i_val arrays, and j_ptr(k+1)-j_ptr(k) indicates how many nonzero elements there are in column k.
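The construction of the three CSC arrays, and the way a matrix-vector product skips the zero entries, can be sketched as follows. This is a minimal illustration in plain Python (an assumption; it is not Matlab's internal implementation), keeping the 1-based index values used in the tables above. The function names `dense_to_csc` and `csc_matvec` are hypothetical.

```python
def dense_to_csc(A):
    """Build 1-based CSC arrays (val, i_val, j_ptr) from a dense list of rows."""
    n_rows, n_cols = len(A), len(A[0])
    val, i_val, j_ptr = [], [], []
    for j in range(n_cols):
        # 1-based position in val/i_val where this column starts.
        j_ptr.append(len(val) + 1)
        for i in range(n_rows):
            if A[i][j] != 0:
                val.append(A[i][j])
                i_val.append(i + 1)  # 1-based row index
    # Extra final entry: nnz(A) + 1.
    j_ptr.append(len(val) + 1)
    return val, i_val, j_ptr


def csc_matvec(val, i_val, j_ptr, x, n_rows):
    """Compute y = A x, visiting only the stored nonzeros, column by column."""
    y = [0] * n_rows
    for k in range(len(j_ptr) - 1):
        # Column k+1 occupies 1-based positions j_ptr[k] .. j_ptr[k+1]-1.
        for p in range(j_ptr[k] - 1, j_ptr[k + 1] - 1):
            y[i_val[p] - 1] += val[p] * x[k]
    return y


# The example matrix (A.1).
A = [[16, 0, -18, 0],
     [0, 12, 0, 0],
     [0, 0, 14, 18],
     [0, 12, 11, 10]]

val, i_val, j_ptr = dense_to_csc(A)
# Number of stored nonzeros per column: j_ptr(k+1) - j_ptr(k).
col_counts = [j_ptr[k + 1] - j_ptr[k] for k in range(len(j_ptr) - 1)]
y = csc_matvec(val, i_val, j_ptr, [1, 2, 3, 4], len(A))
print(val, i_val, j_ptr, col_counts, y)
```

For the matrix (A.1), `dense_to_csc` reproduces the val, i_val and j_ptr arrays listed above, and the column counts come out as 1, 2, 3, 2. Note that the matrix-vector product performs one multiplication per stored nonzero (8 here) instead of the 16 needed for the dense $4 \times 4$ matrix.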
Some advantages of the Compressed Sparse Column format:
• saves on storage space versus dense format, and, in many practical cases, versus simple list storage
• finding all nonzeros in a given column of $A$ is very fast

Note, however, that finding all nonzero elements in a row of a sparse Matlab matrix can be very time-consuming, because the elements are stored per column. So if one needs to access rows of a sparse $A$ repeatedly, it can be much faster to store $A^T$ as a sparse matrix instead and access its columns.

Bibliography

[Ascher and Greif, 2011] Ascher, U. M. and Greif, C. (2011). A first course on numerical methods. SIAM, http://epubs.siam.org.ezproxy.lib.monash.edu.au/doi/book/10.1137/9780898719987.

[Björck, 2015] Björck, Å. (2015). Numerical methods in matrix computations. Springer, https://link-springer-com.ezproxy.lib.monash.edu.au/book/10.1007/978-3-319-05089-8.

[Demmel, 1997] Demmel, J. W. (1997). Applied numerical linear algebra. SIAM, http://epubs.siam.org.ezproxy.lib.monash.edu.au/doi/book/10.1137/1.9781611971446.

[Gander et al., 2014] Gander, W., Gander, M. J., and Kwok, F. (2014). Scientific computing - An introduction using Maple and MATLAB, volume 11. Springer, https://link-springer-com.ezproxy.lib.monash.edu.au/book/10.1007/978-3-319-04325-8.

[Linge and Langtangen, 2016] Linge, S. and Langtangen, H. P. (2016). Programming for Computations - MATLAB/Octave: A Gentle Introduction to Numerical Simulations with MATLAB/Octave. Springer, https://link-springer-com.ezproxy.lib.monash.edu.au/book/10.1007/978-3-319-32452-4.

[Quarteroni et al., 2010] Quarteroni, A., Sacco, R., and Saleri, F. (2010). Numerical mathematics, volume 37. Springer, https://link-springer-com.ezproxy.lib.monash.edu.au/book/10.1007/b98885.

[Saad, 2003] Saad, Y. (2003). Iterative methods for sparse linear systems. SIAM, http://www-users.cs.umn.edu/~saad/IterMethBook_2ndEd.pdf.

[Saad, 2011] Saad, Y. (2011).
Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM, http://www-users.cs.umn.edu/~saad/eig_book_2ndEd.pdf.

[Trefethen and Bau III, 1997] Trefethen, L. N. and Bau III, D. (1997). Numerical linear algebra, volume 50. SIAM, on overnight reserve in library.

[Winlaw et al., 2015] Winlaw, M., Hynes, M. B., Caterini, A., and De Sterck, H. (2015). Algorithmic acceleration of parallel ALS for collaborative filtering: Speeding up distributed big data recommendation in Spark. In 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pages 682–691. IEEE.