ELEC 0030 Numerical Methods
© University College London 2019 – F.A. Fernandez

1. Introduction

Most mathematical problems in engineering and physics, as well as in many other disciplines, cannot be solved exactly (analytically) because of their complexity. They have to be tackled either by using analytical approximations, that is, by choosing a simplified (and therefore only approximately correct) analytical formulation that can then be solved exactly, or by using numerical methods. In both cases approximations are involved, but this need not be a disadvantage since, in most practical cases, the parameters used in the mathematical model (the basic equations) will only be known approximately, by measurement or otherwise. Furthermore, even if the accuracy is high there is often no practical advantage in having an exact solution, when only a few decimal figures will have practical significance for interpretation or implementation.

An ideal property sought in the design of numerical methods is that they should be "exact-in-the-limit": if we refine the discretization, or take more terms in a series, or perform a larger number of iterations, etc. (depending on the nature of the method), the result should only improve, approaching the exact solution asymptotically. Of course not all methods have this property.

Since approximations will be present, an important aspect of using numerical methods is to be able to determine the magnitude of the error involved. Naturally, we cannot know the error itself; that would imply knowing the exact solution. But we can determine error bounds, or limits, so that we know the error cannot exceed a certain measure.

There are several sources for these errors. One of them could be an inaccurate mathematical model; that is, the mathematical description of the problem does not represent the physical situation correctly. Possible reasons for this are:
- Incomplete or incorrect theory
- Idealizations/Approximations/Uncertainty
The second one could be due, for example, to idealization of the geometry of the problem or of the properties of the materials involved, and to uncertainty in the values of material parameters. This type of error concerns the mathematical model, and the only way to reduce it is to improve the mathematical description of the physical problem. It does not concern the numerical calculation, although it is an important part of the more general problem of "numerical or computer modelling".

Errors concerning the numerical calculations can be of three kinds:
i) Blunder: mistakes, computer or programming bugs. We cannot really dismiss this error. In practice, the possibility of its occurrence must always be considered, in particular when the result is not known approximately in advance and there is then no means of judging the 'reasonableness' of the obtained result.
ii) Discretization or truncation error: this is an approximation error, where for example the value of a function is calculated with a finite number of terms of an infinite series, or an integral is estimated by the sum of a finite number of trapezoidal areas over (non-infinitesimal) segments. It is important to have an estimate or a bound (limit) for this type of error.
iii) Round-off error (sometimes called truncation error, referring to the truncation of the number of decimal places): arithmetic calculations can almost never be carried out with complete accuracy.
Most numbers have an infinite decimal representation, which must be rounded. Even if the data in a problem can initially be expressed exactly with a finite decimal representation, divisions may introduce numbers that must be rounded, and multiplication will introduce more digits. This type of error has a random character that makes it difficult to deal with.

So far the error we have been discussing is the absolute error, defined by:

error = true value − approximation

A problem with this definition is that it does not take into account the magnitude of the value being measured; for example, an absolute error of 1 cm has a very different significance in the length of a 100 m bridge and in that of a 10 cm bolt. Another definition, which reflects this significance, is the relative error:

relative error = absolute error / true value

In the example, the lengths have relative errors of 10^-4 and 0.1 respectively; or, in percent, 0.01% and 10%.

2. Machine representation of numbers

Not only do numbers have to be truncated in order to be represented (for manual calculation or in a machine), but all intermediate results are subject to the same restriction throughout the complete calculation process. Of the continuum of real numbers, only a discrete set can be represented. Additionally, only a limited range of numbers can be represented in a given machine. Even if two numbers can be represented, it is possible that the result of an arithmetic operation between them is not representable. There will also be occasions where results fall outside the range of representable numbers (underflow or overflow errors).

Example: some computers accept real numbers in a floating-point format with, say, two digits for the exponent, that is, in the form

±0.xxxxxxE±xx   (mantissa and exponent)

which can only represent numbers y in the range 10^-100 ≤ |y| < 10^99. This is a normalized form, where the mantissa is defined such that 0.1 ≤ |mantissa| < 1. We will see this in more detail next.

2.1 Floating-Point Arithmetic and Round-off Error

Any number can be represented in a normalized format of the form x = ±a×10^b in a decimal representation. We can use any other number as the exponent base, giving the more general form x = ±a×β^b; common examples are binary, octal and hexadecimal representations. In all of these, a is called the mantissa, a number between 0.1 and 1 (in the base used), which can be infinite in length (for example for π, or for 1/3 or 1/9); β is the base (10 for decimal representation); and b is the exponent (positive or negative), which can also be arbitrarily large. If we want to represent these numbers in a computer, or use them in calculations, we need to truncate both the mantissa (to a certain number of places) and the exponent, which limits the range (or size) of numbers that can be considered.

Now, if A is the set of numbers exactly representable in a given machine, the question arises of how to represent a number x not belonging to A (x ∉ A). This is encountered not only when reading data into a computer, but also when representing intermediate results during a calculation: results of the elementary arithmetic operations between two machine numbers need not belong to A. Let's see first how a number is represented (truncated) in a machine.
A machine representation can in most cases be obtained by rounding:

x  →  fl(x)

Here and from now on, fl(x) will denote the rounded form of x (that is, with a rounded mantissa and limited exponent), and not just its representation in normalized floating-point format. For example, in a computer with t = 4 digits for the mantissa:

fl(π) = 0.3142E1
fl(0.142853) = 0.1429E0
fl(14.28437) = 0.1428E2

In general, x is first written in normalized (floating-point) form x = a×10^b (if we consider a decimal system), where a is the mantissa and b the exponent, with 10^-1 ≤ |a| < 1 (so that the first digit of a after the decimal point is not 0). Suppose that the decimal representation of |a| is given by:

|a| = 0.α1 α2 α3 α4 … αi …,   0 ≤ αi ≤ 9,  α1 ≠ 0        (2.1)

(where a can have infinitely many digits).
Then we form:

a' = 0.α1 α2 … αt              if 0 ≤ α(t+1) ≤ 4
a' = 0.α1 α2 … αt + 10^-t      if α(t+1) ≥ 5        (2.2)

That is, only t digits are kept in the mantissa and the last one is rounded: αt is incremented by 1 if the next digit α(t+1) ≥ 5, and all digits after αt are deleted. The machine representation (rounded form of x) is then:

fl(x) = sign(x) · a' · 10^b        (2.3)

The exponent b must also be limited. A floating-point system is therefore characterized by the parameters t, the length of the mantissa; β, the base (10 in the decimal case above); and L and U, the limits of the exponent b: L ≤ b ≤ U.

To clarify some of the features of the floating-point representation, let's examine the system (β, t, L, U) = (10, 2, −1, 2). The mantissa can have only two digits, so there are only 90 possible forms for the mantissa (excluding 0 and negative numbers):

0.10, 0.11, 0.12, … , 0.98, 0.99

The exponent can vary from −1 to 2, so it can take only 4 possible values: −1, 0, 1 and 2. Then, including zero and negative numbers, this system can represent exactly 2×90×4 + 1 = 721 numbers: the set of floating-point numbers is finite. The smallest positive number in the system is 0.10×10^-1 = 0.01 and the largest is 0.99×10^2 = 99. We can also see that the spacing between numbers in the set varies; the numbers are not equally spaced. At the small end (near 0): 0, 0.01, 0.02, etc., the spacing is 0.01, while at the other extreme: 97, 98, 99, the spacing is 1 (100 times bigger in absolute terms).

In a system like this we are interested in the relative error produced when a number is truncated and rounded in order to be represented. (We exclude here the underflow or overflow errors produced when we try to represent a number outside the range defined by L and U, for example 1000, or even 100, in the example above.) It can be shown that the relative error of fl(x) is bounded by:

|fl(x) − x| / |x| ≤ 5×10^-t = eps        (2.4)

This limit eps is defined here as the machine precision.

Demonstration: from (2.1) and (2.2), the normalized decimal representation of x and its rounded floating-point form, the maximum possible difference between the two is 5 at decimal position t+1, that is:

|fl(x) − x| ≤ 5×10^-(t+1) × 10^b

Also, since |x| ≥ 0.1×10^b, i.e. 1/|x| ≤ 10×10^-b, we obtain condition (2.4).

From equation (2.4) we can write:

fl(x) = x(1 + ε)        (2.5)

where |ε| ≤ eps, for all numbers x. The quantity (1 + ε) in (2.5) cannot be distinguished from 1 in this machine representation, and the maximum value of ε is eps. So we can also define the machine precision eps as the smallest positive machine number g for which 1 + g > 1:

eps = min { g : g + 1 > 1, g > 0 }        (2.6)
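As a quick illustration, the following Matlab helper (a minimal sketch written for these notes, not part of the original text; the name fl_round is assumed) rounds a number to a t-digit decimal mantissa and shows that, with t = 4, a perturbation of 5×10^-4 is the smallest that survives rounding, in agreement with (2.4) and (2.6).

function y = fl_round(x, t)
% FL_ROUND  round x to a t-digit normalised decimal mantissa, i.e. fl(x)
% (illustrative helper; not from the original notes)
if x == 0, y = 0; return, end
b = floor(log10(abs(x))) + 1;        % exponent so that 0.1 <= |mantissa| < 1
a = abs(x)/10^b;                     % normalised mantissa
y = sign(x)*round(a*10^t)/10^t * 10^b;
end

Typical use, with t = 4:

fl_round(pi, 4)            % 3.1420, i.e. 0.3142E1
fl_round(1 + 5.0e-4, 4)    % 1.0010 : still distinguishable from 1
fl_round(1 + 4.9e-4, 4)    % 1.0000 : anything smaller is rounded away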
2.2 Error in basic operations

The result of an arithmetic operation between machine numbers must also be represented as a machine number. Then, for each arithmetic operation, and assuming that x and y are already machine numbers, we have:

x + y  →  fl(x + y) = (x + y)(1 + ε1)        (2.7)
x − y  →  fl(x − y) = (x − y)(1 + ε2)        (2.8)
x·y    →  fl(x·y)   = (x·y)(1 + ε3)          (2.9)
x/y    →  fl(x/y)   = (x/y)(1 + ε4)          (2.10)

with all |εi| ≤ eps. If x and y are not floating-point numbers (machine numbers) they will first have to be converted, giving:

x + y  →  fl(x + y) = fl(fl(x) + fl(y))

and similarly for the rest. Let's examine, for example, the subtraction of two such numbers, z = x − y, ignoring higher order error terms:

fl(z) = fl(fl(x) − fl(y)) = (x(1 + ε1) − y(1 + ε2))(1 + ε3)
      = ((x − y) + xε1 − yε2)(1 + ε3)
      = (x − y) + xε1 − yε2 + (x − y)ε3

Then:

|fl(z) − z| / |z| = |(xε1 − yε2)/(x − y) + ε3|  ≤  (1 + (|x| + |y|)/|x − y|) · eps

We can see that if x approaches y the relative error can blow up, especially for large values of x and y.

The maximal error bounds are pessimistic, and in practical calculations errors may tend to cancel. For example, in adding 20000 numbers rounded to, say, 4 decimal places, the maximum error would be 0.5×10^-4 × 20000 = 1 (imagining the maximum absolute rounding error of 0.00005 in every case), but it is extremely improbable that this case occurs. From a statistical point of view, one can expect that in about 90% of cases the error will not exceed 0.005.

Example
Let's compute the difference between a = 1200 and b = 1194 using a floating-point system with a 3-digit mantissa:

fl(a − b) = fl(fl(1200) − fl(1194)) = fl(0.120×10^4 − 0.119×10^4) = 0.001×10^4 = 10

where the correct value is 6, giving a relative error of 0.667 (or 66.7%). The machine precision for this system is eps = 5×10^-t = 5×10^-3, and the error bound above gives a limit for the relative error of

eps (1 + (a + b)/(a − b)) = 2.0

2.3 Error propagation in calculations

One of the important tasks in computer modelling is to find algorithms in which the error propagation remains bounded. In this context, an algorithm is defined as a finite sequence of elementary operations that prescribes how to calculate the solution to a problem from given input data (as in a sequence of computer instructions). Problems can arise when one is not careful, as the following example shows.

Example
Assume that we want to calculate the sum of three floating-point numbers a, b, c. This has to be done in sequence, using one of the two algorithms:

i) (a + b) + c        or        ii) a + (b + c)

If the numbers are in floating-point format with t = 8 digits and their values are, for example:

a =  0.23371258E-4
b =  0.33678429E 2
c = -0.33677811E 2

the two algorithms give the results:

i)  0.64100000E-3        ii) 0.64137126E-3

The exact result (which needs 14 digits to calculate) is 0.641371258E-3.

Exercise 2.1
Show, using an error analysis, why case ii) gives a more accurate result for the numbers of the example above. Neglect higher order error terms, that is, products of the form ε1ε2.
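The effect can be reproduced by simulating the 8-digit arithmetic with the fl_round helper sketched in section 2.1 (again an illustrative sketch, not part of the original notes):

t = 8;
a = 0.23371258e-4;  b = 0.33678429e2;  c = -0.33677811e2;
s1 = fl_round(fl_round(a + b, t) + c, t);   % algorithm i):  (a + b) + c
s2 = fl_round(a + fl_round(b + c, t), t);   % algorithm ii): a + (b + c)
fprintf('i)  %.8e\nii) %.8e\n', s1, s2)
% i)  6.41000000e-04  -- the small number a is partly lost when added to b first
% ii) 6.41371260e-04  -- agrees with the exact 0.641371258E-3 to 8 digits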
Example
Determine the error propagation in the calculation of y = (x − a)² using floating-point arithmetic, by two different algorithms, when x and a are already floating-point numbers.

a) Direct calculation: y = (x − a)²

fl(y) = [(x − a)(1 + ε1)]² (1 + ε2) = (x − a)² (1 + ε1)² (1 + ε2)

and, preserving only first order error terms:

fl(y) = (x − a)² (1 + 2ε1)(1 + ε2) = (x − a)² (1 + 2ε1 + ε2)

then:

fl(y) − y = (x − a)² (2ε1 + ε2)    or    Δy/y = (fl(y) − y)/y = 2ε1 + ε2

We can see that the relative error in the calculation of y using this algorithm is given by 2ε1 + ε2, so it is less than 3 eps.

b) Using the expanded form: y = x² − 2ax + a²

fl(y) = [ (x²(1 + ε1) − 2ax(1 + ε2)(1 + ε3))(1 + ε4) + a²(1 + ε5) ](1 + ε6)

That is, the square of x is taken first (with its error ε1), the product 2ax is subtracted (with its errors ε2, ε3 and the error ε4 of the subtraction), and then the last term a² is added (with its error ε5 and the error ε6 of that addition). Keeping only first order error terms:

fl(y) = x²(1 + ε1 + ε4 + ε6) − 2ax(1 + ε2 + ε3 + ε4 + ε6) + a²(1 + ε5 + ε6)

fl(y) = x² − 2ax + a² + x²(ε1 + ε4) − 2ax(ε2 + ε3 + ε4) + a²ε5 + (x² − 2ax + a²)ε6

from which we can write:

Δy/y = (fl(y) − y)/y = ε6 + (ε1 + ε4) x²/(x − a)² − (ε2 + ε3 + ε4) 2ax/(x − a)² + ε5 a²/(x − a)²

and we can see that there will be problems with this calculation if (x − a)² is too small compared with either x² or a². The first term above is bounded by eps, while the remaining terms are bounded by 2 eps, 3 eps and eps multiplied by the amplification factors x²/(x − a)², 2ax/(x − a)² and a²/(x − a)² respectively. For example, if x = 15 and a = 14, the three amplification factors are 225, 420 and 196, which gives a total error bound of (1 + 450 + 1260 + 196) eps = 1907 eps, compared with 3 eps for algorithm a).
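The contrast between the two algorithms can be seen numerically by simulating a very short mantissa with the fl_round helper from section 2.1 (an illustrative sketch, not part of the original notes):

t = 2;  x = 15;  a = 14;
y1 = fl_round(fl_round(x - a, t)^2, t);                    % algorithm a)
s  = fl_round(fl_round(x^2, t) - fl_round(2*a*x, t), t);   % x^2 - 2ax, rounded
y2 = fl_round(s + fl_round(a^2, t), t);                    % algorithm b)
% y1 = 1 (the exact value), y2 = 10: the cancellation in x^2 - 2ax
% amplifies the rounding of the individual terms, as the analysis predicts.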
Exercise 2.2
For y = a − b, compare the error bounds when a and b are, and are not, already defined as floating-point numbers.

Exercise 2.3
Determine the error propagation characteristics of two algorithms to calculate a) y = x² − a² and b) y = (x − 1)³. Assume in both cases that x and a are floating-point numbers.

3. Root Finding: Solution of nonlinear equations

This is a problem of very common occurrence in engineering and physics. We will study in this section several numerical methods to find roots of equations. The methods can be broadly classified as "bracketing methods" and "open methods". In the first case the solution lies inside an interval limited by a lower and an upper bound, and the methods always converge as the iterations progress. In contrast, the open methods are based on procedures that require one or two starting points which do not necessarily enclose the root. Because of this they sometimes diverge; however, when they do converge, they usually do so faster than the bracketing methods.

3.1 Bracketing Methods

3.1.1 The Bisection Method

If a function f is continuous in the interval [a, b] and f(a) and f(b) are of different sign (f(a)f(b) < 0), then at least one root of f(x) lies in that interval. The bisection method is an iterative procedure that starts with two points where f(x) has different sign, that is, points that "bracket" a root of the function. We define the point

c = (a + b)/2        (3.1)

as the midpoint of the interval [a, b], and examine the sign of f(c). There are three possibilities: f(c) = 0, in which case the solution is c; f(c) is positive; or f(c) is negative. The next interval in which to search for the root will be either [a, c] or [c, b], according to whether f(a) and f(c), or f(c) and f(b), are of different sign. (Equivalently, we can test whether f(a)f(c) < 0 or f(c)f(b) < 0.) The process then continues, each time halving the size of the search interval and "bracketing" the solution, as shown in Figs 3.1 and 3.2 (the interval [a_n, b_n] with midpoint c_n is reduced to [a_{n+1}, b_{n+1}] with midpoint c_{n+1}).

Convergence and error
Since we know that the solution lies in the interval [a, b], the absolute error for c is bounded by (b − a)/2. After n iterations, halving the interval each time, the search interval has been reduced in length to (b − a)/2^n, so the maximum error after n iterations is:

|α − c_n| ≤ (b − a)/2^n        (3.2)

where α is the exact position of the root and c_n is the nth approximation found by this method.
Furthermore, if we want to find the solution within a tolerance ε (that is, |α − c_n| ≤ ε), we can calculate the maximum number of iterations required from the expression above. Naturally, if at some stage the solution lies exactly at the middle of the current interval, the search finishes early.

An approximate relative error (or percent error) at iteration n+1 can be defined as:

ε_a = (c_{n+1} − c_n)/c_{n+1}

but from the figure above we can see that c_{n+1} − c_n = (b_{n+1} − a_{n+1})/2 and, since c_{n+1} = (a_{n+1} + b_{n+1})/2, the relative error at iteration n+1 can be written as:

ε_a = (b_{n+1} − a_{n+1}) / (b_{n+1} + a_{n+1})        (3.3)

This expression can also be used to stop the iterations.

Exercise 3.1
Show that the number of iterations required to achieve a tolerance ε is the smallest integer n that satisfies:

n ≥ (log(b − a) − log ε) / log 2        (3.4)

Example
The function f(x) = cos(3x) has one root in the interval [0, 1] (x_r = π/6 = 0.523598775598). The following simple Matlab program implements the bisection method to find this root.

a=0; b=1;        % limits of the search interval [a,b]
eps=1e-6;        % sets tolerance to 1e-6
fa=ff(a); fb=ff(b);
if(fa*fb>0)      % is the root not between a and b?
  disp('root not in the interval selected')
else             % the root is between a and b
  n=ceil((log(b-a)-log(eps))/log(2));   % rounded up to closest integer
  disp('Iteration number       a               b               c')
  for i=1:n
    c=a+0.5*(b-a);                      % c is set as the midpoint between a and b
    disp([sprintf('%8d',i),'  ',sprintf('%15.8f',a,b,c)])
    fc=ff(c);
    if(fa*fc)<0        % the root is between a and c
      b=c; fb=fc;
    elseif(fa*fc)>0    % the root is between c and b
      a=c; fa=fc;
    else
      return
    end
  end
end

together with the function definition:

function y=ff(x)
%****************************
y=cos(3*x);
%****************************

And the corresponding results are:

Iteration number        a            b            c
        1          0.00000000   1.00000000   0.50000000
        2          0.50000000   1.00000000   0.75000000
        3          0.50000000   0.75000000   0.62500000
        4          0.50000000   0.62500000   0.56250000
        5          0.50000000   0.56250000   0.53125000
        6          0.50000000   0.53125000   0.51562500
        7          0.51562500   0.53125000   0.52343750
        8          0.52343750   0.53125000   0.52734375
        9          0.52343750   0.52734375   0.52539063
       10          0.52343750   0.52539063   0.52441406
       11          0.52343750   0.52441406   0.52392578
       12          0.52343750   0.52392578   0.52368164
       13          0.52343750   0.52368164   0.52355957
       14          0.52355957   0.52368164   0.52362061
       15          0.52355957   0.52362061   0.52359009
       16          0.52359009   0.52362061   0.52360535
       17          0.52359009   0.52360535   0.52359772
       18          0.52359772   0.52360535   0.52360153
       19          0.52359772   0.52360153   0.52359962
       20          0.52359772   0.52359962   0.52359867

Provided that the solution lies in the initial interval, and since the search interval is repeatedly divided by two, this method will always converge to the solution and will find it within the required precision in a finite number of iterations. However, because of the rather blind choice of the approximation (it is always the middle of the interval), the error does not vary monotonically. For the previous example, f(x) = cos(3x) = 0:
Iteration number        c             f(c)          error %
        1          0.50000000    0.07073720    4.50703414
        2          0.75000000   -0.62817362  -43.23944878
        3          0.62500000   -0.29953351  -19.36620732
        4          0.56250000   -0.11643894   -7.42958659
        5          0.53125000   -0.02295166   -1.46127622
        6          0.51562500    0.02391905    1.52287896
        7          0.52343750    0.00048383    0.03080137
        8          0.52734375   -0.01123469   -0.71523743
        9          0.52539063   -0.00537552   -0.34221803
       10          0.52441406   -0.00244586   -0.15570833
       11          0.52392578   -0.00098102   -0.06245348
       12          0.52368164   -0.00024860   -0.01582605
       13          0.52355957    0.00011762    0.00748766
       14          0.52362061   -0.00006549   -0.00416920
       15          0.52359009    0.00002606    0.00165923
       16          0.52360535   -0.00001971   -0.00125498
       17          0.52359772    0.00000317    0.00020212
       18          0.52360153   -0.00000827   -0.00052643
       19          0.52359962   -0.00000255   -0.00016215
       20          0.52359867    0.00000031    0.00001998

We can see that the error is not continually decreasing, although in the end it has to become small. This is due to the rather "brute force" nature of the algorithm: the approximation to the solution is chosen blindly as the midpoint of the interval, without any attempt at guessing its position inside the interval. For example, if at some iteration n the magnitudes of f(a_n) and f(b_n) are very different, say |f(a_n)| >> |f(b_n)|, it is likely that the solution is closer to b than to a, if the function is smooth. A possible way to improve the method is to select the point c by interpolating the values at a and b. This is called the "regula falsi" method, or method of false position.

3.1.2 Regula Falsi Method

In this case the next point is obtained by linear interpolation between the values at a and b. From Fig. 3.3 we can see that:

f(a)/(a − c) = f(b)/(b − c)        (3.5)

from which

c = (a f(b) − b f(a)) / (f(b) − f(a))

or alternatively:

c = a + f(a)(a − b) / (f(b) − f(a))        (3.6)

Fig. 3.3

The algorithm is the same as for the bisection method, except for the calculation of the point c. In this case, for the same function (f(x) = cos(3x) = 0), the solution within the same absolute tolerance (10^-6) is found in only 4 iterations:

Iteration number        a            b            c            f(c)
        1          0.00000000   1.00000000   0.50251446    0.06321078
        2          0.50251446   1.00000000   0.53237237   -0.02631774
        3          0.50251446   0.53237237   0.52359536    0.00001025
        4          0.52359536   0.53237237   0.52359878   -0.00000000

and for the error:

Iter. number        c            f(c)          error %
        1       0.50251446    0.06321078    4.02680812
        2       0.53237237   -0.02631774   -1.67563307
        3       0.52359536    0.00001025    0.00065258
        4       0.52359878   -0.00000000   -0.00000008

We can see that the error decreases much more rapidly than in the bisection method. The size of the interval also decreases more rapidly. In this case the successive values are:

Iter. number     b − a          b − a in bisection method
        1       0.49748554      0.5
        2       0.02985791      0.25
        3       0.00877701      0.125
        4       0.00000342      0.0625

However, this is not always the case and on some occasions the search interval can remain large. In particular, one of the limits can remain stuck while the other converges to the solution; in that case the length of the interval tends to a finite value instead of converging to zero. In the following example, for the function f(x) = x^10 − 1, the solution requires 70 iterations to reach a tolerance of 10^-6 with the regula falsi method, while only 24 are needed with the bisection method. We can also see that the right side of the interval remains stuck at 1.3, and the size of the interval tends to 0.3 in the limit instead of converging to zero. Fig. 3.4 shows the interpolating lines at each iteration; the corresponding approximations are the points where these lines cross the x-axis.

Fig. 3.4 Standard regula falsi        Fig. 3.5 Modified regula falsi
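A minimal Matlab sketch of the standard regula falsi iteration (3.6), reusing the function ff defined for the bisection program (illustrative only; this is not the modified version described next):

a = 0; b = 1; tol = 1e-6;
fa = ff(a); fb = ff(b);
for i = 1:100
    c  = a + fa*(a - b)/(fb - fa);   % interpolated point, eq. (3.6)
    fc = ff(c);
    if abs(fc) < tol, break, end
    if fa*fc < 0                     % root between a and c
        b = c; fb = fc;
    else                             % root between c and b
        a = c; fa = fc;
    end
end
fprintf('root = %.8f found in %d iterations\n', c, i)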
3.1.3 Modified regula falsi method

A modified version of the regula falsi method that improves this situation consists of an algorithm that detects when one of the limits gets stuck and then divides the function value retained at that limit by 2, changing the slope of the interpolating line. For the same example, this modified version reaches the solution in 13 iterations. The sequence of approximations and interpolating lines is shown in Fig. 3.5.

3.2 Open Methods

These methods start with only one point, or with two points that do not necessarily bracket the root. One of the simplest is the fixed point iteration.

3.2.1 Fixed point iteration

This method is also called one-point iteration or successive substitution. For an equation of the form f(x) = 0, the method simply consists of re-arranging it into the form x = g(x).

Example
For the equation 0.5x² − 1.1x + 0.505 = 0 the iteration can be set up as:

x = 0.5x² − 0.1x + 0.505        or        x = ((x − 0.1)² + 1)/2

With an initial guess introduced in the right hand side, a new value of x is obtained and the iteration can continue. Starting from the value x0 = 0.5, the successive values are:

Iteration number        x            error %
        0          0.5           23.40526755
        1          0.58000000    11.15011036
        2          0.61520000     5.757841193
        3          0.63271552     3.074648057
        4          0.64189291     1.668768191
        5          0.64682396     0.913383012
        6          0.64950822     0.502182715
        7          0.65097964     0.276776655
        8          0.65178928     0.152748337
        9          0.65223571     0.08436105
       10          0.65248214     0.046610413
       11          0.65261826     0.025758511
       12          0.65269347     0.014236789

Fig. 3.6 Plot of x and g(x)        Fig. 3.7 Close-up showing the successive approximations

Convergence
This is a very simple method but convergence is not guaranteed. Fig. 3.8 shows situations where the method converges and where it diverges: in cases (a) and (b) the method converges, while in cases (c) and (d) it diverges.

Fig. 3.8 (a – d)

From these plots we can see that it is relatively easy to determine when the method will converge, so the best way of ensuring success is to plot the functions y = g(x) and y = x. More rigorously, we can also see that for convergence to occur the slope of g(x) must be smaller in magnitude than that of x in the region of search, that is:

|g'(x)| < 1

If divergence is predicted, a different way of re-writing the problem f(x) = 0 in the form x = g(x) must be found that satisfies the condition above. For example, for the function f(x) = 3x² + 3x − 1 = 0, with a solution at x0 = 0.2637626, we can separate it in the following two forms:

(a) x = g(x) = (1 − 3x²)/3        and        (b) x = g(x) = 3x² + 4x − 1

In the first case g'(x) = −2x, so g'(x0) = −0.5275252, while in the second case g'(x) = 6x + 4 and g'(x0) = 5.5825757.

Fig. 3.9(a)  x = g(x) = (1 − 3x²)/3        Fig. 3.9(b)  x = g(x) = 3x² + 4x − 1

This illustrates the main deficiency of this method: convergence often depends on how the problem is formulated. Additionally, divergence can also occur if the initial guess is not sufficiently close to the solution.
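A minimal Matlab sketch of the fixed point iteration for the example above (illustrative only; g and the starting value are taken from the example):

g = @(x) ((x - 0.1).^2 + 1)/2;    % x = g(x) form of 0.5x^2 - 1.1x + 0.505 = 0
x = 0.5;                          % initial guess
for i = 1:100
    xnew = g(x);
    if abs(xnew - x) < 1e-6, break, end
    x = xnew;
end
fprintf('x = %.8f after %d iterations\n', xnew, i)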
3.2.2 Newton-Raphson Method

This is one of the most used methods for root finding. It also needs only one point to start the iterations but, unlike the fixed point iteration, it will converge to the solution provided the function is monotonically varying in the region of interest. Starting from a point x0, the tangent to the function f(x) at that point (a line with slope equal to the derivative of f) is extrapolated to find the point where it crosses the x-axis, providing a new approximation. The same procedure is repeated until the error tolerance is achieved. The method needs repeated evaluation of the function and its derivative, and an appropriate stopping criterion is the value of the function at the successive approximations, f(x_n).

Fig. 3.10

From the figure we can see that at stage n the slope is f'(x_n) = f(x_n)/(x_n − x_{n+1}), so the next approximation is found as:

x_{n+1} = x_n − f(x_n)/f'(x_n)        (3.7)

Example
For the same function as in the previous examples, f(x) = cos(3x) = 0, and for the same tolerance of 10^-6, the solution is found in 3 iterations starting from x0 = 0.3 (after 3 iterations the accuracy is better than 10^-8). Starting from 0.5, only 2 iterations are needed.

Iteration number        x            f(x)
        0          0.30000000    0.62160997
        1          0.56451705   -0.12244676
        2          0.52339200    0.00062033
        3          0.52359878   -0.00000000

The method can also be derived from the Taylor series expansion, which provides a useful insight into its rate of convergence. Consider the Taylor expansion truncated to first order (see Appendix):

f(x_{i+1}) = f(x_i) + f'(x_i)(x_{i+1} − x_i) + (f''(ξ)/2!)(x_{i+1} − x_i)²        (3.8)

Considering now the exact solution x_r and the Taylor expansion evaluated at that point:

f(x_r) = 0 = f(x_i) + f'(x_i)(x_r − x_i) + (f''(ξ)/2!)(x_r − x_i)²

and re-ordering (assuming a single root, so that the first derivative ≠ 0):

x_r = x_i − f(x_i)/f'(x_i) − (f''(ξ)/(2! f'(x_i)))(x_r − x_i)²        (3.9)

Using now (3.7) for x_{i+1}, that is x_{i+1} = x_i − f(x_i)/f'(x_i), and substituting in (3.9) gives:

x_r − x_{i+1} = −(f''(ξ)/(2! f'(x_i)))(x_r − x_i)²        (3.10)

The error at stage i can be written as E_i = x_r − x_i; then, from (3.10):

E_{i+1} = −(f''(ξ)/(2 f'(x_i))) E_i²

Assuming convergence, both ξ and x_i should eventually approach x_r, so the previous equation can be re-arranged in the form:

E_{i+1} = −(f''(x_r)/(2 f'(x_r))) E_i²        (3.11)

We can see that the relation between the errors of successive iterations is quadratic: on each Newton-Raphson iteration the number of correct decimal places should approximately double. This is what is called quadratic convergence.

Although the convergence rate is generally quite good, there are cases that show poor or no convergence. One example is when there is an inflexion point near the root; in that case the iteration values can progressively diverge from the solution. Another case is when the root is a multiple root, that is, when the first derivative is also zero there.
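A minimal Matlab sketch of (3.7) for the running example f(x) = cos(3x), with the derivative supplied analytically (illustrative only):

f  = @(x) cos(3*x);
df = @(x) -3*sin(3*x);
x  = 0.3;                        % starting point
for i = 1:20
    x = x - f(x)/df(x);          % Newton-Raphson step, eq. (3.7)
    if abs(f(x)) < 1e-8, break, end
end
fprintf('root = %.10f after %d iterations\n', x, i)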
Stopping the iterations
It can be shown that the absolute error (the difference between the current approximation and the correct value) can be written as a multiple of the difference between the last two consecutive approximations:

E_n = α − x_n = C_n (x_n − x_{n−1})

and the constant C_n tends to 1 as the method converges. A convenient criterion for stopping the iterations is therefore:

C_n |x_n − x_{n−1}| ≤ ε

choosing a small value for C_n, usually 1.

3.2.3 The Secant Method

One of the difficulties of the Newton-Raphson method is the need to evaluate the derivative. This can be very inconvenient or difficult in some cases. However, the derivative can be approximated using a finite difference expression, for example the backward difference:

f'(x_i) ≈ (f(x_i) − f(x_{i−1})) / (x_i − x_{i−1})        (3.13)

If we substitute this into the expression for the Newton-Raphson iteration, the following equation is obtained:

x_{i+1} = x_i − f(x_i)(x_i − x_{i−1}) / (f(x_i) − f(x_{i−1}))        (3.14)

which is the formula for the secant method.

Example
Use the secant method to find the root of f(x) = e^−x − x. Start with the estimates x_{−1} = 0 and x_0 = 1. The exact result is 0.56714329…

First iteration:
x_{−1} = 0,   f(x_{−1}) = 1.0
x_0 = 1,      f(x_0) = −0.63212
then
x_1 = 1 − (−0.63212)(1 − 0)/(−0.63212 − 1.0) = 0.61270        ε = 8%

Second iteration:
x_0 = 1,          f(x_0) = −0.63212
x_1 = 0.61270,    f(x_1) = −0.07081
then
x_2 = 0.61270 − (−0.07081)(0.61270 − 1)/(−0.07081 − (−0.63212)) = 0.56383832

Note that in this case the two points are on the same side of the root (they do not bracket it). Using Excel, a simple calculation can be made giving:

  i        x_i              f(x_i)            error %
 −1     0                1
  0     1               −0.63212
  1     0.612700047     −0.070814271      8.032671349
  2     0.563838325      0.005182455     −0.582738902
  3     0.567170359     −4.24203E-05      0.004772880
  4     0.567143307     −2.53813E-08      2.92795E-06
  5     0.567143290      1.24234E-13      7.22401E-08

The secant method does not require the evaluation of the derivative of the function as the Newton-Raphson method does, but otherwise it suffers from the same problems. Its convergence is similar to that of Newton's method and, similarly, it has severe problems if the derivative is zero or near zero in the region of interest.
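A minimal Matlab sketch of the secant iteration (3.14) for this example (illustrative only):

f  = @(x) exp(-x) - x;
x0 = 0;  x1 = 1;                                 % the two starting estimates
for i = 1:20
    x2 = x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0));   % eq. (3.14)
    if abs(x2 - x1) < 1e-8, break, end
    x0 = x1;  x1 = x2;
end
fprintf('root = %.8f after %d iterations\n', x2, i)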
3.2.4 Multiple Roots

We have seen that some of the methods have poor convergence if the derivative is very small or zero. For higher order zeros (multiple roots), the function is zero and so are the first n−1 derivatives (where n is the order of the root). In this case the Newton-Raphson method (and the secant method) will converge poorly. We can notice, however, that if the function f(x) has a multiple root at x = α, the function:

g(x) = f(x)/f'(x)        (3.15)

has a simple root at x = α (if the root of f is of order n, the corresponding root of the derivative is of order n−1). We can then use the standard Newton-Raphson method on the function g(x).

Exercise 3.2
Use the Newton-Raphson method to find a root of the function f(x) = 1 − x e^(1−x). Start the iterations with x0 = 0.

3.2.5 Extrapolation and Acceleration

Observing how the iterations proceed in the methods we have just studied, one can expect some advantage from computing an extrapolated value from the iterates, in order to improve the next guess and accelerate the convergence. One of the best methods for extrapolation and acceleration is Aitken's δ² method. Considering the sequence of values x_n, the extrapolated sequence can be constructed by:

x̄_n = x_n − α(x_n − x_{n−1})    where    α = (x_n − x_{n−1}) / (x_n − 2x_{n−1} + x_{n−2})        (3.16)

Example
For the function 0.5x² − 1.1x + 0.505 (the example of section 3.2.1), using the fixed-point method to find a root with and without acceleration gives the results below. The iterations are started with x0 = 0.5 and the alternative form x = g(x) = ((x − 0.1)² + 1)/2 is used.
Iter.     Fixed Point method               Accelerated FP method
number    x              error %           x              error %
  0       0.5            23.40526755       —              —
  1       0.58           11.15011036       —              —
  2       0.6152          5.757841193      0.642857143    1.521058278
  3       0.63271552      3.074648057      0.650063694    0.417090524
  4       0.641892913     1.668768191      0.651994046    0.121380999
  5       0.646823964     0.913383012      0.652550135    0.036194034
  6       0.649508224     0.502182715      0.652715129    0.010918594
  7       0.650979644     0.276776655      0.652764775    0.003313424
  8       0.651789284     0.152748337      0.65277982     0.001008684
  9       0.652235707     0.08436105       0.652784397    0.00030759
 10       0.652482138     0.046610413      0.652785792    9.38845E-05
 11       0.652618256     0.025758511      0.652786217    2.86706E-05
 12       0.652693469     0.014236789      0.652786347    8.75791E-06

This acceleration method can also be embedded in the standard method; that is, after a couple of normal iterations, Aitken's extrapolation is used to jump ahead, the standard iteration is continued from there for another 2 steps, and the extrapolation is applied again to further accelerate the convergence. For the fixed-point iteration, for example, this can be done in the following form. Starting from a value x0, the first two iterates are found using the standard method:

x1=g(x0);
x2=g(x1);

Now we can apply the Aitken extrapolation:

alpha=(x2-x1)/(x2-2*x1+x0);
xbar=x2-alpha*(x2-x1);

and refresh the initial guess, x0=xbar, and re-start the iterations. For the same case as the example above this gives:

Iter.     Fixed Point method               Accelerated FP method
number    x              error %           x              error %
  0       0.5            23.40526755       —              —
  1       0.58           11.15011036       —              —
  2       0.6152          5.757841193      0.642857143    1.521058278
  3       0.642857143     1.521058278      —              —
  4       0.647346939     0.833268844      —              —
  5       0.649794336     0.458353419      0.65272704     0.009094066
  6       0.65272704      0.009094066      —              —
  7       0.65275359      0.005026806      —              —
  8       0.652768266     0.002778668      0.652786402    3.33604E-07
  9       0.652786402     3.33604E-07      —              —
 10       0.652786403     1.84412E-07      —              —
 11       0.652786404     1.0194E-07       0.652786405    0

Notice the acceleration of convergence occurring at the 3rd, 6th and 9th iterations. Graphically this is shown in Fig. 3.11.

The same idea applies to the Newton-Raphson method, where the evaluations of x1 and x2 are replaced by the corresponding N-R steps, and the function and its derivative have to be calculated at each stage:

f0=f(x0);     % calculation of the function
df0=df(x0);   % calculation of the derivative
x1=x0-f0/df0;
f1=f(x1); df1=df(x1);
x2=x1-f1/df1;
alpha=(x2-x1)/(x2-2*x1+x0);
xbar=x2-alpha*(x2-x1);
x0=xbar;

Fig. 3.11
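A runnable version of this accelerated fixed-point scheme, for the same g(x) and starting value as above (a sketch for illustration only):

g  = @(x) ((x - 0.1).^2 + 1)/2;
x0 = 0.5;
for cycle = 1:20
    x1 = g(x0);  x2 = g(x1);                 % two ordinary iterations
    if abs(x2 - x1) < 1e-10, xbar = x2; break, end
    alpha = (x2 - x1)/(x2 - 2*x1 + x0);      % Aitken factor, eq. (3.16)
    xbar  = x2 - alpha*(x2 - x1);            % extrapolated value
    x0 = xbar;                               % restart from the extrapolated point
end
fprintf('x = %.9f after %d cycles\n', xbar, cycle)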
3.2.6 Higher Order Methods

All the methods we have seen consist of approximating the function by a straight line and looking for the intersection of that line with the axis. An obvious extension of this idea is to use a higher order approximation, for example fitting a second order curve (a parabola) to the function and looking for the zero of this instead. A method that implements this is Müller's method.

Müller's method
Using three points, we can find the equation of a parabola that fits the function and then find the zeros of the parabola. This is the Müller (or Müller-Traub) method. For example, for the function y = x^6 − 2 (used to find the value of the sixth root of 2), the function and the approximating parabola are shown in Fig. 3.12. In this case the three points are chosen as x1 = 0.5, x2 = 1.5 and x3 = 1.0. In general the points do not need to be equally spaced, in order, or bracketing a root as in the figure, which shows a case of a real root. Using Newton interpolation (see chapter 4), the parabola passing through the points (x1, y1), (x2, y2) and (x3, y3) can be written as:

y = y3 + c2(x − x3) + d1(x − x3)(x − x2)

where c1 = (y2 − y1)/(x2 − x1), c2 = (y3 − y2)/(x3 − x2) and d1 = (c2 − c1)/(x3 − x1). Solving for the zero closest to x3 gives:

x4 = x3 − 2y3 / ( s + sign(s)·sqrt(s² − 4 y3 d1) )        (3.17)

where s = c2 + d1(x3 − x2).

Fig. 3.12

The results of the iterations are:

Iteration number        x             error
        1          1.07797900     0.07797900
        2          1.37190697     0.29392797
        3          1.37170268    -0.00020429
        4          1.37148553    -0.00021716
        5          1.17938786    -0.19209767
        6          1.14443703    -0.03495083
        7          1.12268990    -0.02174712
        8          1.12242784    -0.00026206
        9          1.12246205     0.00003420
       10          1.12246205     0.00000000

Müller's method requires three points to start the iterations, but it does not require the evaluation of derivatives as the Newton-Raphson method does. It can also be used to find complex roots, even if the starting values are real.

3.2.7 Systems of Nonlinear Equations

So far we have considered only the solution of single nonlinear equations. However, there are many cases where systems of nonlinear equations must be solved. For this we will only consider one approach, an extension of the Newton-Raphson method seen before.

Multivariate Newton Method
In the one-variable Newton's method, the expression x_{k+1} = x_k − f(x_k)/f'(x_k) comes from the Taylor series truncated to first order:

f(x_{k+1}) = f(x_k) + f'(x_k)(x_{k+1} − x_k)

where we set (our aim) f(x_{k+1}) = 0. For functions of more than one variable, say f(u, v, w), the corresponding truncated Taylor series (to first order) would be:

f(u_{k+1}, v_{k+1}, w_{k+1}) = f(u_k, v_k, w_k) + (∂f/∂u)(u_{k+1} − u_k) + (∂f/∂v)(v_{k+1} − v_k) + (∂f/∂w)(w_{k+1} − w_k)

with the derivatives evaluated at step k, and similarly for other functions g and h of u, v and w, constituting a system of equations. This can be written in compact form as:

F(x_{k+1}) = F(x_k) + DF(x_k)(x_{k+1} − x_k) + O(||x_{k+1} − x_k||²)        (3.18)

where x = (u, v, w)^T and F(x) = (f(u,v,w), g(u,v,w), h(u,v,w))^T. DF(x) is the matrix of partial derivatives

DF(x) = [ ∂f/∂u  ∂f/∂v  ∂f/∂w
          ∂g/∂u  ∂g/∂v  ∂g/∂w
          ∂h/∂u  ∂h/∂v  ∂h/∂w ]        (3.19)

known as the Jacobian matrix. Following the same procedure as in the one-variable case, we can set F(x_{k+1}) = 0 in (3.18) and we have:

0 ≈ F(x_k) + DF(x_k)(x_{k+1} − x_k)

or

x_{k+1} = x_k − DF(x_k)^(−1) F(x_k)        (3.20)

which can be written as:

DF(x_k) s_k = −F(x_k);    s_k = x_{k+1} − x_k        (3.21)

so the algorithm is:
– choose a starting vector x0
– for k = 0, 1, …
  – calculate the matrix DF(x) and evaluate it at x_k
  – solve the linear system of equations DF(x_k) s_k = −F(x_k) for the increment s_k
  – calculate the next point as x_{k+1} = x_k + s_k
– end

Example
Use the multivariate Newton's method, starting with the guess vector x0 = (1, 2)^T, to find a solution of the system:

v − u³ = 0
u² + v² − 1 = 0

The matrix DF(x) is:

DF(x) = [ −3u²   1
           2u    2v ]

so

DF(x0) = [ −3   1
            2   4 ]

Using x0 = (1, 2)^T, the first increment s0 is found from:

[ −3  1 ] s0 = − [ 1 ]
[  2  4 ]        [ 4 ]

The solution of this is s0 = (0, −1)^T, so the next point is x1 = x0 + s0 = (1, 1)^T. Setting a tolerance of 10^-8, convergence is achieved in 6 iterations and the result is:

(u6, v6) = (0.826031357654187, 0.563624162161259)

Fig. 3.13 Successive iterative solutions
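A minimal Matlab sketch of this algorithm for the example system, with the Jacobian coded by hand (illustrative only):

F  = @(x) [x(2) - x(1)^3;  x(1)^2 + x(2)^2 - 1];
DF = @(x) [-3*x(1)^2, 1;  2*x(1), 2*x(2)];      % Jacobian, eq. (3.19)
x  = [1; 2];                                    % starting vector x0
for k = 1:20
    s = DF(x) \ (-F(x));                        % solve DF(x)s = -F(x), eq. (3.21)
    x = x + s;
    if norm(s) < 1e-8, break, end
end
fprintf('u = %.15f, v = %.15f after %d iterations\n', x(1), x(2), k)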
4. Interpolation and Approximation

There are many occasions when there is a need to convert discrete data, from measurements or calculations, into a smooth function. This could be necessary, for example, to obtain estimates of the values between data points. It could also be necessary if one wants to represent a function by a simpler one that approximates its behaviour. Two approaches to this task can be used. Interpolation consists of fitting a smooth curve exactly to the data points and creating the curve that spans these points. In approximation, the curve does not necessarily fit the data points exactly, but it approximates the overall behaviour of the data, following an established criterion.
4.1 Interpolation

4.1.1 Method of Undetermined Coefficients

The simplest and most obvious way of finding a polynomial of order n that passes through a set of n+1 points is to write down the n+1 equations:

p_n(x_i) = y_i ;   i = 1, 2, …, n+1

This results in a matrix problem to solve, with the matrix rows given by:

[ 1  x_i  x_i²  …  x_i^n ],   i = 1, 2, …, n+1

However, this matrix problem is notoriously ill-conditioned (see the chapter on Matrix Computations): the results are very sensitive to small changes in the point positions (x_i) or the values (y_i) and can be highly inaccurate.

4.1.2 Lagrange Interpolation

The basic interpolation problem can be formulated as: given a set of nodes {x_i, i = 0,…,n} and corresponding data values {y_i, i = 0,…,n}, find the polynomial p(x) of degree less than or equal to n such that p(x_i) = y_i.

Consider the family of functions:

L_k^(n)(x) = ∏_{j=0, j≠k}^{n} (x − x_j)/(x_k − x_j),   k = 0, 1, …, n        (4.18)

We can see that they are polynomials of order n and have the property (interpolatory condition):

L_k^(n)(x_i) = δ_{ki} = { 1 if i = k;  0 if i ≠ k }        (4.19)

Then, if we define the polynomial:

p_n(x) = Σ_{k=0}^{n} L_k^(n)(x) y_k        (4.20)

we have:

p_n(x_i) = Σ_{k=0}^{n} L_k^(n)(x_i) y_k = y_i        (4.21)

The uniqueness of this interpolation polynomial can also be demonstrated (that is, there is only one polynomial of order n or less that satisfies this condition).

Lagrange Polynomials
In more detail, from the general definition (4.18) the equation of the first order polynomial (straight line) passing through two points (x0, y0) and (x1, y1) is:

p1(x) = L0^(1) y0 + L1^(1) y1 = ((x − x1)/(x0 − x1)) y0 + ((x − x0)/(x1 − x0)) y1        (4.22)

The second order polynomial (parabola) passing through three points is:

p2(x) = L0^(2) y0 + L1^(2) y1 + L2^(2) y2
      = ((x − x1)(x − x2))/((x0 − x1)(x0 − x2)) y0 + ((x − x0)(x − x2))/((x1 − x0)(x1 − x2)) y1 + ((x − x0)(x − x1))/((x2 − x0)(x2 − x1)) y2        (4.23)

In general, the interpolation polynomials have the form given in (4.20) for any order. Each of the Lagrange interpolation functions L_k^(n) associated with a node x_k (given in general by (4.18)) is:

L_k^(n)(x) = [(x − x0)(x − x1)…(x − x_{k−1})(x − x_{k+1})…(x − x_n)] / [(x_k − x0)(x_k − x1)…(x_k − x_{k−1})(x_k − x_{k+1})…(x_k − x_n)] = N(x)/D        (4.24)

The denominator has the same form as the numerator and D = N(x_k).

Example
Find the interpolating polynomial that passes through the three points (x1, y1) = (−2, 4), (x2, y2) = (0, 2) and (x3, y3) = (2, 8). Substituting in (4.20), or more specifically (4.23):

p2(x) = 4 (x − 0)(x − 2)/((−2 − 0)(−2 − 2)) + 2 (x + 2)(x − 2)/((0 + 2)(0 − 2)) + 8 (x + 2)(x − 0)/((2 + 2)(2 − 0))

p2(x) = (x² − 2x)/2 − (x² − 4)/2 + (x² + 2x) = x² + x + 2

Fig. 4.1 shows the complete interpolating polynomial p2(x) and the three Lagrange interpolation polynomials L_k^(2)(x), k = 1, 2, 3, corresponding to each of the nodal points. Note that the function corresponding to one node has the value 1 at that node and 0 at the other two.

Fig. 4.1
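The evaluation of (4.20) is direct to code. A small Matlab sketch using the data of the example above (written for illustration, not part of the original notes):

xd = [-2 0 2];  yd = [4 2 8];              % data points of the example
x  = linspace(-2, 2, 101);                 % evaluation points
p  = zeros(size(x));
for k = 1:numel(xd)
    Lk = ones(size(x));
    for j = [1:k-1, k+1:numel(xd)]
        Lk = Lk .* (x - xd(j)) / (xd(k) - xd(j));   % basis function, eq. (4.18)
    end
    p = p + yd(k)*Lk;                      % eq. (4.20)
end
plot(x, p, xd, yd, 'o')                    % reproduces p2(x) = x^2 + x + 2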
Exercise 4.1
Find the 5th order Lagrange interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1, 3.4, 3.8, 5.8, 4.8}.

Exercise 4.2
Show that an arbitrary polynomial p(x) of order n can be represented exactly by

p(x) = Σ_{i=0}^{n} p(x_i) L_i(x)

using an arbitrary set of (distinct) data points x_i.

4.1.3 Newton Interpolation

It can easily be demonstrated that the polynomial interpolating a set of points is unique (Exercise 4.2), and the Lagrange method allows us to find it. The Newton interpolation method eventually gives the same result, but it can be more convenient in some cases. In particular, it is simpler to extend the interpolation by adding extra points, which in the Lagrange method would require a complete re-calculation of the interpolation functions.

The general form of a polynomial used to interpolate n+1 data points is:

f(x) = b0 + b1 x + b2 x² + b3 x³ + … + bn x^n        (4.25)

Newton's method, like Lagrange's, gives us a procedure to find the coefficients. From Fig. 4.2 (linear interpolation between (x1, y1) and (x2, y2)) we can write:

(y − y1)/(x − x1) = (y2 − y1)/(x2 − x1)

which can be re-arranged as:

y = y1 + ((y2 − y1)/(x2 − x1))(x − x1)        (4.26)

This is Newton's expression for the first order interpolation polynomial (linear interpolation).

Fig. 4.2

That is, the Newton form of the equation of the straight line that passes through the two points (x1, y1) and (x2, y2) is p1(x) = a0 + a1(x − x1), where:

a0 = y1,    a1 = (y2 − y1)/(x2 − x1)        (4.27)

Similarly, the general expression for a second order polynomial passing through the three data points (x1, y1), (x2, y2) and (x3, y3) can be written as p2(x) = b0 + b1 x + b2 x² or, re-arranging:

p2(x) = a0 + a1(x − x1) + a2(x − x1)(x − x2)        (4.28)

Substituting the values for the three points we get, after some re-arrangement:

a0 = y1
a1 = (y2 − y1)/(x2 − x1)
a2 = [ (y3 − y2)/(x3 − x2) − (y2 − y1)/(x2 − x1) ] / (x3 − x1)        (4.29)

The individual terms in the above expressions are usually called "divided differences" and denoted by the symbol D. That is:

Dy_i = (y_{i+1} − y_i)/(x_{i+1} − x_i),    D²y_i = (Dy_{i+1} − Dy_i)/(x_{i+2} − x_i),    D³y_i = (D²y_{i+1} − D²y_i)/(x_{i+3} − x_i),    etc.

In this notation, (4.29) takes the form a0 = y1, a1 = Dy1 and a2 = D²y1. The general form of the Newton interpolation polynomial is then an extension of (4.27) and (4.28):

p_n(x) = a0 + a1(x − x1) + a2(x − x1)(x − x2) + …        (4.30)

or

p_n(x) = a0 + Σ_{i=1}^{n} a_i W_i(x)    with    W_i(x) = ∏_{j=1}^{i} (x − x_j)        (4.31)

with the coefficients a0 = y1, a_i = D^i y1.

Example
Consider again the problem of finding the interpolating polynomial that passes through the three points (x1, y1) = (−2, 4), (x2, y2) = (0, 2) and (x3, y3) = (2, 8). In this case it is usual and convenient to arrange the calculations in a table with the following quantities in each column:

x_i   y_i   Dy_i                        D²y_i
−2    4
            (2 − 4)/(0 − (−2)) = −1
 0    2                                 (3 − (−1))/(2 − (−2)) = 1
            (8 − 2)/(2 − 0) = 3
 2    8

Then the coefficients are a0 = 4, a1 = −1 and a2 = 1, and the polynomial is:

p2(x) = 4 − (x + 2) + (x + 2)(x − 0) = x² + x + 2

Note that it is the same polynomial found using Lagrange interpolation. An important property of Newton's construction of the interpolating polynomial is that it makes it easy to extend the interpolation by including more points.
If additional points are included, the new higher order polynomial can easily be constructed from the previous one:

p_{n+1}(x) = p_n(x) + a_{n+1} W_{n+1}(x)    with    W_{n+1}(x) = W_n(x)(x − x_{n+1})    and    a_{n+1} = D^{n+1} y1        (4.32)

In this way it has many similarities with the Taylor expansion, where additional terms increase the order of the polynomial. These similarities allow a treatment of the error in the same way as is done with Taylor expansions.

Exercise 4.3
Find the 5th order interpolation polynomial to fit the data xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1, 3.4, 3.8, 5.8, 4.8}, this time using Newton interpolation.
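As a sketch of how the divided-difference table can be computed, and the polynomial evaluated in nested form, the following Matlab fragment (written for these notes, using the three-point example above) may help:

xd = [-2 0 2];  yd = [4 2 8];
n  = numel(xd);
D  = yd(:);                            % current column of divided differences
a  = zeros(1, n);  a(1) = D(1);
for k = 2:n
    D = (D(2:end) - D(1:end-1)) ./ (xd(k:end)' - xd(1:end-k+1)');
    a(k) = D(1);                       % a = [4, -1, 1] for this data
end
x = 1.5;                               % evaluation point
p = a(n);
for k = n-1:-1:1
    p = a(k) + (x - xd(k)).*p;         % nested evaluation of eq. (4.30)
end
% p = 1.5^2 + 1.5 + 2 = 5.75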
Some Practical Notes:
In some of the examples seen here we have used equally spaced data. This is by no means necessary and the data can have any distribution. However, if the data is equally spaced, simpler expressions for the divided differences can be derived (not done here).

One of the problems with interpolation of data using a single polynomial is that the technique is very sensitive to noisy data. A very small change in the values of the data can lead to a drastic change in the interpolating function. This is illustrated in the following example. Fig. 4.3 shows the interpolating polynomial (in blue) for the data:

xd = {0, 1, 2, 3, 4, 5}
yd = {2, 1, 3.4, 4.8, 5.8, 4.8}

Now, if we add two more points with a slight amount of noise, xd' = {2.2, 2.7} and yd' = {3.5, 4.25} (shown with the filled black markers), the new interpolating polynomial (red line) shows a dramatic difference from the first one.

Fig. 4.3

4.1.4 Hermite Interpolation

The problem arises because the extra points force a higher degree polynomial, which can have a more oscillatory behaviour. An approach that avoids this problem is to use data for the derivative of the function too: if we also require the derivative values to be matched at the nodes, the oscillations are prevented. This is done with "Hermite interpolation". The development is rather similar to that of Newton's method, but more complicated because of the involvement of the derivative values. It can also be constructed easily with the help of a table (as in Newton's method) and divided differences. We will not cover the details of the derivation here, but simply the procedure. The table is similar to that for Newton interpolation, but each data point is entered twice and the derivative values are placed between the repeated data points, in alternate rows, as the first divided differences. The initial set-up for 2 points is:

i    x_i    y_i    Dy_i                            D²y_i                     D³y_i
1    x1     y1
                   y1'
1    x1     y1                                     A = (Dy1 − y1')/(x2 − x1)
                   Dy1 = (y2 − y1)/(x2 − x1)                                 C = (B − A)/(x2 − x1)
2    x2     y2                                     B = (y2' − Dy1)/(x2 − x1)
                   y2'
2    x2     y2

The corresponding interpolating polynomial is:

H(x) = y1 + y1'(x − x1) + A(x − x1)² + C(x − x1)²(x − x2)        (4.33)

Fig. 4.4 shows the function y = x sin(πx) compared with the interpolated curves obtained with the Hermite, Newton and cubic-spline methods (see next section); the data points are marked by circles. Newton interpolation results in a polynomial of order 4 while Hermite gives a polynomial of order 7. (If 7 points are used for the Newton interpolation the results are quite similar.)

Fig. 4.4

Another approach to avoiding the oscillations present when using high order polynomials is to use lower order polynomials to interpolate subsets of the data and assemble the overall approximating function piecewise. This is what is called "spline interpolation".
4.1.5 Spline Interpolation

Any piecewise interpolation of data by low order functions is called spline interpolation, and the simplest and most widely used is piecewise linear interpolation: simply joining consecutive data points by straight lines.

First Order Splines
If we have a set of n+1 data points (x_i, y_i), i = 0, …, n, the first order splines (straight lines) can be defined as:

f0(x) = y0 + m0(x − x0);    x0 ≤ x ≤ x1        (4.34)
f1(x) = y1 + m1(x − x1);    x1 ≤ x ≤ x2        (4.35)
…
f_i(x) = y_i + m_i(x − x_i);    x_i ≤ x ≤ x_{i+1}        (4.36)

with the slopes given by m_i = (y_{i+1} − y_i)/(x_{i+1} − x_i).

Quadratic and Cubic Splines
First order splines are straightforward: to fit a line in each interval between data points we need only 2 pieces of data, the data values at the 2 end points. If we now want to fit a higher order function, for example a second order polynomial (a parabola), we need to determine 3 coefficients and, in general, for an m-order spline we need m+1 equations. If we want a smooth function over the complete domain, we should impose continuity of the representation between contiguous intervals, and require as many derivatives as possible to be continuous at the joining points.

For n+1 points there are n intervals and consequently n functions to determine. If we use quadratic (second order) splines, their equations will be of the form f_i(x) = a_i x² + b_i x + c_i, and we need 3 equations per interval to find the 3 coefficients, that is, a total of 3n equations. We can establish the following conditions to be satisfied:

1) The values of the functions corresponding to adjacent intervals must be equal at the common interior nodes (no discontinuities at the nodes) and equal to the value of the data there. This can be represented by:

a_{i−1} x_i² + b_{i−1} x_i + c_{i−1} = a_i x_i² + b_i x_i + c_i = f(x_i)        (4.37)

for i = 2, …, n; that is, at the node x_i, the boundary between intervals i−1 and i, the functions defined in each of these intervals must coincide. This gives a total of 2n−2 equations (there are 2 equalities in the line above and n−1 interior points).

2) The first and last functions must pass through the first and last points (end points). This gives another 2 equations:

a_1 x_1² + b_1 x_1 + c_1 = f(x_1)    and    a_n x_{n+1}² + b_n x_{n+1} + c_n = f(x_{n+1})

3) The first derivative must be continuous at the interior nodes, f'_{i−1}(x_i) = f'_i(x_i) for i = 2, …, n. Then:

2 a_{i−1} x_i + b_{i−1} = 2 a_i x_i + b_i        (4.38)

which gives another n−1 equations.

All these give a total of 3n−1 equations when we need 3n, so an additional condition must be established. Usually this is chosen at an end point, for example stating that a_1 = 0 (this corresponds to asking for the second derivative to be zero at the first point, and results in the first two points being joined by a straight line).

For example, for a set of 4 data points we can establish the necessary equations as listed above, giving a total of 8 equations for 8 unknowns (having fixed a_1 = 0 already). This can be solved by the matrix techniques that we will study later.

Quadratic splines have some shortcomings that are not present in cubic splines, so the latter are preferred, although their calculation is even more cumbersome than that of quadratic splines. In the cubic case, the function and its first and second derivatives are continuous at the nodes. Because of their popularity, cubic splines are commonly found in computer libraries; Matlab, for example, has a standard function that calculates them.

To illustrate the advantages of this technique, Fig. 4.5 shows the cubic interpolation splines (blue line) for the original data set described in Fig. 4.3, compared with the cubic spline interpolation of the extended data set (red line, with the two extra points added).
To illustrate the advantages of this technique, Fig. 4.5 shows the cubic interpolation splines (blue line) for the original data set described in Fig. 4.3, compared with the cubic spline interpolation of the extended data set (red line, with the two points added). The figure shows that there is little difference in the overall result when the 2 extra points are added, contrary to what happens with the single polynomial interpolation shown in Fig. 4.3. Using the built-in Matlab function, the code for plotting the red line in this graph is simply:

Fig. 4.5 Comparison of splines interpolation with 2 added points

xd=[0,1,2,2.2,2.7,3,4,5];
yd=[2,1,3,3.5,4.25,4.8,5.8,4.8];
x=0:0.05:5;
y=spline(xd,yd,x);
plot(x,y,'r','Linewidth',2.5), hold
plot(xd,yd,'ok','MarkerSize',8,'MarkerFaceColor','w','LineWidth',2)

Note that these lines only include the drawing of the red line, and the command for the markers is not exactly the one used, since the added points have been drawn separately using the attributes 'or','MarkerSize',8,'MarkerFaceColor','r'.
Exercise 4.4
Using Matlab, plot the function $f(x) = 0.1\, e^{1.2x} \sin 2x$ in the interval [0, 4] and construct a cubic spline interpolation using the values of the function at the points xd = 0:0.5:4 (in Matlab notation). Use Excel to construct a table of divided differences for this function at the points xd and find the coefficients of the Newton interpolation polynomial. Use Matlab to plot the corresponding polynomial in the same figure as the splines and compare the results.

If the main objective is to create a smooth function to represent the data, it is sometimes preferable to choose a function that doesn't necessarily pass exactly through the data but approximates its overall behaviour. This is what is called "approximation". The problems here are how to choose the approximating function and what is considered the best choice.

4.2 Approximation
There are many different ways to approximate a function and you have seen some of them in detail already. Taylor expansions, least squares curve fitting and Fourier series are examples of this. Methods like "least squares" look for a single function or polynomial to approximate the desired function. Another approach, of which the Fourier series is an example, consists of using a family of simple functions to build an expansion that approximates the given function. The problem then is to find the appropriate set of coefficients for that expansion. Taylor series are somewhat related; the main difference is that while the other expansion-based methods attempt to find an overall approximation, Taylor series are meant to approximate the function at one particular point and in its close vicinity.

4.2.1 Least Squares Curve Fitting
One of the most common problems of approximation is fitting a curve to experimental data. In this case, the usual objective is to find the curve that approximates the data while minimizing some measure of the error. In the method of least squares the measure chosen is the sum of the squares of the differences between the data and the approximating curve. If the data is given by $(x_i, y_i)$, i = 1, ..., n, we can define the error of the approximation by:

$E = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$   (4.39)

The commonest choices for the approximating functions are polynomials: straight lines, parabolas, etc. Then, the minimization of the error will give the necessary relations to determine the coefficients.

Approximation by a Straight Line
The equation of a straight line is $f(x) = a + bx$, so the error to minimise is:

$E(a,b) = \sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2$   (4.40)

The error is a function of the parameters a and b that define the straight line. Then, the minimization of the error can be achieved by making the derivatives of E with respect to a and b equal to zero. These conditions give:

$\dfrac{\partial E}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i - n a - b \sum_{i=1}^{n} x_i = 0$   (4.41)

$\dfrac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i y_i - a \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} x_i^2 = 0$   (4.42)

which can be simplified to:

$n a + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$   (4.43)

and

$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$   (4.44)

Solving the system for a and b gives:

$a = \dfrac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$   and   $b = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$   (4.45)

Example
Fitting a straight line to the data given by:
xd = {0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3} and yd = {0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95}
Then $\sum_{i=1}^{10} xd_i = 15.15$, $\sum_{i=1}^{10} yd_i = 15.08$, $\sum_{i=1}^{10} xd_i^2 = 32.8425$ and $\sum_{i=1}^{10} xd_i\, yd_i = 32.577$, and the parameters of the straight line are:
a = 0.01742473648290 and b = 0.98387806172746
Fig. 4.6 shows the approximating straight line (red) together with the curves for Newton interpolation (black) and for cubic splines (blue). The problems occurring with the use of higher order polynomial interpolation are again evident.

Fig. 4.6
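A quick Matlab check of (4.45) for the data of this example (an illustrative sketch; the built-in polyfit is used only for comparison):

xd = [0 0.2 0.8 1 1.2 1.9 2 2.1 2.95 3];
yd = [0.01 0.22 0.76 1.03 1.18 1.94 2.01 2.08 2.9 2.95];
n  = numel(xd);
Sx = sum(xd); Sy = sum(yd); Sxx = sum(xd.^2); Sxy = sum(xd.*yd);
a  = (Sxx*Sy - Sx*Sxy) / (n*Sxx - Sx^2)     % intercept, eq. (4.45): approx. 0.0174
b  = (n*Sxy  - Sx*Sy ) / (n*Sxx - Sx^2)     % slope, eq. (4.45): approx. 0.9839
coeff = polyfit(xd, yd, 1)                  % Matlab returns [b a], highest power first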
Example
The data (xd, yd) shown in Fig. 4.7 appears to behave in an exponential manner. Then, defining the variable $zd_i = \log(yd_i)$, zd should vary linearly with xd. We can then fit a straight line to the pair of variables (xd, zd). If the fitting function is the function z(x), the function $y = e^{z(x)}$ is a least squares fit to the original data.

Fig. 4.7

Second Order Approximation
If a second order curve is used for the approximation, $f(x) = a + bx + cx^2$, there are 3 parameters to find and the error is given by:

$E(a,b,c) = \sum_{i=1}^{n} \left( y_i - (a + b x_i + c x_i^2) \right)^2$

Making the derivatives of E with respect to a, b and c equal to zero will give the necessary 3 equations for the coefficients of the parabola. The expressions are similar to those found for the straight line fit, although more complicated.

Matlab has standard functions to perform least squares approximations with polynomials of any order. If the data is given by (xd, yd), and m is the desired order of the polynomial to fit, the function
coeff = polyfit(xd, yd, m)
returns the coefficients of the polynomial. If now x is the array of values where the approximation is wanted,
y = polyval(coeff, x)
returns the values of the polynomial fit at the points x.

Exercise 4.5
Find the coefficients of a second order polynomial that approximates the function $s(x) = e^{x}$ in the interval [−1, 1] in the least squares sense using the points xd = [−1, 0, 1]. Plot the function and the approximating polynomial together with the Taylor approximation (Taylor series truncated to second order) for comparison.

Approximation using Continuous Functions
The same idea of "least squares", that is, minimizing an error expression in the least squares sense, can also be applied to the approximation of continuous functions. Imagine that instead of a discrete data set we have a rather complicated analytical expression representing the function of interest. This case is rather different from all we have seen previously, because now there is certainty about the value of the desired variable at every point, but it might be desirable to have a simpler expression representing this behaviour, at least in a certain domain, for further analysis purposes.
For example is the function is s(x) and we want to approximate it using a second order polynomial, we can formulate the problem as minimizing the square of the difference over the domain of interest, that is: ∫ Ω −++= dxxscbxaxE 22 ))(( (4.46) Again, asking for the derivatives of E with respect to a, b and c to be zero, will give us the necessary equations to find these coefficients. There is one problem though, the resultant matrix problem, particularly for large systems (high order polynomials) is normally very ill-conditioned, so some special precautions need to be taken. 4.2.2 Approximation using Orthogonal Functions We can extend the same ideas of least squares to the approximation of a given function by an expansion in terms of a set of “basis functions”. You are already familiar with this idea from the use of Fourier expansions to approximate functions, but the concept is not restricted to use of sines page 34 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez and cosines as basis functions. First of all, we called a base a family of functions th
at satisfy a set of properties. Similarly with what you know of vectors, for example the set of unit vectors xˆ , yˆ and zˆ constitute a base in the 3D space because any vector in that space can be expressed as a combination of these three. Naturally, we don’t actually need these vectors to be perpendicular to each other but that helps (as we’ll see later). However, the base must be complete, that is no vector in 3D escapes this representation. For example if we only consider the vectors xˆ and yˆ , no combination of them can possibly have a component along the zˆ axis and the resultant vectors are restricted to the xy- plane. In the same sense, we want a set of basis functions to be complete (as sines and cosines are) in order that any function can have its representation by an expansion. The difference now is that unlike the 3D space, we need an infinite set (that is the dimension of the space of functions). So, if we select a complete set of basis functions: )(xkφ , we can represent our function by: 1 ( ) ( ) ( ) n n k k k f x f x c xϕ = ≈ = ∑ (4.47) an expansion truncated to n terms. In this context, the error of this approximation is the function difference between the exact and the approximate functions: ( ) ( ) ( )n nr x f x f x= −  and we can use the least squares ideas again, seeking to minimise the norm of this error. That is: the quantity, (error residual): ∫ Ω −≡= dxxfxfxrR nnn 22 ))(~)(()( (4.48) with the subscript n because the error residual above is also a function of truncation level. This concept of norm is analogous to that of the norm or modulus of a vector: ∑ = =⋅= N i ivvvv 1 22  . In order to extend it to functions we need to introduce the inner product of functions, (analogous to the dot product of vectors): If we have two functions f and g defined over the same domain Ω, their inner product is the quantity: , ( ) ( )f g f x g x dx Ω = ∫ (4.49) The inner product satisfies the following properties: 1. , 0f f > for all nonzero functions 2. , ,f g g f= for all functions f and g. 3. 1 2 1 2, , ,f g g f g f gα β α β+ = + for all functions f and g and scalars α and β. Note: The above definition of inner product is sometimes extended using a weighting function w(x) in the form: ∫ Ω = dxxgxfxwgf )()()(, and provided that w(x) is non-negative it satisfies all the required properties. In a similar form as with the dot product of vectors, )(),()( 2 xfxfxf = Using the inner product definition, the error expression above can be written as: 2 22( ( ) ( )) ( ( ) ( )) ( ) 2 ( ), ( ) ( ), ( )n n n n n nR f x f x f x f x dx f x f x f x f x f x Ω = − = − = − +∫     (4.50) ELEC 0030 Numerical Methods page 35 © University College London 2019 – F.A. Fernandez and if we write )(~ xfn as: ∑ = = n k kkn xcxf 1 )()(~ φ , as above, we get: ∑∑∑ = == +−= n j n k jkjk n k kkn xxccxxfcxfR 1 11 2 )(),()(),(2)( φφφ (4.51) We can see that the error residual Rn, is a function of the coefficients ck of the expansion and then, to find those values that minimize this error, we can make the derivatives of Rn with respect to ck equal to zero for all k. That is: 0= ∂ ∂ k n c R for k = 1, …, n. (4.52) The first term is independent of ck, so it will not contribute and the other two will yield the general equation: )(),()(),( 1 xxfxxc k n j jkj φφφ =∑ = for k = 1, …, n. 
(4.53) Writing this down in detail gives: (k = 1) 11221111 ,,,, φφφφφφφ fccc nn =+++  (k = 2) 22222112 ,,,, φφφφφφφ fccc nn =+++  (k = n) nnnnnn fccc φφφφφφφ ,,,, 2211 =+++   which can be written as a matrix problem of the form: Φc = s, where the matrix Φ contains all the inner products (in all combinations), the vector c is the list of coefficients and s is the list of values in the right hand side (which are all known: we can calculate them all). We can find the coefficients solving the system of equations but we can see that this will be a much easier task if all crossed products of basic functions yielded zero; that is, if 0)(),( =xx jk φφ for all j ≠ k (4.54) This is what is called orthogonality and is a very useful property of the functions in a base. (Similar with what happens with a base of perpendicular vectors: all dot products between different members of the base are zero). Then, if the basis functions )(xkφ are orthogonal, the solution of the system above is straight-forward: )(),( )(),( xx xxf c kk k k φφ φ = (4.55) Remark: You can surely recognise here the properties of the sine and cosine functions involved in Fourier expansions: They form a complete set, they are orthogonal: 0)(),( =xx jk φφ where kφ is either xkπsin or xkπcos . and the coefficients of the expansions are given by the same expression (4.55) above. page 36 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez Fourier expansions are convenient but sinusoidal functions are not the only or simpler possibility. There are many other sets of orthogonal functions that can form a base, in particular, we are interested in polynomials because of the convenience of calculations. Families of Orthogonal Polynomials There are many different families of polynomials that can be used in this manner. Some are more useful than others in particular problems. They are often originated as solution to some differential equations. 1. Legendre Polynomials They are orthogonal polynomials in the interval [−1, 1], with weighting function 1. That is, 0)()()(),( 1 1 == ∫ − dxxPxPxPxP jiji if i ≠ j. They are usually normalised so that Pn(1) = 1 and their norm in this case is: 12 2)()()(),( 1 1 + == ∫ − n dxxPxPxPxP nnnn (4.56) The first few are: 1)(0 =xP xxP =)(1 2)13()( 22 −= xxP 2)35()( 33 xxxP −= 8)33035()( 244 +−= xxxP , )637015( 8 1 53 5 xxxP +−= )2313151055( 16 1 642 6 xxxP +−+−= )42969331535( 16 1 753 7 xxxxP +−+−= etc. Fig. 4.8 In general they can be defined by the expression: nn n n n n xxn xP )1( !2 )1()( 2− ∂ ∂− = (4.57) They also satisfy the recurrence relation: )()()12()()1( 11 xnPxxPnxPn nnn −+ −+=+ (4.58) 2. Chebyshev Polynomials The general, compact definition of these polynomials is: ))(coscos()( 1 xnxTn −= (4.59) and they satisfy the following orthogonality condition:      == ≠= ≠ = − = ∫ − 0if 0if2 if0 1 )()( )(),( 1 1 2 ji ji ji dx x xTxT xTxT jiji π π (4.60) That is, they are orthogonal in the interval [−1, 1] with the weighting function: 211)( xxw −= . They are characterised by having all their oscillations of the same amplitude and in the interval -1 0 1 -1 0 1 ELEC 0030 Numerical Methods page 37 © University College London 2019 – F.A. Fernandez [−1, 1] and also all their zeros occur in the same interval. The first few Chebyshev polynomials are: 1)(0 =xT xxT =)(1 12)( 22 −= xxT xxxT 34)( 33 −= 188)( 244 +−= xxxT xxxxT 52016)( 355 +−= 1184832)( 2466 −+−= xxxxT xxxxxT 75611264)( 3577 −+−= … etc. Fig. 
4.9 They also can be constructed from the recurrence relation: )()(2)( 11 xTxxTxT nnn −+ −= for n ≥ 1. (4.61) Notice that for deriving expansions in terms of these polynomials, the modified inner product including the weighting function must be used. 3. Hermite Polynomials The definition of the Hermite polynomials is: 2 2 ( ) ( 1) n n x x n n dH x e e dx −= − and their orthogonalty condition is: 2 ( ) ( ) 2 !x nn m nmH x H x e dx nπ δ ∞ − −∞ =∫ (4.62) That is, they are orthogonal in (−∞, ∞) with the weighing function 2 ( ) xw x e−= The first few polynomials are: 0( ) 1H x = 1( ) 2H x x= 2 2( ) 4 2H x x= − 3 3( ) 8 12H x x x= − 4 2 4( ) 6 3H x x x= − + 5 3 5 ( ) 32 160 120H x x x x= − + 6 4 2 6 ( ) 64 480 720 120H x x x x= − + − and the rest can be derived usi
ng the recursive relation: 1 1( ) 2 ( ) 2 ( )n n nH x xH x nH x+ −= − (4.63) -1 0 1 -1 0 1 page 38 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez There are multiple other possibilities. For example, another commonly used family of functions are the Laguerre polynomials, which are orthogonal in [0, ∞) with weighting function xe− . Also, sets of orthogonal functions can be constructed starting from a “mother function” and considering its successive derivatives. These can be modified one by one so the whole set is orthogonal. For example, the set of Hermite polynomials is orthogonal but not orthonormal and needs a weighting function. This suggests modifying this set or, creating the functions (known as Hermite functions) in the form: 2 21( ) ( ) !2 x n x e H x n ψ π −= (4.64) which are orthonormal; that is, they satisfy: ( ) ( )n m nmx x xψ ψ δ ∞ −∞ =∫ . These functions correspond to the orthogonalized form of the derivatives of the Gaussian function (the mother function in this case). Example Approximate the function 221 1)( xa xf + = with a = 4 in the interval [−1, 1] in the least squares sense using Legendre polynomials up to order 2. The approximation is: ∑ = = 2 1 )()(~ k kk xPcxf and the coefficients are: ∫ − = 1 1 2 )()()( 1 dxxPxf xP c k k k with 12 2)( 2 + = k xPk . Form the expression above we can see that the calculation of the coefficients will involve integrals: ∫ − + = 1 1 221 dx xa xI m m which satisfy the recurrence relation: 222 1 )1( 2 −− − = mm Iaam I with a a I 10 tan 2 −= We can also see that due to symmetry Im = 0 for m odd. (Integral of an odd function over the interval [−1, 1]). Also, because of this, only even numbered coefficients are necessary. (Odd coefficients of the expansion are zero). The coefficients are then: 0 1 1 22 1 1 0 2 1 1 1 2 1)( 2 1 Idx xa dxxfc = + == ∫∫ −− ( )02 1 1 22 21 1 22 34 5 1 13 4 5)()( 2 5 IIdx xa xdxxfxPc −= + − == ∫∫ −− ( )024 1 1 44 3303516 9)()( 2 9 IIIdxxfxPc +−== ∫ − ELEC 0030 Numerical Methods page 39 © University College London 2019 – F.A. Fernandez ( )0246 1 1 66 510531523132 13)()( 2 13 IIIIdxxfxPc −+−== ∫ − ( )02468 1 1 88 3512606930120126435256 17)()( 2 17 IIIIIdxxfxPc +−+−== ∫ − The results are shown in Fig. 4.10, where the red line corresponds to the approximating curve. The green line at the bottom shows the absolute value of the difference between the two curves. The total error of the approximation (integral of this difference, divided by the integral of the original curve) is of 10.5%. Fig. 4.10 Exercise 4.6 Use cosine functions ( )cos( xnπ ) instead of Legendre polynomials (in the form of a Fourier series) and compare the results. Remarks: We can observe in Fig. 4.10 of the example above that the error oscillates through the domain. This is typical of least squares approximation, where the overall error is minimised. In this context, Chebyshev polynomials are the best possible choice. The error obtained with its approximation is the least possible with any other polynomial up to the same degree. Furthermore, these polynomials have other useful properties. We have seen before interpolation of data by higher order polynomials, using equally spaced data points and the problems that this cause were quite clear. However, one then can think if there is a different distribution of points that helps in minimising the error. 
The answer is yes, the optimum arrangement is to locate the data points on the zeros of the Chebyshev polynomial of the order necessary to give the required number of data points. These occur for:       −= π n kxk 2 12cos for k = 1, …, n. 4.2.3 Approximation to a Point The approximation methods we have studied so far attempt to reach a “gobal” approximation to a function; that is, to minimise the error over a complete domain. This might be desirable in many cases but there could be others where it is more important to achieve high approximation at one particular point of the domain and in its close vicinity. One such method is the Taylor approximation, where the function is approximated by a Taylor polynomial. -1 0 1 0 0.5 1 page 40 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez Taylor Polynomial Approximation Among polynomial approximation methods, the Taylor polynomial gives the maximum possible “order of contact” between the function and the polynomial; that is, the n-order Taylor polynomial coincides with the function and with its n derivatives at the point of contact. The Taylor polynomial for approximating a function f(x) at a point a is given by: n n ax n afaxafaxafafxp )( ! )()( !2 )(”))((‘)()( )( 2 −++−+−+=  (4.65) and the error incurred is given by: 1 )1( )( )!1( )( ++ − + n n ax n f ξ where ξ is a point between a and x. (See Appendix). The figure shows the function (blue line): xxxy ππ 2cossin)( += with the Taylor approximation (red line) using a Taylor polynomial of order 9 (Taylor series truncated at order 9). For comparison, the Newton interpolation using 9 equally spaced points (indicated with * and a black line), giving a single polynomial of order 9. While the Taylor approximation is very good in the vicinity of zero (9 derivatives are also matched there) it deteriorates badly away from zero. Fig. 4.11 If a function varies rapidly or has a pole, polynomial approximations will not be able to achieve high degree of accuracy. In that case an approximation using rational functions will be better. The polynomial in the denominator will provide the facilities for rapid variations and poles. Padé Approximation Padé Approximants are rational functions, or ratios of polynomials, that fit the value of a function and a number of its derivatives at one point. They usually provide an approximation that is better than that of the Taylor polynomials, in particular in the case of functions containing poles. A Padé approximation to a function f(x) that can be represented by a Taylor series in [a, b] (or Padé approximant) is a ratio between two polynomials pm(x) and qn(x) of orders m and n respectively: 01 2 2 01 2 2 )( )()()( bxbxbxb axaxaxa xq xpxRxf n n m m n mm n ++++ ++++ ==≈   (4.66) For simplicity, we’ll consider only approximations to a function at x = 0. For other values, a simple transformation of variables can be used. If the Taylor approximation at x = 0 (Maclaurin series) to f(x) is: 01 2 2 0 )( cxcxcxcxcxt kk k i i i ++++==∑ =  with nmk += (4.67) -1 0 1 -2 -1 0 1 ELEC 0030 Numerical Methods page 41 © University College London 2019 – F.A. Fernandez we can write: )( )()()( xq xpxRxt n mm n =≈ or )()()( xpxqxt mn = . (4.68) Considering now this equation in its expanded form: 01 2 201 2 201 2 2 ))(( axaxaxabxbxbxbcxcxcxc m m n n k k ++++=++++++++  (4.69) we can establish a system of equations to find the coefficients of pm and qm (or simply p and q). 
First of all, we can force this condition to apply at x = 0 (exact matching of the function at x = 0). This will give: )0()0()0( mn pqt = or: 000 abc = (4.70) but, since the ratio R doesn’t change if we multiply the numerator and denominator by any number, we can choose the value of 10 =b and this gives us the value of 0a . Taking now the first derivative of (4.69) will give: +++++++− ))(2( 011 2 2 1 bxbxbcxcxkc nn k k  12 1 12 1 01 2)2)(( axaxmabxbxnbcxcxc m m n n k k +++=++++++ −−  (4.71) and again, forcing this equation to be satisfied at x = 0 (exact matching of the first derivative), gives: 11001 abcbc =+ (4.72) which gives an equation relating coefficients 1a and 1b : 01101 bcbca =− (4.73) To continue with the process of taking derivatives of (4.69) we can establish first some general formulae. If we call g(x) the product of
denominators in the left hand side of (4.69), we can apply repeatedly the product rule to form the derivatives giving: )(‘)()()(‘)(‘ xqxtxqxtxg += )(”)()(‘)(‘2)()(”)(” xqxtxqxtxqxtxg ++=  ∑ = − − = i j jjii xqxt jij ixg 0 )()()( )()( )!(! !)( (4.74) and since we are interested on the values of the derivatives at x = 0, we have: ∑ = − − = i j jjii qt jij ig 0 )()()( )0()0( )!(! !)0( (4.75) The first derivative of a polynomial, say, )(xqn is: 12 1 2 bxbxnb nn +++ −  , so the second derivative will be: 23 2 232)1( bxbxnbn nn +⋅++− −  and so on. Then these derivatives evaluated at x = 0 are successively: jbjbbbb !,,,)432(,)32(,2, 4321 ⋅⋅⋅ Then, we can write (4.75) as: ∑∑ = − = − =−− = i j jji i j jji i bcibjcji jij ig 00 )( !!)!( )!(! !)0( and equating this to the ith derivative of )(xpm evaluated at x = 0, which is iai! gives: page 42 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez ∑ = −= i j jjii bca 0 , or ∑ = −+= i j jjiii bcbca 1 0 but since 10 =b we can finally write: i i j jjii cbca =−∑ = − 1 for i = 1, …, k, (k = m + n). (4.76) where we take the coefficients 0=ia for mi > and 0=ib for ni > . Example Consider the function x xf − = 1 1)( . This function has a pole at x = 1 and polynomial approximations will not perform well. The Taylor coefficients of this function are given by: !2 )12(531 j jc jj −⋅⋅ =  So the first five are: 10 =c , 2 1 1 =c , 8 3 2 =c , 16 5 3 =c , 128 35 4 =c which gives a polynomial of order 4. From the equations above we can calculate the Padé coefficients. We choose: m = n =2, so k = m + n = 4. Taking 10 =b , 000 abc = gives 10 =a The other equations (from asking the derivatives to fit) are: i i j jjii cbca =−∑ = − 1 for i = 1, …, k, (k = m + n) and in detail: 4403122134 33021123 220112 1101 cbcbcbcbca cbcbcbca cbcbca cbca =−−−− =−−− =−− =− but for m = n =2, we have 04343 ==== bbaa and the system can be written as: 1 0 1 1 2 1 1 0 2 2 2 1 1 2 3 3 1 2 2 4 0 0 0 0 0 0 a c b c a c b c b c c b c b c c b c b c − = − − = − − = − − = and we can see that the second set of equations can be solved for the coefficients bi. Re-writing this system: 2 1 1 2 3 3 1 2 2 4 c b c b c c b c b c + = − + = − In general, when 2kmn == as in this case the matrix that define the system will be of the form: ELEC 0030 Numerical Methods page 43 © University College London 2019 – F.A. Fernandez                 +−+−+− −−− −− − 0321 3012 2101 1210 rrrr rrrr rrrr rrrr nnn n n n      This is a special kind of matrix called the Toeplitz matrix that has the same element along each diagonal so it is defined by a total of 2n-1 numbers. There are special methods for solving systems with this type of matrix. Once the coefficients bi are known the coefficients ai are easily found. Solving the system will give: ]161,43,1[ −=a and ]165,45,1[ −=b giving: 2 2 2 210 2 2102 2 0.31251.251 0.06250.751)( xx xx xbxbb xaxaaxR +− +− = ++ ++ = The figure shows the function x xf − = 1 1)( (blue line) and the Padé approximant (red line) 2 2 2 2 0.31251.251 0.06250.751)( xx xxxR +− +− = The Taylor polynomial up to 4th order: 44 3 3 2 210)( xcxcxcxccxP ++++= with a green line. Fig. 4.12 It is clear that the Padé approximant gives a better fit, particularly closer to the pole. Increasing the order, the Padé approximation gets closer to the singularity. The figure shows the approximations for m = n = 1, 2, 3 and 4. The poles closer to 1 of this functions are: 333.1:)(11 xR 106.1:)(22 xR 052.1:)(33 xR 031.1:)(44 xR Fig. 
4.13 0 1 0 2 4 6 8 0 0.5 1 0 2 4 6 8 page 44 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez We can see that even )(11 xR , a ratio of linear functions of x, gives a better approximation than the 4th order Taylor polynomial. Exercise 4.7 Using the Taylor (McLaurin) expansion of f(x) = cos x, truncated to order 4: 242 1)( 424 0 xxxcxt k k k +−== ∑ = Find the Padé approximant: 01 2 2 01 2 2 2 22 2 )( )()( bxbxb axaxa xq xpxR ++ ++ == ELEC 0030 Numerical Methods page 45 © University College London 2019 – F.A. Fernandez 5. Matrix Computations We have seen earlier that a number of issues arise when we consider errors in the calculations dealing with machine numbers. When matrices are involved, the problems of accuracy of representation, error propagation and sensitivity of the solutions to small variations in the data are much more important. Before discussing any methods of solving matrix equations, we consider first the rather fundamental matrix property of ‘condition number’. 5.1 ‘Condition’ of a Matrix We have seen that multiplying or dividing two floating-point numbers gives an error of the order of the ‘last preserved bit’. If, say, two numbers are held to 8 decimal digits, the resulting product (or quotient) will effectively have its least significant bit ‘truncated’ and therefore have a relative uncertainty of ± 10–8. By contrast, with matrices and vectors, multiplying (that is, evaluating y = Ax) or ‘dividing’ (that is, solving Ax = y for x) can lose in some cases ALL significant figures! Before examining this problem we have to define matrix and vector norms. 5.1.1 Vector and matrix norms To introduce the idea of ‘length’ into vectors and matrices, we have to consider norms: Vector norms If xT = (x1, x2,…., xn) is a real or complex vector, a general norm is denoted by Nx and is defined by: 1 1 Nn N iN i x =   =     ∑x (5.1) So the usual “Euclidian norm”, or “length” is 2x or simply x : 2212 … nxx ++=x (5.2) Other norms are used, e.g. 1x and ∞x , the latter corresponding to the greatest in magnitude |xi| (Show this as an exercise). Matrix norm If A is an n-by-n real or complex matrix, we denote its norm defined by: 0 max NN x N ≠   =      Ax A x for any choice of the vector x (5.3) According to our choice of N, in defining the vector norms by (5.1), we have corresponding 1Ax , 2Ax , ∞Ax , the Euclidian N = 2 being the commonest. Note that (ignoring the question “how do we find its value”), for given A and N, A has some specific numerical value ≥ 0. page 46 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez 5.1.2 ‘Condition’ of a Linear System Ax = y This is in the context of Hadamard’s general concept of a ‘well-posed-problem’, which is roughly one where the result is not too sensitive to small changes in the problem specification. Definition: The problem of finding x, satisfying Ax = y, is well posed or well conditioned if: (i) a unique x satisfies Ax = y, and (ii) small changes in either A or y result in small changes in x. For a quantitative measure of “how well conditioned” a problem is, we need to estimate the amount of variation in x when y varies and/or the variation in x when A changes slightly or the corresponding changes in y when either (or both) x and A vary. Suppose A is fixed, but y changes slightly to y + δy, with the associated x changing to x + δx. 
We have: Ax = y, (5.4) and so: A (x + δx) = y + δy Subtracting gives: A δx = δy or δx = A–1 δy (5.5) From our definition (5.3), we must have for any A and z: A z Az ≤ and so zAAz ⋅≤ (5.6) Taking the norm of both sides of (5.5) and using inequality (5.6) gives: yAyAx δδδ ⋅≤= −− 11 (5.7) Taking the norm of (5.4) and using inequality (5.6) gives: xAxAy ⋅≤= (5.8) Finally multiplying corresponding sides of (5.7) and (5.8) and dividing by yx gives our fundamental result
:

$\dfrac{\|\delta \mathbf{x}\|}{\|\mathbf{x}\|} \le \|\mathbf{A}\| \, \|\mathbf{A}^{-1}\| \, \dfrac{\|\delta \mathbf{y}\|}{\|\mathbf{y}\|}$   (5.9)

For any square matrix A we introduce its condition number, defined as:

cond(A) = $\|\mathbf{A}\| \, \|\mathbf{A}^{-1}\|$   (5.10)

We note that a 'good' condition number is small, near to 1.

5.1.3 Relevance of the Condition Number
The quantity $\|\delta \mathbf{y}\| / \|\mathbf{y}\|$ can be interpreted as a measure of relative uncertainty in the vector y. Similarly, $\|\delta \mathbf{x}\| / \|\mathbf{x}\|$ is the associated relative uncertainty in the vector x. From equations (5.9) and (5.10), cond(A) gives an upper bound (worst case) factor of degradation of precision between y and x = A⁻¹y. Note that if we re-write the equations from (5.4) for A⁻¹ instead of A (and reversing x and y), equation (5.9) will be the same but with x and y reversed, and (5.10) would remain unchanged. These two equations therefore give the important result that:

If A denotes the precise transformation y = Ax and δx, δy are related small changes in x and y, the ratio $\dfrac{\|\delta \mathbf{x}\| / \|\mathbf{x}\|}{\|\delta \mathbf{y}\| / \|\mathbf{y}\|}$ must lie between 1/cond(A) and cond(A).

Example
Here is an example using integers for total precision. Suppose:

$\mathbf{A} = \begin{pmatrix} 100 & 99 \\ 99 & 98 \end{pmatrix}$

We then have, for Ax = y:

$\mathbf{A} \begin{pmatrix} 1000 \\ -1000 \end{pmatrix} = \begin{pmatrix} 1000 \\ 1000 \end{pmatrix}$

Shifting x slightly gives:

$\mathbf{A} \begin{pmatrix} 1001 \\ -999 \end{pmatrix} = \begin{pmatrix} 1199 \\ 1197 \end{pmatrix}$

Alternatively, shifting y slightly:

$\mathbf{A} \begin{pmatrix} 803 \\ -801 \end{pmatrix} = \begin{pmatrix} 1001 \\ 999 \end{pmatrix}$

So a small change in y can cause a big change in x, or vice versa. We have this clear moral, concerning any matrix multiplication or (effectively) inversion: for given A, either multiplying (Ax) or 'dividing' (A⁻¹y) can be catastrophic, the degree of catastrophe depending on cond(A) and on the 'direction' of change in x or y. In the above example, cond(A) is 39206 (approximately 40000).
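The example is easy to reproduce as a quick Matlab check (cond is the built-in 2-norm condition number; the sensitivity figures below are approximate):

A  = [100 99; 99 98];
cond(A)                          % approximately 3.9206e+04
x  = [1000; -1000];  y  = A*x    % y  = [1000; 1000]
x2 = [1001;  -999];  y2 = A*x2   % y2 = [1199; 1197]
norm(x2 - x)/norm(x)             % relative change in x: 0.001 (0.1%)
norm(y2 - y)/norm(y)             % resulting relative change in y: about 0.2 (20%)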
5.2 Matrix Computations – Problem Types
We now consider methods for solving matrix equations. The most common problems encountered are of the form:

A x = y   (5.11)

or

A x = 0, requiring det(A) = 0   (5.12)

(we will consider this a special case of (5.11)) or

A x = λ B x   (5.13)

where A (and B) are known n×n matrices and x and λ are unknown. Usually B (and sometimes A) is positive definite (meaning that xᵀBx > 0 for all nonzero x). A is sometimes complex (and also B), but numerically the difference is straightforward, and so we will consider only real matrices.

5.2.1 Types of Matrices A and B (Sparsity Patterns)
We can classify the problems (or the matrices) according to their sparsity (that is, the proportion of zero entries). This has a strong bearing on the choice of solution method. In this form, the main categories are: dense, where most of the matrix elements are nonzero, and sparse, where a large proportion of them are zero. Among the sparse matrices, we can distinguish several types:
Banded matrices: all nonzero elements are grouped in a band around the main diagonal (fixed and variable band).
Arbitrarily sparse matrices.
Finally, sparse matrices with any pattern, but where the elements are not stored; that is, elements are 'generated' or calculated each time they are needed in the solving algorithm.

* * * 0 0 0
* * * * 0 0
* * * * * 0
0 * * * * *
0 0 * * * *
0 0 0 * * *

Fig. 5.1 Zeros and non-zeros in a band matrix of semi-bandwidth 3

We can also distinguish between different solution methods, basically:
DIRECT, where the solution emerges in a finite number of calculations (if we temporarily ignore round-off error due to finite word-length).
INDIRECT or iterative, where a step-by-step procedure converges towards the correct solution.
Indirect methods can be specially suited to sparse matrices (especially when the order is large) as they can often be implemented without the need to store the entire matrix A (or intermediate forms of matrices) in high speed memory.
All the common direct routines are available in software libraries and in books and journals, most commonly in Fortran or Fortran90/95, but also some in C.

5.3 Direct Methods of Solving Ax = y
5.3.1 Gauss elimination or LU decomposition
The classic solution method of (5.11) is the Gauss method. For example, given the system:

$\begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 11 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$   (5.14)

we subtract 2 times the first row from the second row, and then we subtract 3 times the first row from the third row, to give:

$\begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & -6 & -10 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ -2 \end{pmatrix}$   (5.15)

and then subtracting 2 times the second row from the third row gives:

$\begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}$   (5.16)

The steps from (5.14) to (5.16) are termed 'triangulation' or 'forward elimination'. The triangular form of the left-hand matrix of (5.16) is crucial; it allows the next steps. The third row immediately gives:

x3 = 0   (5.17a)

and substitution into row 2 gives (−3)x2 + 0 = −1, and so:

x2 = 1/3   (5.17b)

and then into row 1, x1 + 4(1/3) + 0 = 1, and so:

x1 = −1/3   (5.17c)

The steps through (5.17) are called 'back-substitution'. At this point we ignore the complications involved with 'pivoting', a technique sometimes required to improve numerical stability and which consists of changing the order of rows and columns.
Important points about this algorithm:
1. When performed on a dense matrix (or on a sparse matrix but not taking advantage of the zeros), computing time is proportional to n³ (n: order of the matrix). This means that doubling the order of the matrix will increase computation time by up to 8 times!
2. The determinant comes immediately as the product of the diagonal elements of (5.16).
3. Algorithms that take advantage of the special 'band' and 'variable-band' structures are very straightforward, obtained by changing the limits of the loops when performing row or column operations, plus some 'book-keeping'. For example, in a matrix of 'semi-bandwidth' 3, the first column has non-zero elements only in the first 3 rows, as in Fig. 5.1. Then only those 3 numbers need storing, and only the 2 elements below the diagonal need 'eliminating' in the first column.
4. Oddly, it turns out that, in our context, one should NEVER form the inverse matrix A⁻¹ in order to solve Ax = y for x. Even if the solution is needed for a number of different right-hand-side vectors y, it is better to 'keep a record' of the triangular form of (5.16) and back-substitute as necessary.
5. Other methods very similar to Gauss are due to Crout and Choleski. The latter is (only) for use with symmetric matrices. Its advantage is that time AND storage are half those of the orthodox Gauss.
6. There are variations in the implementation of the basic method developed to take advantage of the special type of sparse matrices encountered in some cases, for example when solving problems using finite differences or finite elements.
One of these is the frontal method; here, elimination takes place in carefully controlled manner, with intermediate results being kept in backing-store. Another variation consists of only storing the ‘nonzero’ elements in the matrix, at the expense of a great deal of ‘book-keeping’ and reordering (renumbering) rows and columns through the process, in the search for the best compromise between numerical stability and fill-in. 5.3.2 LU Factorisation In practice and particularly for large sparse matrices, the Gauss elimination
method takes the form of an LU factorisation. That is, the matrix A is factorised in the form A = LU, where L and U are lower triangular and upper triangular matrices, respectively. Once A is in that form, the problem to solve is: LUx = y. This can be written as Lz = y defining page 50 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez the auxiliary vector z = Ux, which can be found simply by forward substitution. (n steps). Then, x is found from Ux = z by backward substitution. The most expensive part of this algorithm is the factorisation itself. There are several variations here, where one of the most popular is the Choleski factorisation where T=L U (or T=U L ) applicable to symmetric positive definite matrices.. 5.4 Iterative Methods of Solving Ax = y We will outline 2 types of methods: (1a) the Jacobi (or simultaneous displacement) and (1b) the closely related Gauss-Seidel (or successive displacement) algorithm and (2) Gradient methods and its most popular version, the conjugate gradient method. 5.4.1 Jacobi and Gauss-Seidel Methods Two algorithms that are simple to implement are the closely related Jacobi (simultaneous displacement) and Gauss-Seidel (successive displacement or ‘relaxation’). a) Jacobi Method or Simultaneous Displacement Suppose the set of equations for solution are: a11x1 + a12x2 + a13x3 = y1 a21x1 + a22x2 + a23x3 = y2 (5.18) a31x1 + a32x2 + a33x3 = y3 This can be re-organised to: x1 = ( y1 – a12x2 – a13x3)/a11 (5.19a) x2 = ( y2 – a23x3 – a21x1)/a22 (5.19b) x3 = ( y3 – a31x1 – a32x2)/a33 (5.19c) Suppose we had the vector x(0) = [x1, x2, x3](0), and substituted it into the right-hand side of (5.19), to yield on the left-hand side the new vector x(1) = [x1, x2, x3](1). Successive substitutions will give the sequence of vectors: x(0), x(1), x(2), x(3), . . . . Because (5.19) is merely a rearrangement of the equations for solution, the ‘correct’ solution substituted into (5.19) must be self-consistent, i.e. yield itself! The sequence will: either converge to the correct solution or diverge. If the matrix A is split in the form: A = L + D + U (do not confuse with the L and U factors) where L, D and U are the lower, diagonal and upper triangular parts of A, the Jacobi algorithm can be written in matrix form as: 1 1( )k k k+ −= − −x D y Lx Ux . b) Gauss-Seidel Method or Successive Displacement Note that when eq. (5.19a) is applied, a ‘new’ value of x1 would be available, which could be used instead of the ‘previous’ x1 value when solving (5.19b). And similarly for x1 and x2 when applying (5.19c). This is the Gauss-Seidel or successive displacement iterative scheme, illustrated here with an example, showing that the computer program (in Matlab) is barely more complicated ELEC 0030 Numerical Methods page 51 © University College London 2019 – F.A. Fernandez than writing down the equations. % Example of successive displacement x1 = 0. ; x2 = 0. ; x3 = 0. ; % Equations being solved are: for i=1:10 x1 = (4. + x2 – x3)/4. ; % 4*x1 – x2 + x3 = 4 x2 = (9. – 2.*x3 – x1)/6. ; % x1 +6*x2 + 2*x3 = 9 x3 = (2. + x1 + 2.*x2)/5. 
; % -x1 -2*x2 + 5*x3 = 2 fprintf(‘%-18.8g %-18.8g %-18.8g\n’,x1, x2, x3) end 1 1.3333333 1.1333333 1.05 0.94722222 0.98888889 0.98958333 1.0054398 1.0000926 1.0013368 0.99974633 1.0001659 0.99989511 0.99996218 0.9999639 0.99999957 1.0000121 1.0000048 1.0000018 0.99999811 0.99999961 0.99999962 1.0000002 1 1 0.99999999 1 1 1 1 Using the same split form of the matrix A, the Successive Displacement or Gauss-Seidel algorithm can be written as: 1 1 1( )k k k+ − += − −x D y Lx Ux or 1 1( ) ( )k k+ −= + −x L D y Ux Whether the algorithm converges or not depends on the matrix A and (surprisingly) not on the right-hand-side vector y of (5.18). Convergence does not even depend on the ‘starting value’ of the vector, which only affects the necessary number of iterations. We will skip over any formal proof of convergence. But the schemes converge if-and-only- if all the eigenvalues of the matrix: D–1(U + L) {for simultaneous displacement} or (D + L)–1U {for successive displacement} lie within the unit circle. For applications, it is simpler to use some sufficient (but not necessary!) conditions, when possible, such as: (a) If A is symmetric and positive-definite, then successive displacement converges. (b) If, in addition, ai,j < 0 for all i ≠ j, then simultaneous displacement also converges. Condition (a) is a commonly satisfied condition. Usually, the successive method converges in about half the computer time of the simultaneous, but, strictly there are matrices where one method converges and the other does not, and vice-versa. c) Successive Overrelaxation (SOR) method A variation of the Successive Displacement method is called the Successive Overrelaxation Method or SOR. page 52 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez From the matrix form of the Gauss-Seidel algorithm, we can write, subtracting kx from both sides: 1 1 1( )k k k k k+ − +− = − − −x x D y Lx Dx Ux (Gauss-Seidel) and we can think of this difference as a “correction” to the previous value and in every iteration, a new correction will be made. So, we can think of accelerating convergence if we go beyond this correction; that is to correct by a bit more or “over-relax” the system. The idea of the SOR is then to calculate: 1 1( )k k k kω+ += + −x x x x where ω, 1 < ω < 2 is a correction parameter. Note that with ω = 1, the SOR reverts to the standard Gauss-Seidel. The SOR algorithm then becomes: 1 1 1( )k k k k kω+ − += + − − −x x D y Lx Dx Ux or 1 1 11( ) [ ( ) ]; 1 2k kωω ω ω + − −= + + − < The algorithm to calculate each term at iteration k+1 for the three methods is then; for i =1, n 1 1 1 1 i nk k k k i i i ij j ij j ii j j i x x y a x a x a − + = =   = + − −     ∑ ∑ Jacobi 1 1 1 1 1 i nk k k k i i i ij j ij j ii j j i x x y a x a x a − + + = =   = + − −     ∑ ∑ Gauss-Seidel 1 1 1 1 i n k k k k i i i ij j ij j ii j j i x x y a x a x a ω −+ + = =   = + − −     ∑ ∑ SOR end 5.4.2 Gradient Methods Although known and used for decades before, these methods came to be adopted in the 1980’s as one of the most popular iterative algorithms for solving linear systems of equations: A x = y. The fundamental idea of this approach is to define the residual y – A x for some trial vector x and proceed to minimize the residual with respect to the components of x.. 
The equation to be solved for x: A x = y (5.20) can be recast as: “finding x to minimize the residual”, a column vector r defined as a function of x by: r = y – A x (5.21) The general idea of this kind of methods is to search for the solution (minimum of the residual) in a multidimensional space (spanned by the components of vector x), starting from a point x0 and choosing a direction to move. The optimum distance to move along that direction can then be calculated. In the steepest descent method, the simplest form of this method, these directions are chosen as those of the gradient of the error residual at each iteration point. Because of this, they will be mutually orthogonal and then there will be no more than n different directions. In 2D, see Fig. 5.2, this means that every time we’ll have to make a change of direction at right angle to the previous, but this will no t always allow us to reach the minimum or at least not in an efficient way. The norm r of this residual vector is an obvious choice for the quantity to minimise, and ELEC 0030 Numerical Methods page 53 © University College London 2019 – F.A. Fernandez also 2r , the square of the norm of the residual, which is not negative and is only zero when the error is zero and there are no square roots to calculate. However, using (5.21), gives (if A is symmetric: AT = A): 2r = (y – Ax)T(y – Ax) = xTAAx – 2xTAy + yTy (5.22) which is rather awkward to compute, because of the product AA. Another possible choice of error function (measure of the error), valid for the case where the matrix A is positive definite and which is also minimised for the correct solution is the function: h2(x) = rTA–1r (instead of rTr as in (5.22). This gives a simpler form (see Appendix, section 5): 2 1 1 1( ) ( ) 2T T T T Th − − −= = − − = − +r A r y A x A y A x y A y x y x Ax (5.23) Because the first term in the r.h.s. of (5.23) is independent of the variables and will play no part in the minimisation, it can be omitted. Also, for convenience we can also slightly change the definition of the error function dividing by a factor of 2 so we finally have: 2 1 2 T Th = −x Ax x y (5.24) which in expanded form is: 2 1 2 i ij j i ii j i h x a x x y= −∑∑ ∑ . We can also note from (5.24) that 2h∇ = − = −Ax y r (5.25) The Steepest Descent Method: The algorithm proceeds by evaluating the error function h2 at a starting point x0 and then choosing a search direction. It would seem obvious that the direction to choose in this case is the direction of the negative of the gradient of the error function at that point. From (5.25) we can see that this search direction is 2h= −∇ =p r . The next step is to find the minimum value of the error function along that line. That is, if =p r is the opposite direction of the gradient (p pointing down), the line is defined by all the points ( )α+x p , where α is a scalar parameter. The next step is to find the value of α that minimizes the error. This gives the next point x1. (Since in this case, α is the only variable, it is simple to calculate the gradient of the error function as a function of α and finding the corresponding minimum). (Note that h2 is quadratic so the 2D contour lines will be quadratic curves – not as shown in the figure.) Exercise 5.1: Show that the value of α that minimizes the error function in the line ( )α+x p in the two cases mentioned above (using the squared norm of the residual as error function or the function h2), for Fig. 
5.2 u v x0 x1 page 54 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez a symmetric matrix A, is given respectively by: ( )2 T α − = p A y A x Ap and ( ) T Tα − = p y Ax p Ap The algorithm is then: Take 0 =x 0 , 0 =r y , 0 0=p r . (Or 0 0=x x , 0 0= −r y Ax , 0 0=p r ) for n = 1, 2, 3, … - Compute a step length: 1 1 1 1 1 1 1 1 T T n n n n n T T n n n n α − − − − − − − − = = p r r r p Ap r Ar - Update the approximate solution: 1 1n n n nα− −= +x x p - Update the residual: 1 1 1 1( )n n n n n n n nα α− − − −= − = − + = −r y Ax y A x p r Ap - Set the new search direction: n n=p r end Note that it is not necessary to normalize the vector p because any normalization factor will be cancelled in the calculation of the product αp for the next step. It turns out that the best direction to choose is not that of the negative of the gradient which that is the one taken in the conventional “gradient descent” or “steepest descent” method. In this case the consecutive directions are all orthogonal to each other as illustrated in Fig. 5.2 and this conduces to poor convergence. There are only n different directions and the algorithm will start using them again and again in a “multidimensional zig-zag”. Several variations appear at this point. The more efficient and popular “Conjugate Gradient Method” looks in a direction which is ‘A–orthogonal’ (or “conjugate”) to the previous one (choosing 1 0 T n n− =p Ap instead of 1 0 T n n− =p p ) in order to avoid this. The reason why this choice is made is beyond the scope of this course but the idea can be understood if we see that in 2D (2×2 matrices) the contours of the error function (if A is symmetric and positive definite) are ellipses and the perpendicular to an ellipse from an arbitrary point in the ellipse doesn’t pass through the minimum. The choice of the A-orthogonal direction to the previous one is equivalent to transform the variables such that the contours are circles, in which case the perpendicular will pass exactly through the centre (minimum). For this reason, this algorithm is guaranteed to converge in a maximum of n iterations. Of course, if the order is large the convergence can still be very slow in some cases. A useful feature of these methods (as can be observed from the expressions above) is that reference to the matrix A is only via simple matrix products; at any iteration, we need only form the product of A times a vector (Ax0 to start and then Apn-1). These can be formed from a given sparse matrix A without unnecessary multiplication (of non-zeros) or storage. The Conjugate Gradient Algorithm The basic steps of the algorithm are as follow: - Choose a starting point 0x and calculate 0 0= −r y A x - Choose first direction to search. In the first step we choose that of the negative gradient of h2: and this coincides with the direction of the residual r: 2 12( ) ( ) T Th∇ = ∇ − = − = −0 0 0 0x Ax x y Ax y r so we choose: 0 0=p r . - For n = 1, 2, 3, … ELEC 0030 Numerical Methods page 55 © University College London 2019 – F.A. Fernandez - Calculate the step length: ( )1 1 1 1 1 1 1 1 T T n n n n n T T n n n n α − − − − − − − − − = = p y A x p r p A p p A p . - Calculate the new point: 1 1n n n nα− −= +x x p - Calculate the new residual: 1 1n n n nα− −= −r r Ap - Calculate the next search direction. Here is where the methods differ. 
For the conjugate gradient the new vector p is not r but instead: 1n n n nβ −= +p r p where 1 1 T n n n T n n β − − = r r r r (gradient correction factor) - end In many cases it is necessary to use a preliminary ‘pre-conditioning’ of the matrix A to improve the convergence rate, particularly when the condition number of A is large. This leads to the popular ‘PCCG’ or Pre-Conditioned-Conjugate-Gradient algorithm, as a complete package and to many variations in implementation that can be found as commercial packages using different preconditioning methods. 5.4.3 Preconditioning One of the problems encountered frequently when using iterative methods to solve the problem =Ax y is low convergence rate. This can happen when the condition number of the matrix A is large and the problem becomes more serious when the matrices are very large. The solution to this is to modify somehow the problem to make it easier to solve. For example, if we modify the matrix equation multiplying it by another matrix or matrices we will get a new system: =Ax y   which might be easier to solve. A trivial system that is solved with no extra cost is when =A I , the identity matrix. But obviously, that would mean for example, premultiplying A by its inverse, but we know that calculating the inverse is a very costly procedure, particularly when the matrices are large and sparse. However, we can think of an approximation to this, selecting the transformation so that A is as close as possible to the identity matrix but the transformation is easier to calculate. There are several ways of performing this transformation: One is the left preconditioning. The idea here is to select a matrix M that resembles A such that 1−M A resembles the identity matrix. left preconditioning: 1 1− −=M Ax M y 1 1, , ,− −= = = =Ax y A M A x x y M y     It can also be right preconditioning: 1− =AM Mx y 1, , ,−= = = =Ax y A AM x Mx y y     or two-sided preconditioning: 1 1 11 2 2 1 − − −=M AM M x M y 1 1 11 2 2 1, , , − − −= = = =Ax y A M AM x M x y M y     To perform matrix-vector multiplications of the form: =u Av , a common need in these iterative methods, the explicit calculation of the inverse of the matrices M can be avoided. For example, for the case of two-sided preconditioning, the product can be calculated by: page 56 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez (i) find 1u from 2 1 =M u v (ii) find 2u as 2 1=u Au (iii) find u from 1 2=M u u And similarly, but simpler for the other forms. Splitting the matrix A in the form: A = L + D + U where L, D and U are the lower triangular, diagonal and upper triangular parts of A, some common forms of preconditioners are: Jacobi: M = D Gauss-Seidel: M = D + L SOR: 1 ω = +M D L where ω is a scalar parameter in [0, 2]. Symmetric Gauss-Seidel 11 ( ) ( ) −= + +M D L D D U T=U L for symmetric matrices Symmetric SOR: 11 2 1 1 1 1( ), ( ) ( ) 2 ω ω ω ω −= + = + − M D L M D D U A very common method is to use an incomplete LU decomposition, and in particular for symmetric, positive definite matrices, an incomplete Cholesky ( T=A LL ) decomposition (or IC). In a “full” decomposition, factorising a sparse matrix in the form =A LU leads to an increase in the number of nonzero values or “fill-in”. In the incomplete decomposition, the fill-in is partially or totally ignored, leading to just an approximation of the matrix A. 
This can be used for preconditioning using 1 2 T inc inc= =M M M L L for the IC: 1 1( )T Tinc inc inc inc − − −=L AL L x L y 1 1, , ,T Tinc inc inc inc − − −= = = =Ax y A L AL x L x y L y     This is the case of the popular “ICCG” method or Incomplete Choleski Conjugate Gradients, where the CG method is applied to a matrix preconditioned using IC. Example of Conjugate Gradients Fig. 5.3 Comparison of Convergence ELEC 0030 Numerical Methods page 57 © University College London 2019 – F.A. Fernandez Fig. 5.3 shows the results of using the Conjugate Gradients method with and without preconditioning to solve the problem Ax = b where the matrix A is symmetric of order 500 and consists of 3 columns defined by the Matlab command: A=sparse(diag(sqrt(1:n)) +diag(cos(1:(n-10)),10) +diag(cos(1:(n-10)),-10)); and the right hand side vector is b = A*ones(500,1). It can be seen that even the simple Jacobi preconditioner gives in this case a significant acceleration of convergence. These calculation were done using the script ExamplePCG.m 5.4.4 Krylov Methods Several iterative methods are based in the same basic strategy. They rely on the accurate calculation of a Krylov space that contains the solution and only differ in the definition of those spaces and the search method. A Krylov method solves Ax = b by repeatedly performing matrix- vector multiplications involving A. They are the most successful methods available at present to solve problems of linear algebra. Consider two vectors a and b of dimension 3. If they are not parallel, they will form a plane and we can think of all the vectors in 3D that lie on that plane. All those vectors can be found as linear combinations of a and b, and form what is called a subspace of the full 3D space (a plane). Since a and b generate all the vectors in the subspace, they are a base in that subspace. For convenience, frequently they are modified so they become orthogonal, since orthogonal bases are easier to handle and we can denote the subspace as: { , }S span= a b . Now, let’s consider vectors of large dimensions, as for example, the solution vector of a matrix equation of order n. For any vector r, if it is easy to calculate the product Ar, then it is easy to generate the sequence of vectors: 2A r , 3A r , …. and so on and we can think of the sequence of spaces of increasing order k: 2 3 1( , ) { , , , , , }, 1,2,kkK span k −= =A r r Ar A r A r A r  (5.26) known as the Krylov subspaces. Krylov subspace iterative methods normally start from an initial guess 0 0=x and with r = b and find the solution x of Ax = b by computing successively by iterations the vectors , 1,2,k k =x  such that kx is in the subspace ( , )kK A b : ( , ), 1,2,k kK k∈ =x A b  (5.27) The conjugate gradients method is one of these methods and applies when the matrix A is symmetric and positive definite. It requires only one matrix-vector product with A and 5 simple vector operations in each iteration to calculate iterates kx satisfying (5.25), which minimise k − Ax x over the Krylov subspace. (The A-norm is defined as T=Av v Av ). If the matrix A is symmetric but indefinite, another version of Krylov method, the minimum residual or MINRES needs one matrix-vector product with A and 7 simple vector operations in each iteration to calculate iterates kx satisfying (5.25), which minimise the Euclidean norm of the residual. If the matrix A is not symmetric, the problem is more difficult and there are several methods that can be used. 
One of the most common is the Generalised Minimum Residual (GMRES). This also computes iterates that minimise the Euclidean norm of the residual but requires an increasing page 58 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez amount of computation and storage at each successive iteration to achieve this. Then, GMRES is a good method if only a few iterations are needed, which can be the case when a good preconditioner is used. Otherwise, the memory and computing time requirements can be excessive. One way to ameliorate this problem is to restart GMRES at regular intervals. This reduces the memory requirements but it can also slow convergence. As will be seen in the next section, the sequence of vectors kA r will become more and more parallel to the dominant eigenvector of A as the iterations progress. So, in order to construct appropriate bases in these subspaces, orthogonalisation of the vectors is a very important step. The GMRES algorithm can be described as: choose an initial guess 0x form the residual vector: 0= −r b Ax normalise: 1 2 = rq r for 1,2, ,k m=  k=y Aq (form the consecutive products) for 1,2, ,j k=  Tjk jh = q y find components of y along the previous vectors jq jk jh= −y y q removal of this component (Gram-Schimdt orthogonalisation) end 1, 2k kh + = y calculate the norm of y 1 1,k k kh+ +=q y normalise the new vector Find kc that minimise: 2k −Hc z where 2( ,0,0 0)=z r  . 0k k k= +x x Q c end The matrix H of dimensions ( 1)k k+ × is formed with the elements ,j kh and the columns of the matrix Q are the Krylov vectors kq . ELEC 0030 Numerical Methods page 59 © University College London 2019 – F.A. Fernandez Example of GMRES with preconditioning Fig. 5.4 shows the results of using the GMRES method with and without preconditioning to solve the problem Ax = b where the matrix A is of order 500 and not symmetric. It consists of 3 columns defined by the Matlab command: A=sparse(diag(sqrt(1:n)) +diag(cos(1:(n-10)),10) +diag(sin(1:(n-10)),-10)); The right hand side vector is given by: b = A*ones(500,1). The problem has been solved without restarts and uses the script ExamplePGMRes.m. 5.5 Matrix Eigenvalue Problems The second class of matrix problem that occurs frequently in the numerical solution of differential equations as well as in many other areas, is the matrix eigenvalue problem. In many occasions, the matrices will be large and very sparse; in others, they will be dense and normally the solut ion methods will have to take these characteristics into account in order to achieve a solution in an efficient way. There is a number of different methods to solve matrix eigenvalue problems involving dense or sparse matrices, each of them with different characteristics and better adapted to different types of matrices and requirements. 5.5.1 Choice of method The choice of method will depend on the characteristics of the problem (type of the matrices) and on the solution requirements. For example, most methods suitable for dense matrices calculate all eigenvectors and eigenvalues of the system. However, in many problems arising from the numerical solution of PDEs one is only interested in one or just a few eigenvalues and/or eigenvectors. Also, in many cases the matrices will be large and sparse. In what follows we will concentrate in methods which are suitable for sparse matrices (of course the same methods can be applied to dense matrices). 
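In practice this situation, a large sparse matrix of which only one or a few eigenpairs are wanted, is exactly what sparse iterative eigensolvers are built for; in Matlab, for instance, eigs is used instead of eig. A minimal illustration (the test matrix, its size and the number of requested eigenvalues are arbitrary choices):

% Sketch: a few eigenvalues of a large sparse matrix (illustrative only)
n = 2000;
A = spdiags([-ones(n,1) 2*ones(n,1) -ones(n,1)], -1:1, n, n);  % sparse test matrix
[V,D] = eigs(A, 4, 'smallestabs');    % the 4 eigenvalues of smallest magnitude
disp(diag(D)')

The vector iteration methods described below are the kind of algorithm such routines are based on.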
Fig. 5.4 Comparison of Convergence

The problem to solve can have two different forms:

$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$   (standard eigenvalue problem)   (5.28)

$\mathbf{A}\mathbf{x} = \lambda\mathbf{B}\mathbf{x}$   (generalized eigenvalue problem)   (5.29)

Sometimes the generalized eigenvalue problem can be converted into the form (5.28) simply by premultiplying by the inverse of B:

$\mathbf{B}^{-1}\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$

However, if A and B are symmetric, the new matrix on the left hand side ($\mathbf{B}^{-1}\mathbf{A}$) will have lost this property. Instead, it is preferable to decompose (factorise) the matrix B in the form $\mathbf{B} = \mathbf{L}\mathbf{L}^T$ (Choleski factorisation – possible if B is positive definite). Substituting in (5.29) and premultiplying by $\mathbf{L}^{-1}$ will give: $\mathbf{L}^{-1}\mathbf{A}\mathbf{x} = \lambda\mathbf{L}^T\mathbf{x}$ and since $(\mathbf{L}^T)^{-1}\mathbf{L}^T = \mathbf{I}$, the identity matrix, $\mathbf{L}^{-1}\mathbf{A}(\mathbf{L}^T)^{-1}\mathbf{L}^T\mathbf{x} = \lambda\mathbf{L}^T\mathbf{x}$; putting $\mathbf{L}^T\mathbf{x} = \mathbf{y}$ and $\mathbf{L}^{-1}\mathbf{A}(\mathbf{L}^T)^{-1} = \tilde{\mathbf{A}}$ gives:

$\tilde{\mathbf{A}}\mathbf{y} = \lambda\mathbf{y}$

The matrix $\tilde{\mathbf{A}}$ is symmetric if A and B are symmetric and the eigenvalues are not modified. The eigenvectors x can be obtained from y simply by back-substitution. However, if the matrices A and B are sparse, this method is not convenient because $\tilde{\mathbf{A}}$ will generally be dense. In the case of sparse matrices, it is more convenient to solve the generalized problem directly.

Solution Methods
We can again classify solution methods as direct and iterative. Direct or transformation methods find all eigenvectors and eigenvalues in a fixed number of steps, and they are not suitable for large and sparse matrices. These methods include:
i) Jacobi rotations – converts the matrix of (5.28) to diagonal form
ii) QR (or QZ for complex) – converts the matrices to triangular form
iii) Conversion to tri-diagonal form – normally to be followed by either i) or ii)
We will not examine any of these in detail.

5.6 Vector Iteration Methods
These are better suited for sparse matrices and also for the case when only a few eigenvalues and eigenvectors are needed.

5.6.1 The Power Method (or direct iteration)
For the standard problem (5.28) the algorithm consists of the repeated multiplication of a starting vector by the matrix A. This can be seen to produce a reinforcement of the component of the trial vector along the direction of the eigenvector with the largest eigenvalue in absolute value, causing the trial vector to converge gradually to that eigenvector. The algorithm can be described schematically by:

Fig. 5.5

The normalization step is necessary because otherwise the iteration vector can grow indefinitely in length over the iterations.

How does this algorithm work? If $\varphi_i$, i = 1, ..., N are the eigenvectors of A, we can write any vector of N components as a superposition of them (they constitute a base in the space of N dimensions). In particular, for the starting vector:

$\mathbf{x}_0 = \sum_{i=1}^{N} \alpha_i \varphi_i$   (5.30)

When we multiply by A we get $\tilde{\mathbf{x}}_1 = \mathbf{A}\mathbf{x}_0$, or:

$\tilde{\mathbf{x}}_1 = \mathbf{A}\sum_{i=1}^{N} \alpha_i \varphi_i = \sum_{i=1}^{N} \alpha_i \mathbf{A}\varphi_i = \sum_{i=1}^{N} \alpha_i \lambda_i \varphi_i$   (5.31)

If $\lambda_1$ is the eigenvalue of largest absolute value, we can also write this as:

$\tilde{\mathbf{x}}_1 = \lambda_1 \sum_{i=1}^{N} \alpha_i \frac{\lambda_i}{\lambda_1} \varphi_i$

This is then normalized by: $\mathbf{x}_1 = \tilde{\mathbf{x}}_1 / \|\tilde{\mathbf{x}}_1\|$

Then, after n iterations of multiplications by A and normalization we will get:

$\mathbf{x}_n = C \sum_{i=1}^{N} \alpha_i \left(\frac{\lambda_i}{\lambda_1}\right)^{\!n} \varphi_i$   (5.32)

where C is a normalization constant. From this expression (5.32) we can see that since $|\lambda_1| \ge |\lambda_i|$ for all i, the coefficients of all $\varphi_i$ for i ≠ 1 will tend to zero as n increases, and then the vector $\mathbf{x}_n$ will gradually converge to $\varphi_1$.
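A bare-bones Matlab implementation of this iteration could look as follows. The matrix, starting vector and stopping test are arbitrary illustrative choices (unrelated to Exercise 5.2 below), and the eigenvalue estimate x'Ax anticipates the Rayleigh quotient discussed next.

% Sketch: power method for the dominant eigenpair (illustrative only)
A = [4 1 0; 1 3 1; 0 1 2];          % arbitrary symmetric test matrix
x = ones(size(A,1),1);               % arbitrary starting vector
lam_old = 0;
for k = 1:200                        % hard iteration limit
    x = A*x;                         % multiply by A
    x = x/norm(x);                   % normalise
    lam = x'*A*x;                    % eigenvalue estimate (||x|| = 1)
    if abs(lam - lam_old) < 1e-10*abs(lam), break, end
    lam_old = lam;
end
fprintf('dominant eigenvalue approx. %g after %d iterations\n', lam, k)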
This method (the power method) finds the dominant eigenvector; that is, the eigenvector that corresponds to the largest eigenvalue (in absolute value). To find the eigenvalue we can note that pre-multiplying (5.28) by the transpose of $\varphi_i$ gives:

$\varphi_i^T \mathbf{A}\varphi_i = \lambda_i\, \varphi_i^T \varphi_i$

where $\varphi_i^T \mathbf{A}\varphi_i$ and $\varphi_i^T \varphi_i$ are scalars, so we can write:

$\lambda_i = \dfrac{\varphi_i^T \mathbf{A}\varphi_i}{\varphi_i^T \varphi_i}$   (5.33)

This is known as the Rayleigh quotient and can be used to obtain the eigenvalue from the eigenvector. This expression has interesting properties: if we only know an estimate of the eigenvector, (5.33) will give an estimate of the eigenvalue with a higher order of accuracy than that of the eigenvector itself.

Exercise 5.2
Write a short program to calculate the dominant eigenvector and the corresponding eigenvalue of a matrix like that in (7.30) but of order 7, using the power method. Terminate the iterations when the relative difference between two successive estimates of the eigenvalue is within a tolerance of 10–6.

Exercise 5.3 (advanced)
Using the same reasoning used to explain the power method for the standard eigenvalue problem (5.30)–(5.32), develop the corresponding algorithm for the generalized eigenvalue problem Ax = λBx.

The power method in the form presented above can only be used to find the dominant eigenvector. However, modified forms of the basic idea can be developed to find any eigenvector of the matrix. One of these modifications is the inverse iteration.

5.6.2 Inverse Iteration
For the system Ax = λx we can write $\mathbf{A}^{-1}\mathbf{x} = \frac{1}{\lambda}\mathbf{x}$, from where we can see that the eigenvalues of A–1 are 1/λ, the reciprocals of the eigenvalues of A, and the eigenvectors are the same. In this form, if we are interested in finding the eigenvector of A corresponding to the smallest eigenvalue in absolute value (closest to zero), we can notice that for that eigenvalue λ its reciprocal is the largest, and so it can be found using the power method on the matrix A–1. The procedure then is as follows:
1. Choose a starting vector x0.
2. Find $\mathbf{x}_1 = \mathbf{A}^{-1}\mathbf{x}_0$, or instead solve $\mathbf{A}\mathbf{x}_1 = \mathbf{x}_0$ (avoiding the explicit calculation of A–1).
3. Normalize: $\mathbf{x}_1 \leftarrow \mathbf{x}_1/C$ (C is a normalization factor chosen so that $\|\mathbf{x}_1\| = 1$).
4. Re-start.

Exercise 5.4
Write a program and calculate by inverse iteration the smallest eigenvalue of the tridiagonal matrix A of order 7 where the elements are –4 in the main diagonal and 1 in the subdiagonals. You can use the algorithms given in section 2 of the Appendix for the solution of the linear system of equations.

An important extension of the inverse iteration method allows finding any eigenvalue of the spectrum (spectrum of a matrix: the set of its eigenvalues), not just the one of smallest absolute value.

5.6.3 Shifted Inverse Iteration
Suppose that the system Ax = λx has the set of eigenvalues {λi}. If we construct the matrix $\tilde{\mathbf{A}}$ as

$\tilde{\mathbf{A}} = \mathbf{A} - \sigma\mathbf{I}$

where I is the identity matrix and σ is a real number, we have:

$\tilde{\mathbf{A}}\mathbf{x} = \mathbf{A}\mathbf{x} - \sigma\mathbf{x} = \lambda\mathbf{x} - \sigma\mathbf{x} = (\lambda - \sigma)\mathbf{x}$   (5.34)

We can see that the matrix $\tilde{\mathbf{A}}$ has the same eigenvectors as A and its eigenvalues are {λ − σ}, that is, the same eigenvalues as A but shifted by σ. Then, if we apply the inverse iteration method to the matrix $\tilde{\mathbf{A}}$, the procedure will yield the eigenvalue (λi − σ) closest to zero; that is, we can find the eigenvalue λi of A closest to the real number σ.

5.6.4 Rayleigh Iteration
Another extension of this method is known as Rayleigh iteration.
In this case, the Rayleigh quotient is used to calculate an estimate of the eigenvalue at each iteration and the shift is updated using this value. ELEC 0030 Numerical Methods page 63 © University College London 2019 – F.A. Fernandez Since the convergence rate of the power method depends on the relative value of the coefficients of each eigenvector in the expansion of the trial vector during the iterations (as in (5.30)) and these are affected by the ratio between the eigenvalues λi and λ1, the convergence will be fastest when this ratio is largest as we can see from (5.32). The same reasoning applied to the shifted inverse iteration method leads to the conclusion that the convergence rate will be fastest when the shift is chosen as close as possible to the target eigenvalue. In this form, the Rayleigh iteration has faster convergence than the ordinary shifted inverse iteration. Exercise 5.5 Write a program using shifted inverse iteration to find the eigenvalue of the matrix A of the previous exercise which lies closest to 3.5. Then, write a modified version of this program using Rayleigh quotient update of the shift in every iteration (Rayleigh iteration). Compare the convergence of both procedures (by the number of iterations needed to achieve the same tolerance for the relative difference between successive estimates of the eigenvalue). Use a tolerance of 10–6 in both programs. page 64 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez 6. Numerical Differentiation and Integration 6.1 Numerical Differentiation For a function of one variable f(x), the derivative at a point x = a is defined as: 0 ( ) ( )'( ) lim h f a h f af a h→ + − = (6.1) This suggests that choosing a small value for h the derivative can be reasonably approximated by the forward difference: ( ) ( )'( ) f a h f af a h + − ≈ (6.2) An equally valid approximation can be the backward difference: ( ) ( )'( ) f a f a hf a h − − ≈ (6.3) Graphically, we can see the meaning of each of these expressions in the figure. The derivative at xc is the slope of the tangent to the curve at the point C. The slope of the chords between the points A and C, and B and C are the values of the backward and forward difference approximations to the derivative, respectively. We can see that a better approximation to the derivative is obtained by the slope of the chord between points A and B, labelled “central difference” in the figure. Fig. 6.1 We can understand this better analysing the error in each approximation by the use of Taylor expansions. Considering the expansions for the points a+h and a−h: (2) (3) 2 3( ) ( )( ) ( ) '( ) 2! 3! f a f af a h f a f a h h h+ = + + + +  (6.4) (2) (3) 2 3( ) ( )( ) ( ) '( ) 2! 3! f a f af a h f a f a h h h− = − + − +  (6.5) where in both cases the reminder (error) can be represented by a term of the form: (4) 4( ) 4! f hξ (see Appendix), so we could write instead (2) (3) 2 3 4( ) ( )( ) ( ) '( ) ( ) 2! 3! f a f af a h f a f a h h h O h+ = + + + + where the symbol O(h4) means: “a term of the order of h4 ” Truncating (6.4) to first order we can then see that 2( ) ( ) '( ) ( )f a h f a f a h O h+ = + + , so for the forward difference formula we have: ELEC 0030 Numerical Methods page 65 © University College London 2019 – F.A. Fernandez ( ) ( )'( ) ( )f a h f af a O h h + − = + (6.6) and we can see that the error of this approximation is of the order of h. A similar result is obtained for the backward difference. 
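This first-order error behaviour is easy to check numerically. In the following sketch (the test function, the point a and the step sizes are arbitrary choices) halving h roughly halves the forward-difference error, consistent with an O(h) truncation error:

% Sketch: error of the forward difference approximation (6.2) to f'(a)
f  = @(x) exp(x).*sin(x);            % arbitrary test function
df = @(x) exp(x).*(sin(x)+cos(x));   % its exact derivative
a  = 1.0;
for h = [0.1 0.05 0.025 0.0125]
    fd = (f(a+h) - f(a))/h;          % forward difference
    fprintf('h = %7.4f   error = %10.3e\n', h, abs(fd - df(a)));
end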
We can also see that subtracting (6.4) and (6.5) and discarding terms of order of h3 and higher we can obtain a better approximation: 3( ) ( ) 2 '( ) ( )f a h f a h f a h O h+ − − = + From where we obtain the central difference formula: 2( ) ( )'( ) ( ) 2 f a h f a hf a O h h + − − = + (6.7) which has an error of the order of h2. Expressions (6.2) and (6.3) are “two-point” formulae for the first derivative. Many more different expressions with different degrees of accuracy can be constructed using a similar procedure and using more points. For example, the Taylor expansions for the function at a number of points can be used to eliminate all the terms except the desired derivative. Example Considering the 3 points x0, x1 = x0 + h and x2 = x0 + 2h and taking the Taylor expansions at x1 and x2 we can construct a three-point forward difference formula: (2) 2 30 1 0 0 ( )( ) ( ) '( ) ( ) 2! f xf x f x f x h h O h= + + + (6.8) (2) 2 30 2 0 0 ( )( ) ( ) '( )2 4 ( ) 2! f xf x f x f x h h O h= + + + (6.9) Multiplying (6.8) by 4 and subtracting to eliminate the second derivative term: 31 2 0 04 ( ) ( ) 3 ( ) 2 '( ) ( )f x f x f x f x h O h− = + + from where we can extract for the first derivative: 20 1 20 3 ( ) 4 ( ) ( )'( ) ( ) 2 f x f x f xf x O h h − + − = + (6.10) Exercise 6.1 Considering the 3 points x0, x1 = x0 − h and x2 = x0 − 2h and the Taylor expansions at x1 and x2 find a three-point backward difference formula. What is the order of the error? Exercise 6.2 Using the Taylor expansions for f(a+h) and f(a–h) show that a suitable formula for the second derivative is: (2) 2 ( ) 2 ( ) ( )( ) f a h f a f a hf a h − − + + ≈ (6.11) Show also that the error is O(h2). Exercise 6.3 Use the Taylor expansions for f(a+h), f(a–h), f(a+2h) and f(a–2h) to show that the following are formulae for f ’(a) and f(2)(a), and that both have an error of the order of h4: page 66 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez ( 2 ) 8 ( ) 8 ( ) ( 2 )'( ) 12 f a h f a h f a h f a hf a h − − − + + − + ≈ (2) 2 ( 2 ) 16 ( ) 30 ( ) 16 ( ) ( 2 )( ) 12 f a h f a h f a f a h f a hf a h − − + − − + + − + ≈ Expressions for the derivatives can also be found using other methods. For example, if the function is interpolated with a polynomial using, say, n points, the derivative (first, second, etc) can be estimated by calculating the derivative of the interpolating polynomial at the point of interest. Example: Considering the 3 points x1, x2 and x3 with x1 < x2 < x3 (this time, not necessarily equally spaced) and respective function values y1, y2 and y3, we can use the Lagrange interpolation polynomial to approximate y(x): 1 1 2 2 3 3( ) ( ) ( ) ( ) ( )f x L x L x y L x y L x y≈ = + + (6.12) where: 2 3 1 1 2 1 3 ( )( )( ) ( )( ) x x x xL x x x x x − − = − − , 1 32 2 1 2 3 ( )( )( ) ( )( ) x x x xL x x x x x − − = − − and 1 23 3 1 3 2 ( )( )( ) ( )( ) x x x xL x x x x x − − = − − (6.13) so the first derivative can be approximated by: 1 1 2 2 3 3'( ) '( ) '( ) '( ) '( )f x L x L x y L x y L x y≈ = + + which is: 2 3 1 3 1 21 2 3 1 2 1 3 2 1 2 3 3 1 3 2 2 2 2'( ) ( )( ) ( )( ) ( )( ) x x x x x x x x xf x y y y x x x x x x x x x x x x − − − − − − ≈ + + − − − − − − . This general expression will give the value of the derivative at any of the points x1, x2 or x3. 
For example, at x1: 1 2 3 1 3 1 21 1 2 3 1 2 1 3 2 1 2 3 3 1 3 2 2'( ) ( )( ) ( )( ) ( )( ) x x x x x x xf x y y y x x x x x x x x x x x x − − − − ≈ + + − − − − − − (6.14) Exercise 6.4 Show that if the points are equally spaced by the distance h in (6.12) and the expression is evaluated at x1, x2 and x3, the expression reduces respectively to the 3-point forward difference formula (6.10), the central difference and the 3-point backward difference formulae. Central Difference Expressions for derivatives The following expressions are central difference approximations for different derivatives: i) 1 1( ) ( )'( ) 2 i i i f x f xf x h + −−= ii) 1 12 ( ) 2 ( ) ( )''( ) i i ii f x f x f xf x h + −− + = ELEC 0030 Numerical Methods page 67 © University College London 2019 – F.A. Fernandez iii) 2 1 1 23 ( ) 2 ( ) 2 ( ) ( )'''( ) 2 i i i i i f x f x f x f xf x h + + − −− + −= iv) (4) 2 1 1 24 ( ) 4 ( ) 6 ( ) 4 ( ) ( )( ) i i i i ii f x f x f x f x f xf x h + + − −− + − += Naturally, many more expressions can be developed using more points and/or different methods. Exercise 6.5 Derive expressions iii) and iv) above. What is the order of the error for each of the 4 expressions above? 6.1.1 Partial Derivatives For a function of two variables: f(x,y), the partial derivative: ( , )f x y x ∂ ∂ is defined as: 0 ( , ) ( , ) ( , )lim h f x y f x h y f x y x h→ ∂ + − = ∂ , which clearly, is a function of y. Again, we can approximate this expression by a difference assuming that h is sufficiently small. Then, for example a central difference expression for ( , )f x y x ∂ ∂ is: ( , ) ( , ) ( , ) 2 f x y f x h y f x h y x h ∂ + − − = ∂ (6.15) Similarly, ( , ) ( , ) ( , ) 2 f x y f x y h f x y h y h ∂ + − − = ∂ In this form, the gradient of f is given by: ˆ ˆ( , ) ( , ) [ ( , ) ( , )] [ ( , ) ( , )]ˆ ˆ( , ) 2 f x y f x y f x h y f x h y x f x y h f x y h yf x y x y x y h ∂ ∂ + − − + + − − ∇ = + ≈ ∂ ∂ Exercise 6.6 Using central difference formulae for the second derivatives derive an expression for the Laplacian ( 2 f∇ ) of a scalar function f. 6.2 Numerical Integration In general, numerical integration methods approximate the definite integral of a function f by a weighted sum of function values at several points in the interval of integration. In general these methods are called “quadrature” and there are different methods to choose the points and the weights. 6.2.1 Trapezoid Rule The simplest method to approximate the integral of a function is the trapezoid rule. In this case, the interval of integration is divided into a number of subintervals on which the function is simply approximated by a straight line as shown in Fig. 6.2. page 68 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez The integral (area under the curve) is then approximated by the sum of the areas of the trapezoids based on each subinterval. The area of the trapezoid with base in the interval [xi, xi+1] is: 11 ( )( ) 2 i i i i f fx x ++ + − and the total area is then the sum of all the terms of this form. If we denote by 1( )i i ih x x+= − Fig. 
6.2 the width of the subinterval [xi, xi+1], the integral can be approximated by: 1 1 1 1 1( ) ( ) 2 nx n i i i ix f x dx h f f − + = ≈ +∑∫ (6.16) which can be written as: 1 1 1 1 2 1 2 3 2 3 1 2 12 1 ( ) [ ( ) ( ) ( ) ] nx n n n n n n i i ix f x dx f h f h h f h h f h h f h fω− − − = ≈ + + + + + + + + = ∑∫  with: 1 1 1 2 1 ( ) 2 2, , 1 2 i i i n h i h h i n h i n ω − − = = + = −  =  If all the subintervals are of the same width (the points are equally spaced), this reduces to: 1 1 1 2 1 2 1 or ( ) ; 2, , 12 nx n n n i i i i i ix h i nf ff x dx h f f h i n ω ω − = = =  + ≈ + = =   = −  ∑ ∑∫  (6.17) Exercise 6.7 (a) Using the trapezoid rule and 10 and 100 subintervals (11 and 101 points, respectively) calculate the integral of exp(−x2) between 0 and 2 and compare the results. (b) Calculate the integral of 1/x between 1 and 2 using 11 and 101 points. It can be shown that the error incurred with the application of the trapezoid rule in one interval is given by the term: 3 (2)( ) ( ) 12 b aE f ξ−= − (6.18) where ξ is a point inside the interval and the error is defined as the difference between the exact integral (I) and the area of the trapezoid (A): E = I − A. If this is applied to the Trapezoid rule using a number of subintervals, the error term changes to: ELEC 0030 Numerical Methods page 69 © University College London 2019 – F.A. Fernandez 2 (2)( ) ( ) 12 h b a hE f ξ−= − (6.19) where now hξ is a point in the complete interval [a, b] and depends on h. Considering the Error as the sum of the individual errors in each subinterval, we can write this term in the form: 3 2 (2) (2) 1 1 ( ) ( ) 12 12 n n i i i i h hE f hfξ ξ = = = − = −∑ ∑ (6.20) where iξ are points in each subintervals. Expression (6.20), in the limit when n → ∞ and 0h → corresponds to the integral of f(2) over the interval [a, b]. Then, we can write (6.20) as: 2 ( '( ) '( )) 12 hE f b f a≈ − − (6.21) Two options are open now, we can use this term to estimate the error incurred or equivalently, to determine the number of equally spaced points required for a determined precision or, include this term in the calculation to form a corrected form of the trapezoid rule: 21 1 2 ( ) ( '( ) '( )) 2 12 b n n i ia f f hf x dx h f f b f a − =  + ≈ + − −    ∑∫ (6.22) Example For the function: ( ) sin( )f x x xπ= the exact value of the integral is: 1 0 1( ) 0.31830988618379f x dx π = =∫ Calculating the integral with five subintervals the results without and with the correction term are: Integral Error % 0.30855052604273 3.0660 % 0.31378651379871 1.4211 % 6.2.2 Simpson’s Rule In the case of the trapezoid rule, the function is approximated by a straight line and this can be done repeatedly by subdividing the interval. A higher degree of accuracy using the same number of subintervals can be obtained with a better approximation than the straight line. For example, choosing a quadratic approximation could give a better result. This is the Simpson’s rule. Consider the function f(x) and the interval [a, b]. Defining the points x0, x1 and x2, as: 0 1 2, , 2 a bx a x x b+= = = and defining ( ) 2h b a= − Using Lagrange interpolation to generate a second order polynomial approximation to f(x) gives as in (6.12): 0 0 1 1 2 2( ) ( ) ( ) ( ) ( ) ( ) ( )f x f x L x f x L x f x L x≈ + + page 70 ELEC0030 Numerical Methods © University College London 2019 – F.A. 
Fernandez where: 1 20 1 22 0 1 0 2 ( )( ) 1( ) ( )( ) ( )( ) 2 x x x xL x x x x x x x x x h − − = = − − − − 0 21 0 22 1 0 1 2 ( )( ) 1( ) ( )( ) ( )( ) x x x xL x x x x x x x x x h − − = = − − − − − and 0 12 0 12 2 0 2 1 ( )( ) 1( ) ( )( ) ( )( ) 2 x x x xL x x x x x x x x x h − − = = − − − − Then, ( ) 2 0 0 1 1 2 2( ) ( ) ( ) ( ) ( ) ( ) ( ) b a h a a f x dx f x L x f x L x f x L x dx + ≈ + +∫ ∫ (6.23) or 2 2 2 0 0 1 1 2 2( ) ( ) ( ) ( ) ( ) ( ) ( ) b a h a h a h a a a a f x dx f x L x dx f x L x dx f x L x dx + + + ≈ + +∫ ∫ ∫ ∫ The individual integrals are: 2 2 0 1 22 1( ) ( )( ) 2 a h a h a a L x dx x x x x dx h + + = − −∫ ∫ and with the change of variable: 1x t x x→ = − 2 3 2 0 2 2 1 1( ) ( ) 3 2 32 2 ha h h a h h t t hL x dx t t h dt h h h + − −   = − = − =    ∫ ∫ Similarly, 2 1 4( ) 3 a h a hL x dx + =∫ and 2 2 ( ) 3 a h a hL x dx + =∫ Then, substituting in (6.23): ( ))2()(4)( 3 )( hafhafafhdxxf b a ++++≈∫ (6.24) Example Use of the Simpson’s rule to calculate the integral: 1.25 0.25 (sin 0.5)x dxπ +∫ The 2nd order Lagrange interpolation polynomial used in this case is: ELEC 0030 Numerical Methods page 71 © University College London 2019 – F.A. Fernandez 2( ) 2.8284 3.0355 0.5214p x x x= − + + We have: x0 = a = 0.25, 2 0 2 1.25x x h b= + = = , then 1 0 0.75x x h= + = and h = 0.5 The figure shows the function (in blue) and the 2nd order Lagrange interpolation (red) used to calculate the integral with the Simpson’s rule. The exact value of this integral is: 1.25 0.25 (sin 0.5) 0.950158158x dxπ + =∫ Applying the Simpson’s rule to this function gives: Fig. 6.3 ( ) 1.25 0.25 (sin 0.5) ( ) 4 ( ) ( 2 ) 0.97140452 3 hx dx f a f a h f a hπ + ≈ + + + + =∫ As with the trapezoid rule, h igher accuracy can be obtained subdividing the interval of integration and adding the result of the integrals over each subintervals. Exercise 6.8 Write down the expression for the composite Simpson’s rule using n equal subintervals and use it to calculate the integral 1.25 0.25 (sin 0.5)x dxπ +∫ of the example above using 10 subintervals. 6.2.3 The Midpoint Rule Another simple method to approximate a definite integral is the midpoint rule where the function is simply approximated by a constant value over the interval; the value of the function at the midpoint. In such a way, the integral of f(x) in the interval [a, b] is approximated by the area of the rectangle of base (b − a) and height f(c), where c = (a + b)/2. Then, for the integral: ( ) ( ) ( ) b a I f x dx b a f c Error= ≈ − +∫ (6.25) To estimate the error, let’s consider the Taylor expansion for the function f, truncated to first order and centred in the midpoint of the interval [a, b]: 21( ) ( ) ( ) ( ) '( ) (( ) )f x p x f c x c f c O x c≈ = + − + − (6.26) The error term in (6.26) is actually: 2 (2)1 1( ) ( ) ( ) 2 R x x c f ξ= − (see App. 1) so the error in (6.25) so (6.25) can be approximated as: 0 0.5 1 1.5 0 1 x0 x1 x2 page 72 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez 1 1 1( ) ( ) [ ( ) ( ) '( ) ( )] ( )( ) ( ) b b b b a a a a I f x dx p x dx f c x c f c R x dx f c b a R x dx= ≈ = + − + = − +∫ ∫ ∫ ∫ because the integral of the second term is: '( ) ( ) 0 b a f c x c dx− =∫ since ( ) 2c a b= + . 
Comparing with (6.25) gives: ( )(2) 2 (2) 3 (2) 3 31 1 1 1( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )2 6 6 b b b a a a Error R x dx f x c dx f x c f b c a cξ ξ ξ= = − = − = − − −∫ ∫ but since b – c = c – a = (b − a)/2, we have: 3 (2)1 ( ) ( ) 24 E b a f ξ= − (6.27) We can write then: 3 (2)1( ) ( ) ( ) ( ) ( ) 24 b a f x dx b a f c b a f ξ= − + −∫ (6.28) for some ξ in the interval [a, b]. Similarly to the case of the trapezoid rule and the Simpson’s rule, the midpoint rule can be used in a composite manner, after subdividing the interval of integration in a number of subintervals to achieve higher precision. In that case, the expression for the integral becomes: 2 (2) 1 ( )( ) ( ) ( ) 24 b n i ia h b af x dx h f c f η = − = +∑∫ (6.29) where ci are the midpoints of each of the n subintervals, h = (b − a)/n, the length of each subinterval and η is a point between a and b. Exercise 6.9 Use the Midpoint Rule to calculate the integral: 1.25 0.25 (sin 0.5)x dxπ +∫ using 2 and 10 subintervals. compare the result with that of the Simpson’s rule in the example above. 6.2.4 Gaussian Quadrature In the trapezoid, Simpson’s and midpoint rule, the definite integral of a function f(x) is approximated by the exact integral of a polynomial that approximates the function. In all these cases, the evaluation points are chosen arbitrarily, often equally spaced. However, it is rather clear that the precision attained is dependent on the position of these points, which gives another route to optimisation. After that, the weighting coefficients are determined by the choice of method. Considering again the general approach to the approximation of the definite integral, the problem can be written in the form: 1 11 ( ) ( ) ( ) n n n n i i i f x dx G f w f x =− ≈ = ∑∫ (6.30) (The interval of integration is here chosen as [−1, 1], but obviously, any other interval can be mapped into this by a change of variable.) The objective now is to find for a given n, the best choice of evaluation points nix (called here ELEC 0030 Numerical Methods page 73 © University College London 2019 – F.A. Fernandez “Gauss points”) and weights niw (“Gauss weights”) to get maximum precision in the approximation. Compared to the criterion for the trapezoid and Simpson’s rules, this is equivalent to try to find for a fixed n the best choice for nix and n iw such that the approximation (6.30) is exact for a polynomial of degree N, with N (>n) as large as possible. (That is, going beyond the degree of approximation of the previous methods.) So what we are saying is: Find nix and n iw such that: 1 11 ( ) ( ) n n n N i N i i P x dx w P x =− = ∑∫ exactly for some N > n. Choosing the coefficients of PN: 0,ia i k= ≠ , in sequence for 0, ,k N=  , we have: 1 11 ( ) n k n n k i i i x dx w x =− = ∑∫ for k = 0, 1, 2, …, N (6.31) with N as large as possible (note the equal sign now). We can simplify the problem to these functions now because any polynomial is just a superposition of terms of the form kx , so if the integral is exact for each of them, it will be for any polynomial containing those terms. Expression (6.31) is a system of equations that the Gauss points and weights (unknown) need to satisfy. The problem is then, to find the nix and n iw (n of each). This is a nonlinear problem that cannot be solved directly. We also have to determine the value of N. It can be shown that this number is N = 2n − 1. This is also rather natural since (6.31) consists of N+1 equations and we need to find 2n parameters. 
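For instance, for n = 2 the solution of these equations turns out to be the nodes ±1/√3 with weights equal to 1 (the same values appear in the table further below). The following sketch simply verifies numerically that this two-point rule integrates x^k exactly for k = 0, ..., 3, i.e. up to degree 2n − 1, and that exactness is lost at k = 4:

% Sketch: the 2-point Gauss rule is exact for polynomials up to degree 3
xg = [-1 1]/sqrt(3);   wg = [1 1];            % 2-point Gauss nodes and weights
for k = 0:4
    exact  = (1 - (-1)^(k+1))/(k+1);          % integral of x^k over [-1,1]
    approx = sum(wg .* xg.^k);
    fprintf('k = %d   exact = %8.5f   2-point Gauss = %8.5f\n', k, exact, approx);
end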
Finding the weights For a set of weights and Gauss points, let’s consider the Lagrange interpolation polynomials of order n, associated to each of the Gauss points: 1, ( ) nn jn i n n j j i i j x x L x x x= ≠ − = −∏ (6.32) then, since the expression (6.31) should be exact for any polynomial up to order N = 2n − 1, and ( )niL x is of order n, we have: 1 11 ( ) ( ) n n n n n i j i j j L x dx w L x =− = ∑∫ for each i = 1, .. , n (6.33) but since the ( )niL x are interpolation polynomials, ( ) n n j i j iL x δ= (that is, they are 1 if j = i and 0 otherwise), all the terms in the sum in (6.33) are zero except for j = i and we have: 1 1 ( )n ni iL x dx w − =∫ (6.34) With this, we have the weights for a given set of Gauss points. We need to find now the best choice for these. If P(x) is an arbitrary polynomial of degree ≤ 2n − 1, we can write: P(x) = Pn(x) Q(x) + R(x); (Q and R are respectively the quotient polynomial and reminder polynomial of the division of P by Pn). Pn(x) is of order n and then, Q and R are of degree n − 1 or less. Then we have: page 74 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez 1 1 1 1 1 1 ( ) ( ) ( ) ( )nP x dx P x Q x dx R x dx − − − = +∫ ∫ ∫ (6.35) If we now define the polynomial Pn(x) by its roots and choose these as the Gauss points: ∏ = −= n i n in xxxP 1 )()( , (6.36) then, the integral of the product Pn(x) Q(x), which is a polynomial of degree ≤ 2n − 1, must be given exactly by the quadrature expression: 1 11 ( ) ( ) ( ) ( ) n n n n n i n i i i P x Q x dx w P x Q x =− = ∑∫ (6.37) but since the Gaussian points are the roots of Pn(x), (6.37) must be zero; that is: 1 1 ( ) ( ) 0nP x Q x dx − =∫ (6.38) for all polynomials Q(x) of degree n − 1 or less. This implies that Pn(x) must be a member of a family of orthogonal polynomials.1. In particular, Legendre polynomials are a good choice because they are orthogonal in the interval [−1, 1] with a weighting function w(x) = 1. Going back now to the integral of the arbitrary polynomial P(x) of degree ≤ 2n − 1, and equation (6.35), we have that if we choose Pn(x) as in (6.36), (6 38) is satisfied and then, (6.35) reduces to: 1 1 1 1 ( ) ( )P x dx R x dx − − =∫ ∫ (6.39) but since R(x) is of degree n − 1 or less, the interpolation using Lagrange polynomials for the n Gauss points will give the exact representation of R (see Exercise 4.2). That is: 1 ( ) ( ) ( ) n n n i i i R x R x L x = = ∑ exactly. (6.40) Then, 1 1 1 1 11 1 1 ( ) ( ) ( ) ( ) ( ) n n n n n n i i i i i i R x dx R x L x dx R x L x dx = =− − − = =∑ ∑∫ ∫ ∫ (6.41) but we have seen before, in (6.34), that the integral of the Lagrange interpolation polynomia
l for point i is the value of niw , so: 1 11 ( ) ( ) n n n i i i R x dx w R x =− = ∑∫ (6.42) Now, since P(x) = Pn(x) Q(x) + R(x) and ( ) 0nn iP x = (see 6.36), ( ) ( ) n n i iP x R x= and from (6.39): 1 Remember that for orthogonal polynomials in [−1, 1], all their roots lie in [−1, 1], and satisfy: 1 1 ( ) ( ) ji j ip x p x dx δ − =∫ . Additionally, 1 1 ( ) ( ) 0ip x q x dx − =∫ for any polynomial q of degree ≤ i −1. ELEC 0030 Numerical Methods page 75 © University College London 2019 – F.A. Fernandez 1 11 ( ) ( ) n n n i i i P x dx w P x =− = ∑∫ (6.43) which tells us that the integral of the arbitrary polynomial P(x) of order 2n − 1 can be calculated exactly using the set of Gauss points nix , the roots of the nth order Legendre polynomial and the weights niw determined by (6.34) – the integral over the interval [−1, 1] of the Lagrange polynomial of order n corresponding to the Gauss point nix . Back to the start then, we have now a set of n Gauss points and weights that yield the exact evaluation of the integral of a polynomial up to degree 2n − 1. These should also give the best result to the integral of an arbitrary function f: 1 11 ( ) ( ) n n n i i i f x dx w f x =− ≈ ∑∫ (6.44) Gauss nodes and weights for different orders are given in the following table. Gaussian Quadrature: Nodes and Weights n Nodes nix Weghts n iw 1 0.0 2.0 2 3 3± = ±0.577350269189 1.0 3 0 8 9 = 0.888888888889 15 5± = ±0.774596669241 5 9 = 0.555555555556 4 525 70 30 35± − =±0.339981043585 (18 30) 36+ = 0.652145154863 353070525 +± =±0.861136311594 (18 30) 36− = 0.347854845137 5 0 128 225 = 0.568888888889 5 2 10 7 3± − = ±0.538469310106 (322 13 70) 900+ = 0.478628670499 5 2 10 7 3± + = ±0.906179845939 (322 13 70) 900− = 0.236926885056 6 ±0.238619186 0.4679139 ±0.661209386 0.3607616 ±0.932469514 0.1713245 Example For the integral: 1 1 sin 5xe x dx− − ∫ the results of the calculation using Gauss quadrature are listed in the table: page 76 ELEC0030 Numerical Methods © University College London 2019 – F.A. Fernandez n Integral Error % 2 −0.307533965529 3 0.634074857001 4 0.172538331616 28.71 5 0.247736352452 −2.35 6 0.241785750244 0.10 The error is also calculated compared with the exact value: 0.24203832101745. The results with few Gauss points are not very good because the function varies strongly in the interval of integration. However, for n = 6, the error is very small. Exercise 6.10 Compare the results of the example above with those of the Simpson’s and Trapezoid Rules for the same number of points. Change of Variable The above procedure was developed for definite integrals over the interval [−1, 1]. Integrals over other intervals can be calculated after a change of variables. For example if the integral to calculate is: ( ) b a f t dt∫ To change from the interval [a, b] to [−1, 1] the following change of variable can be made: ( ) 2 ( ) 2 t a bx b a − + ← − or 2 2 b a a bt x− +→ + so 2 b adt dx−= Then, the integral can be approximated by: 1 ( ) 2 2 2 b n n n i i ia b a b a a bf t dt w f x = − − + ≈ +    ∑∫ (6.45) Exercise 6.11 Use Gaussian quadrature to calculate: ∫ + 25.1 25.0 )5.0(sin dxxπ The same ideas can be extended to other types of integrals as for example, surface integrals on quadrilateral or triangular domains. The latter is commonly used in the solution of problems using the finite element method. page 77 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez 7. 
Solution of Partial Differential Equations 7.1 Solution of Ordinary Differential Equations (ODEs) – Initial Value Problems – (IVP) The solution of ordinary differential equations (ODEs) and in particular, initial value problems (IVPs) can be implemented using methods derived from the techniques described before for numerical differentiation and integration. In general, any first order ODE for a function y(t) can be written in the form: ‘ ( , )y f t y= , (7.1) That is, expressing the derivative as a function of the independent variable, t, and the function, y. For example: the equation: ‘ 0y aty b+ + = will result in ‘ ( , ) ( )y f t y aty b= = − + with an initial condition: 0 0( )y t y= . The general idea for a solution method is to start with a discretisation of the domain of interest defining a step h in the variable t, so that: 0 , 1, 2, ,it t ih i N= + =  . Since we know the value of y at 0t t= , we look for a way to predict the value at 1t t= and so forth. Denoting ( )iy t as iy , the exact calculation is based on the expression: 1 1 1 1 1’ ‘ ( , ) i i i i i i t t t i i i i i t t t y dt y y y y y dt y f t y dt + + + + += − ⇒ = + = +∫ ∫ ∫ (7.2) Then, the main concern is calculating the value of the integral in each subinterval. The simplest form of doing this is with the Euler method. Although this method is rarely used in practice due to its inferior performance, it is useful to describe the basic ideas on which many other practical methods are based. 7.1.1 Euler Method (Explicit or Forward Euler) Using a Taylor expansion truncated to the first order term, or the expression for the forward difference approximation to the derivative, we have: 2( ) ( ) ‘( ) ( )i i iy t h y t hy t O h+ = + + (7.3) From where the iterations can be established as: 0 0 1, ( , ), 1, 2, ,i i i iy y y y hf t y i N+= = + =  (7.4) In this case, the integral is simply approximated by the product ‘( )ihy t . We can see this graphically in Fig. 7.1. The method is very simple but it is not very accurate. From (7.3) we can see that the truncation error in one step is 2( )O h (local truncation error). However, this is considering that we do start the interval with the correct value. In practice, the error will accumulate (it is not just the sum of the errors). Ignoring this effect, the total, global error over the domain of N subintervals will be 2( )NO h , but N L h= so the global error is at best ( )O h . Fig. 7.1 Approximation of the integral ELEC 0030 Numerical Methods page 78 © University College London 2019 – F.A. Fernandez Example The so-called logistic equation, which describes the rate of change of a population is an example of a first order initial value ODE: ‘ (1 )y cy y= − with a starting population of 0 0( )y t y= . It shows that the rate of change of a population is proportional to the current size of the population and the remaining capacity (1−y). Discretising the variable t in the form: 0 , 1, 2, , 1it t ih i N= + = − the iterations start with 0 0( )y t y= and then: 1 (1 ), 1, 2, , 1i i i iy y hcy y i N+ = + − = − . Fig. 7.2 shows the solution of this equation for two different step sizes and two different starting conditions for c = 3. It can be seen that if the starting population is of 10%, it will grow eventually to the maximum capacity. If the starting condition is of a value greater than the capacity of the environment, the population will decrease to that value. The figure also show the difference in the calculations when different step sizes are used. 
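A calculation of the kind shown in Fig. 7.2 can be reproduced with a few lines of Matlab. This is only a sketch, not the script used for the figure: c = 3 and the 10% starting population are taken from the example above, while the time interval and the number of subintervals (here 20, the finer of the two discretisations mentioned) are assumed values.

% Sketch: explicit Euler for the logistic equation  y' = c*y*(1-y)   (cf. Fig. 7.2)
c = 3;  t0 = 0;  T = 5;  N = 20;          % interval and N are arbitrary for the sketch
h = (T - t0)/N;
t = t0 + (0:N)*h;
y = zeros(1,N+1);
y(1) = 0.1;                               % starting population of 10%
for i = 1:N
    y(i+1) = y(i) + h*c*y(i)*(1 - y(i));  % Euler step (7.4)
end
plot(t,y,'k.-')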
The red line shows a coarse discretisation using 10 subintervals and the black line shows the results of a calculation using 20 subintervals.

7.1.2 Implicit and Explicit Methods
The form of the Euler method just described, and many of its variations, are called explicit because the new approximation y_{i+1} can be calculated as an explicit expression involving only previously computed values. Although the calculations are simpler, for some problems explicit methods are unstable unless extremely small increments are used. Systems that present this problem are normally called “stiff”, and explicit methods do not perform well on them. An exact, unambiguous characterisation of stiff differential equations i
s not easy to establish but they are normally associated with situations where small variations in some parameters lead to large changes in the solution. However, this is a property of the differential equation and not of the problem, which can be formulated in a non- stiff form, without these difficulties. 7.1.3 Implicit or Backward Euler Method An implicit form of the Euler method can be derived starting from the backward difference formula or for the Taylor expansion for 1ky − 21 1( ) ( ) ‘( ) ( ) ( , )k k k k k k k ky y t h y t hy t O h y y hf t y− −= − = − + ⇒ = − Changing 1k k→ + : 1 1 1( , )k k k ky y hf t y+ + += + (7.5) The problem now is that the value of 1ky − in the r.h.s. is not known and need to be estimated. Fig. 7.2 Numerical solution of the logistic equation with two different step sizes page 79 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez Eqn. (7.5) is a normally a nonlinear equation in 1ky + and needs to be solved in each iteration. For example, for the equation: ‘ sin( ) 0y t y− = the function f is: ( , ) sin( )f t y t y= and the iterations are: 1 1 1sin( )k k k ky y ht y+ + += + Rewriting this as: 1 1( )sin( ) 0k k k ky h t h y y+ +− + − = or 1( ) 0, kf x x y += = At each step k the values of ky , kt and h are known and the equation can be solved by iterations with a root finding procedure like the fixed point iteration method or Newton-Raphson. Formulated with the fixed point iteration method this could be: 1 ( )sin( )j jk kx y h t h x + = + + and a suitable starting point could be: 0 kx y= If solved with Newton Raphson, we have: ( ) ( )sin( )k kg x x h t h x y= − + − and ‘( ) 1 ( )cos( )kg x h t h x= − + and the iterations are: 1 ( ) ‘( ) j j j j g xx x g x + = − Example For the equation: ‘ 0, (0) 1y ay y+ = = , ( , )f t y ay= − and the iterations are: 1 1k k ky y ahy+ += − or 1 1 1k k y y ah+ = + In this case the resultant equation is linear and can be solved without iterations. Fig. 7.3 shows a comparison between the results for a = 10 from the Explicit Euler method (red) and the Implicit Euler (blue). The black line shows the exact solution. Results shown with circles are for a = 10 and h = 0.195. The red line marked with “x” corresponds to the Explicit Euler method with a step size of 0.05. For step sizes larger than 0.2 the oscillations observed with the Explicit Euler method increase. For this method the iterations become: 1 (1 )k ky ah y+ = − so to avoid increasing oscillations, the step size should satisfy the condition: |1 | 1 2ah h a− < → < . Fig. 7.3 Convergence of Explicit and Implicit Euler methods ELEC 0030 Numerical Methods page 80 © University College London 2019 – F.A. Fernandez There are no restrictions for the Implicit Euler method. 7.1.4 Heun or Explicit Trapezoid Method The Heun or explicit trapezoid method is one of the variations of the Euler method that have an improved accuracy. While in the Euler method the integral in (7.2) is simply approximated by the product of the interval length (h) with the value of the derivative at the start of the interval the trapezoid method uses an approximation based on the trapezoid rule. In the simple case of an equation of the form: ' ( )y f t= , that is, if the ODE doesn’t contain the function y itself, the Euler method would be based on: 1 ( )i i iy y hf t+ = + that is (as for any other case) the slope used for the updated value is based at the point it . 
If instead of using ( )ihf t to approximate the integral in (2), the trapezoid rule expression is used, we would have: 1 1[ ( ) ( )]2i i i i hy y f t f t+ += + + That is, the value used is the average between the values at the ends of the subinterval and they are both known. However, in the general case, where the function f also depends on y, the equivalent expression would be: 1 1 1[ ( , ) ( , )]2i i i i i i hy y f t y f t y+ + += + + (7.6) but the value of y at the end of the interval, 1iy + in the right hand side is not available yet so instead, an approximation for it must be used. We can use the value that the Euler method would give at that point: 1 ( , )i i i iy y hf t y+ = + This results in the expression: 1 1[ ( , ) ( , ( , )]2i i i i i i i i hy y f t y f t y hf t y+ += + + + (7.7) This method is also explicit and it can be shown that the local truncation error is 3( )O h which gives a global error of 2( )O h , compared with first order for the Euler method. An implicit version of this method can be formulated if instead of approximating 1iy + in (7.6) by the expression given by the forward Euler method, (7.6) is solved for the value of 1iy + . This, as for the Implicit Euler method, will normally mean to find the root of a nonlinear function of 1iy + . page 81 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez Example Solution of the initial value problem in the interval [0, 1]: 3' , (0) 1y ty t y= + = The iterations for the Euler method are: 31 ( , ), ( , )i i i i i i i i iy y hf t y f t y t y t+ = + = − and for the trapezoid method: 1 1 1[ ( , ) ( , )]2i i i i i i hy y f t y f t y+ + += + + Fig. 7.4 Comparison between Trapezoid and Euler methods Now, 1iy + in the last term is replaced by the value given by the Euler method approximation: 1 1[ ( , ) ( , ( , )]2i i i i i i i i hy y f t y f t y hf t y+ += + + + where 3 31 1 1 1( , ( , ) ( ( )) ;i i i i i i i i i i i if t y hf t y t y h t y t t t t h+ + + ++ = + − − = + It can be seen that even with just 10 subintervals, the results from the Trapezoid method are indistinguishable from the exact solution. 7.1.5 Midpoint method Another variation that leads to a method of order 2 is based on the use of the midpoint rule to approximate the integral in (7.2): 1 1 ' i i t i i t y y y dt + + = + ∫ The midpoint rule in this case would be: 1 ( , ) i i t t f dt h f c y + =∫ where c is the midpoint of the interval, that is: 2it h+ . In this expression, the function f needs to be evaluated at 2it c t h= = + and with ( 2)iy y t h= + which is not directly available. We can do the same as in the Trapezoid method and use the Euler method to approximate this value: ( 2) ( 2) ( , )i i i iy t h y h f t y+ = + so the full expression for the midpoint method will be: 1 ( 2, ( 2) ( , ))i i i i i iy y h f t h y h f t y+ = + + + (7.8) It can be shown that as for the trapezoid method, the local truncation error is 3( )O h and the global error is 2( )O h . 7.1.6 Taylor methods The Euler method is derived from the Taylor expansion truncated to order one (see (7.3)). In the same form, we can derive different methods using truncation to a higher degree. In this way, a Taylor method of order k uses a truncation to order k. The Euler method is a Taylor method of ELEC 0030 Numerical Methods page 82 © University College London 2019 – F.A. Fernandez order one. Using a Taylor expansion truncated to order k: 2 ( ) 1( ) ( ) '( ) ''( ) ( ) ( ) 2 ! 
k k k i i i i i h hy t h y t hy t y t y t O h k ++ = + + + + + we can formulate the Taylor method of order k in the following form: 0 0( )y y t= 2 ( 1) 1 ( , ) '( , ) ( , )2 ! k k i i i i i i i i h hy y hf t y f t y f t y k − + = + + + + where the derivatives of f are the total derivatives: ( , ) ( , ) ( , ) ( , )'( , ) ( , )i i i i i i i ii i i i f t y f t y f t y f t ydyf t y f t y t y dt t y ∂ ∂ ∂ ∂ = + = + ∂ ∂ ∂ ∂ (7.9) and so on. The local truncation error of these methods is 1( )kO h + and then, its global error is ( )kO h For example, the second order Taylor method will take the form: 2 1 , , ( , ) ( , ) 2 i i i i i i i i i i t y t y h f fy y hf t y f t y t y+ ∂ ∂  = + + + ∂ ∂  (7.10) Example For the Initial Value Problem: 3 0 0' , ( )y ty t y t y= + = Since, 3( , )f t y ty t= + , it foll ows from (7.9) that 2 3'( , ) 3 ( )f t y y t t ty t= + + + and then the second order Taylor method gives: 2 3 2 4 1 ( ) ( (3 ) )2i i i i i i i i i hy y h t y t y y t t+ = + + + + + + The problem with these methods is that they require the calculation of the derivatives of f(t,y), which sometimes can be complicated. Carl Runge and Wilhelm Kutta devised a procedure to avoid these calculations. 7.1.7 Runge-Kutta methods This is the name given to a whole family of methods derived from the Taylor expansion and include the Euler method (Runge-Kutta of order 1), the Trapezoid method, the midpoint method and others. Second order Runge-Kutta methods To avoid the calculation of the derivative in (7.10), the iteration equation is written as: 1 0 1 1 2( )i iy y h c k c k+ = + + (7.11) and it can be shown (see Appendix, section 6) that the coefficients satisfy the equations: page 83 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez 0 1 1 1 1 1 2 1 2 c c c p c q + = = = (7.12) These are three equations and there are four coefficients, so one of them can be chosen arbitrarily. Usually 1c is chosen and the common values are: 1 0 1 0 1 0 0 1 0 0 Euler method 1 2 1 2 1 1 Trapezoid method 1 0 1 2 1 2 Midpoint method c c p q c c p q c c p q = → = = = = → = = = = → = = = Another method, not derived here, is the Ralston’s method which is specified by the value: 1 2 3c = , which gives: 0 1 3,c = , 3 4p = and 3 4q = . Similarly, for Taylor methods of higher order, equivalent expressions can be derived to avoid the explicit calculation of the derivatives. Fourth order Runge-Kutta method (RK4) The most common of the Runge-Kutta methods is the order 4, called RK4 because its global error is 4( )O h . The derivation of the corresponding expression is rather involved and is omitted here. The iterations are specified by: 1 1 2 3 4( 2 2 )6i i hy y s s s s+ = + + + + (7.13) where: 1 2 1 3 2 4 3 ( , ) ( 2, 2 ) ( 2, 2 ) ( , ) i i i i i i i i s f t y s f t h y h s s f t h y h s s f t h y hs = = + + = + + = + + (7.14) Note that in (7.13) the term 1 2 3 4( 2 2 ) 6h s s s s+ + + is used as the estimate of the integral of the slope in the interval in a form that is more accurate than the previous methods. Programming of this procedure is straightforward and this is another reason for its popularity. 
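Before the worked RK4 example below, the whole second-order family (7.11)–(7.12) can be seen compactly in code. In the sketch that follows, k1 and k2 are taken with their usual meanings, k1 = f(t_i, y_i) and k2 = f(t_i + p h, y_i + q h k1); this is an assumption consistent with (7.12), since the Appendix derivation is not reproduced here. The test problem is the one used in the examples above, and the step size is an arbitrary choice.

% Sketch: a second-order Runge-Kutta step (7.11) for y' = f(t,y)
f  = @(t,y) t.*y + t.^3;              % test problem from the examples above, y(0) = 1
c1 = 1/2;  c0 = 1 - c1;               % c1 = 1/2 gives the trapezoid (Heun) method
p  = 1/(2*c1);  q = 1/(2*c1);         % consistency conditions (7.12)
h = 0.1;  t = 0;  y = 1;              % arbitrary step size
for i = 1:10                          % integrate up to t = 1
    k1 = f(t, y);
    k2 = f(t + p*h, y + q*h*k1);
    y  = y + h*(c0*k1 + c1*k2);
    t  = t + h;
end
fprintf('y(1) approx. %.6f   (exact %.6f)\n', y, 3*exp(0.5) - 3)

Choosing c1 = 1 instead reproduces the midpoint method, and c1 = 2/3 the Ralston method, as listed above.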
Example Using the Runge-Kutta method of order 4 to solve the initial value problem: 3'y ty t= + with (0) 1y = which has the exact solution: 2 2 2( ) 3 2ty t e t= − − , gives: steps step size h error at t=1 5 0.20000000 2.378807e-05 10 0.10000000 1.465466e-06 20 0.05000000 9.035429e-08 40 0.02500000 5.598290e-09 80 0.01250000 3.482015e-10 160 0.00625000 2.170353e-11 320 0.00312500 1.338929e-12 640 0.00156250 1.050271e-13 Results calculated using the script ExampleRK4.m: L=1; % Exact value at t=1 R=3*exp(0.5)-3; ELEC 0030 Numerical Methods page 84 © University College London 2019 – F.A. Fernandez disp('steps step size h error at t=1') for n=[5,10,20,40,80,160,320,640] h=L/n; yk=1; tk=0; for k=1:n s1=functf(tk,yk); tk=tk+h/2; s2=functf(tk,yk+(h/2)*s1); s3=functf(tk,yk+(h/2)*s2); tk=tk+h/2; s4=functf(tk,yk+h*s3); yk=yk+(h/6)*(s1 + 2*s2 + 2*s3 + s4); end e=abs(R-yk); fprintf('%5d %12.8f %15.6e\n',n,h,e) end function s=functf(ti,yi) s=ti*yi+ti^3; end 7.1.8 Systems of ODEs A numerical solution of an IVP consisting of a system of ODEs can be implemented as a simple extension of the same methodology used for a single ODE as described in (7.2) above. For example, a system of first order ODEs for the functions: 1 2, , , ny y y , can be expressed by: 1 1 1, 2 2 2 1, 2 1, 2 ' ( , , , ) ' ( , , , ) ' ( , , , ) n n n n n y f t y y y y f t y y y y f t y y y = = =     (7.15) or 1 2' ( , , , , ), 1, 2, ,i i ny f t y y y i n= =  each with its own initial conditions. We can write the system in a compact form as the vector equation ' ( , )Y f t Y= where the vector Y is: 1 2( , , , ) T ny y y and ( , )f t Y is the vector 1 2( ( , ), ( , ), , ( , )) T nf t Y f t Y f t Y . The initial conditions will be given by the vector 0 1 0 2 0 0( ( ), ( ), , ( ))nY y t y t y t=  . Example Apply the Euler method to the system of fist order ODEs: 21 2 1' 2y y y= − 22 1 2 2'y y y ty= − − with the initial conditions: 1(0) 0y = and 2 (0) 1y = . The individual equations (scalar) for the Euler iterations are: 21, 1 1, 2, 1,( 2 )i i i iy y h y y+ = + − 22, 1 2, 1, 2, 2,( )i i i i i iy y h y y t y+ = + − − and a simple program can be written to perform these iterations starting from the initial conditions. page 85 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez 7.1.9 Higher order ODEs Higher order equations can also be formulated in the same form as a system of equations. For example an equation of second order such as: '' ' 0y aty by c+ + + = , with the initial conditions: 0 0( )y t y= , 0 0'( ) 'y t y= can be written as '' ( , , ')y f t y y= where ( , , ') 'f t y y aty by c= − − − . We can define: 1y y= , 2 1' ( ')y y y= = and we have the system of equations: 1 2'y y= 2 1 2 2 1' ( , , )y f t y y aty by c= = − − − with the initial conditions: 1 0 1,0( )y t y= , 2 0 2,0( )y t y= . In general, for an equation of order n we have: ( ) ( 1)( , , ', '', , )n ny f t y y y y −=  (7.16) and we can treat this as a system of equations, where the vector Y is: ( 1)( , ', '', , )n TY y y y y −=  so the system of equations becomes: ( ) ( 1) , 1, 2, , 1k ky y k n+= = − . (7.17a) and ( ) ( 1)( , , ', '', , )n ny f t y y y y −=  (7.17b) Example Convert the third order ODE: 2''' ( '') ' '' siny a y y yy t= − + + into a system of first order equations. 
Defining the functions: 1y y= , 2 1 'y y= and 3 2 'y y= we have the system of equations: 1 2'y y= 2 3'y y= and 23 3 2 1 3' siny ay y y y t= − + + 7.2 Solution of Partial Differential Equations 7.2.1 The Finite Difference Method: Solution of Differential Equations by Finite Differences We will study here a direct method to solve differential equations which is based on the use of finite differences. This consists of the direct substitution of derivatives by finite differences and thus the problem reduces to an algebraic form. One Dimensional Problems: Let’s consider a problem given by the equation: Lf = g (7.18) where L is a linear differential operator and g(x) is a known function. The problem is to find the function f(x) satisfying equation (7.18) over a given region (interval) [a, b] and subjected to some boundary conditions at a and b. The basic idea is to substitute the derivatives by appropriate difference formulae like those seen in section 6. This will convert the problem into a system of algebraic equations. In order to apply systematically the difference approximations we proceed as follow: ELEC 0030 Numerical Methods page 86 © University College London 2019 – F.A. Fernandez –– First, we divide the interval [a, b] into N equal subintervals of length h : h = (b–a)/N, defining the points xi : xi = a + ih –– Next, we approximate all derivatives in the operator L by appropriate difference formulae (h must be sufficiently small – N large – to do this accurately). –– Finally, we formulate the corresponding difference equation at each point xi. This will generate a linear system of N–1 equations on the N–1 unknown values of fi = f(xi). Example: If we take Lf = g to be a general second order differential equation: ''( ) '( ) ( ) ( )cf x df x ef x g x+ + = (7.19) with constant coefficients c, d and e. [ , ]x a b∈ and boundary conditions: f(a) at a and f(b) at b. (7.20) Using (6.11) and (6.7) at point i: 1 12 2'' i i ii f f ff h − +− +≈ and 1 1' 2 i i i f ff h + −−≈ (7.21) results for the equation at point i: 1 1 1 12 2 2 i i i i i i i f f f f fc d e f g hh − + + −− + −+ + = (7.22) or: ( )2 21 1 2 2 2i i i i d dc h f c eh f c h f g h− +    − + − + + + =        (7.23) for all i except i = 1 and N–1, where for i = 1: fi-1 = f(a) and for i = N–1: fi+1 = f(b) This can be written as a matrix problem of the form: A f = g, where f = { fi} g = {h2gi} are vectors of order N–1. The matrix A has only 3 elements per row: , 1 2i i da c h−  = −    , ( )2, 2i ia c eh= − + , , 1 2i i da c h+  = +    Exercise 7.1 Formulate the algorithm to solve a general second order differential equation over the interval [a, b], with Dirichlet boundary conditions at a and b (known values of f at a and b). Write a short computer program to implement it and use it to solve: f’’ + 2f’ + 5f = e–xsin(x) in [0,5]; f(0) = f(5) = 0 Two Dimensional Problems We consider now the problem Lf = g in 2 dimensions, where f is a function of 2 variables, say: x and y. In this case, we need to approximate derivatives in x and y in L. To do this, we superimpose a regular grid over the domain of the problem and (as for the 1–D case) we only consider the values of f and g at the nodes of the grid. page 87 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez Example: Referring to the figure below, the problem consists of finding the potential distribution between the inner and outer conducting surfaces of square cross-section when a fixed voltage is applied between them. 
The equation describing this problem is the Laplace equation: 2 0φ∇ = with the boundary conditions φ = 0 in the outer conductor and φ =1 in the inner conductor. To approximate 2 2 2 2 2x y φ φφ ∂ ∂∇ = + ∂ ∂ (7.24) we can choose for convenience the same spacing h for x and y. Ignoring the symmetry of the problem, the whole cross-section is discretized (as in Fig. 7.5). With this choice, there are 56 free nodes (for which the potential is not known). Fig. 7.5 Finite difference mesh for solution of the electrostatic field in square coaxial line. On this mesh, we only consider the unknown internal nodal values and only those nodes are numbered. An internal point of the grid, xi, labelled by O in Fig. 7.4, is surrounded by other four, which for convenience are labelled by N, S, E and W. For this point we can approximate the derivatives in (7.23) by: ( ) 2 2 2 1 2W O Ex h φ φ φ φ∂ ≈ − + ∂ and ( ) 2 2 2 1 2N O Sy h φ φ φ φ∂ ≈ − + ∂ (7.25) Then the equation 2 0φ∇ = becomes simply: ( )2 24N S E W O hφ φ φ φ φ φ∇ ≈ + + + − = 0 or φN +φS + φE + φW − 4φO = 0 (7.26) ELEC 0030 Numerical Methods page 88 © University College London 2019 – F.A. Fernandez Formulating this equation for each point in the grid and using the boundary conditions where appropriate, we end up with a system of N equations, one for each of the N free points of the grid. Applying equation (7.26) to point 1 of the mesh gives: 10 2 10 0 4 0φ φ φ+ + + − = (7.27) and to point 2: 11 3 1 20 4 0φ φ φ φ+ + + − = (7.28) A typical interior point such as 11 gives: 2 20 12 10 114 0φ φ φ φ φ+ + + − = (7.29) and one near the inner conductor, 13 will give: 4 14 12 131 4 0φ φ φ φ+ + + − = (7.30) In this way, we can assemble all 56 equations from the 56 mesh points of Fig. 7.4, in terms of the 56 unknowns. The resulting 56 equations can be expressed as: A x = y (7.31a) or 1 2 3 13 55 56 4 1 1 0 1 4 1 1 0 1 4 1 0 1 4 1 0 1 4 0 φ φ φ φ φ φ −          −          −     ⋅⋅ ⋅         ⋅⋅ ⋅ =     ⋅ −         ⋅ ⋅      ⋅ ⋅         −          −          (7.31b) The unknown vector x of (7.30a) is simply (φ1, φ2, . . . , φ56)T. The right-hand-side vector y of eqs. (7.31) consists of zeros except for the –1s coming from equations like (7.30) corresponding to points next to the inner conductor with potential 1; namely, from points 12 to 16, 20, 21, 24, 25, 28, 29, 32, 33, 36, 37, 41 to 45. The 56×56 matrix A has mostly zero elements, except for ‘–4’ on the diagonal, and either two, three or four ‘+1’ somewhere else on each matrix row, the number and distribution depending on the geometric node location. One has to be careful not to confuse the row-and-column numbers of the 2-D array A with the ‘x and y coordinates’ of the physical problem. Each number 1 to 56 of the mesh points in the figure above corresponds precisely to the row number of matrix A and the row number of column vector x. Equation (7.31) is standard, as given in (5.11), and could be solved with a standard library package, with a Gauss elimination solver routine, or better, a sparse or a band-matrix version of Gauss elimination. Better still would be the Gauss-Seidel or successive displacement. Section 2 in the Appendix shows details of such a solution. Enforcing Boundary Conditions In the description so far we have seen the implementation of the Dirichlet boundary page 89 ELEC 0030 Numerical Methods © University College London 2019 – F.A. 
Fernandez condition; that is a condition where the values of the desired function are known at the edges of the region of interest (ends of the interval in 1-D or boundary of a 2-D or 3-D region). This has been implemented in a straightforward way in equations (7.27), (7.28) and (7.30). Frequently, other types of boundary conditions appear, for example, when the derivatives of the function are known (instead of the values themselves) at the boundary – this is the Neumann condition. In occasions, a mixed condition will apply; something like: ( ) ff r K n ∂ + = ∂ , where the second term refers to the normal derivative (or derivative along the normal direction to the boundary. This type of condition appears in some radiation problems (Sommerfeld radiation condition). We will see next some forms of dealing with these boundary conditions in the context of finite difference calculations. For example in the case of the ‘square coaxial problem’ studied earlier, we can see that the solution will have symmetry properties, which makes unnecessary the calculation of the potential over the complete cross-section. In fact only one eighth of the cross-section is actually needed. The new region of interest can be one eighth of the complete cross-section: the shaded region or one quarter of the cross-section, to avoid oblique lines. In any case, the new boundary conditions needed on the dashed lines that limit the new region of interest are of the Neumann type: 0 n φ∂ = ∂ since the lines of constant potential will be perpendicular to those edges. (n represents the normal to the boundary). Fig. 7.6 We will need a different strategy to deal with these conditions. For this condition it is more convenient to define the mesh in a different manner. If we place the boundary at half the node spacing from the start of the mesh as in Fig. 7.7, we can implement the normal derivative condition in a simple form: (Note that the node numbers used in Fig. 7.7 do not correspond to the mesh numbering defined earlier for the whole cross-section). Fig. 7.7 The boundary condition that applies at point b (not actually part of the mesh) is: 0 n φ∂ = ∂ We can approxi mate this by using a central difference with the point 1 and the auxiliary point a outside the mesh. Then: 10 a bn h φ φφ −∂ = = ∂ and then, 1aφ φ= . So the corresponding equation for point 1 is: φN +φS + φE + φW − 4φ0 = 0 Substituting values we have: 0 +φ10 +φ1 +φ2 − 4φ1 = 0 ELEC 0030 Numerical Methods page 90 © University College London 2019 – F.A. Fernandez or φ2 + φ10 − 3φ1 = 0 A Dirichlet condition can also be implemented in a similar form, if necessary. For example if part of a straight boundary is subjected to a Neumann condition and the rest to a Dirichlet condition (as for example in the case of a conductor that does not cover the whole boundary), a Dirichlet condition will have to be implemented in a mesh separated half of the spacing from the edge. Exercise 7.2: Consider the following figure and the point a on the boundary (not in the mesh) with the points 5 and 15 and a virtual point (shown in grey) above the boundary. Use Taylor expansions up to order 2 at the 3 points ±h/2 and 3h/2 from a, to find the value of φ on the virtual point when the function φ has a fixed value V at the boundary. This corresponds to the case where part of a boundary is subjected to a Dirichlet boundary condition while other has a Neumann condition. Fig. 
7.8 Using the results of Exercise 7.2, the difference equation corresponding to the discretization of the Laplace equation at point 5 where the boundary condition is φ = V: 4 6 15 54 0Nϕ ϕ ϕ ϕ ϕ+ + + − = but from Ex. 7.2: 5 15 1 (8 6 ) 3N Vϕ ϕ ϕ= − + so: 4 6 15 5 4 86 3 3 Vϕ ϕ ϕ ϕ+ + − = − Exercise 7.3 Using Taylor expansions, derive an equation to implement the Neumann condition using five points along a line normal to the edge and at distances 0, h, 2h, 3h, and 4h. Exercise 7.4 (Advanced) Using two points like 11 and 12 in the figure of Exercise 7.2, find the equation corresponding to the implementation of a radiation condition of the form: p K n φφ ∂+ = ∂ , where p is a constant with the same dimensions as h. Repeat using 3 points along the same line and use the extra degree of freedom to increase the degree of approximation (eliminating second and third derivatives). Example: Heat conduction (diffusion) in a uniform rod. A uniform rod of length L is initially at temperature 0. It is then connected to a heat source at temperature 1 at one end while the other is attached to a sink at temperature 0. Find the temperature distribution along the rod, as time varies. We need the time-domain solution (transient). In the steady state we would expect a linear distribution of temperature between both ends. The equation describing the temperature variation in space and time, assuming a heat conductivity of 1 is the normalized diffusion equation: page 91 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez 2 2 u u tx ∂ ∂ = ∂∂ (7.32) where u(x,t) is the temperature. The boundary and initial conditions are: (0, ) 0 for all ( , ) 1 u t t u L t =  =  u(x,0) = 0 for x < L We can discretize the solution space (x.t) with a regular grid with spacing ∆x and ∆t: The solution will be sought at positions: xi, i = 1, ...., M–1 (leaving out of the calculation the points at the ends of the rod), so ∆ x = L/M) and at times: tn, n = 0, 1, 2,... . Using this discretization, the boundary and initial condition become: 0 0 for all 1 n n M u n u =   =  0 0iu = for i = 0, ...., M–1 (7.33) We now discretize equation (7.32), converting the derivatives into differences: For the time derivative at time n and position i, we can use the central difference formula: 2 1 1 2 2 , 2n n ni i i i n u u u u x x − +∂ − += ∂ ∆ (7.34) In the right hand side of (7.32), we need to approximate the time derivative at the same point (i,n). We could use the forward difference: 1n n i iu uu t t + −∂ ≈ ∂ ∆ (7.35) and in this case we get: 1 1 1 2 2n n n n ni i i i iu u u u u tx + − +− + −= ∆∆ (7.36) as the difference equation. Rearranging: 1 1 12 2 21 2 n n n n i i i i t t tu u u u x x x + − + ∆ ∆ ∆ = + − + ∆ ∆ ∆  (7.37) This gives us a form of calculating the temperature u at point i and time n+1 as a function of u at points i, i–1 and i+1, at time n. We have then a time-stepping algorithm. Fig. 7.9 Equation (7.37) is valid for n = 0,...., N and i = 1, ...., M–1. Special cases: for i = 1: 1 n iu − = 0, and for i = M–1: 1 n iu + = 1 for all n (at all times). If we call: b = 1− 2 ∆t ∆x2       and 2 tc x ∆ = ∆ , we can rewrite (7.36) as: ELEC 0030 Numerical Methods page 92 © University College London 2019 – F.A. 
Fernandez 1 1 1 2 1 2 1 2 3 1 1 2 1 n n n n n n n n n n M M M u bu cu u cu bu cu u cu bu c + + + − − − = + = + + = + +   (7.38) which can be written in matrix form as: 1n+u = A nu + v, where the matrix A and the vector v are: b c c b c c b c c b ⋅ ⋅   ⋅ ⋅  = ⋅ ⋅   ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅   ⋅  A and 0 0 0 c       = ⋅         v (7.39) nu is the vector containing all values niu . It is known at n = 0, so the matrix equation (7.38) can be solved for all successive time steps. Problems with this formulation: – It is unstable unless ∆ t ≤ ∆ x2/2 (c ≤ 0.5 or b ≥ 0) This is due to the rather poor approximation to the time derivative provided by the forward difference (7.35). A better scheme can be constructed using the central difference in the RHS of (7.32) instead of the forward difference. The Crank-Nicolson method: Using the central difference for the time derivative A central difference approximation to the RHS of (7.15) would be: 1 1 , 2 n n i i i n u u u t t + −∂ − = ∂ ∆ This would involve values of u at three time steps: n–1, n, and n+1, which would be undesirable. A solution to this is to shrink the gap between the two values, having instead of the difference between 1niu + and 1niu − , the difference between 1niu + and niu . However, if we consider this a ‘central’ difference, the derivative must be evaluated at the central point which corresponds to the value n+1/2: 1 , 1/2 n n i i i n u u u t t + + ∂ − = ∂ ∆ (7.40) We then need to evaluate the left hand side of (7.32) at the same time, but this is not on the grid! We will have: 2 1/2 1/2 1/2 1 1 2 2 , 1/2 2n n ni i i i n u u u u x x + + + − + + ∂ − + = ∂ ∆ (7.41) Since the values of u are restricted to positions in the grid, we have to approximate the values in the RHS of (7.41). Since those values are at the centre of the intervals, we can approximate them by the average between the neighbouring grid points: 1 1/2 2 n n n i i i u uu + + +≈ and similarly for i–1, and i+1. (7.42) page 93 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez We can now substitute (7.42) into (7.41) and evaluate the equation (7.32) at time n+1/2. After re- arranging this gives: 1 1 11 1 1 12 2 n n n n n n i i i i i iu du u u eu u + + + − + − +− + = − + − (7.43) where 2 1 xd t  ∆ = + ∆  and e = 1− ∆x2 ∆t       This form of treating the derivative (evaluating the equation between grid points in order to have a central difference approximation for the first order derivative is called the Crank-Nicolson method and has several advantages over the previous formulation (using the forward difference). Fig. 7.10 shows the scheme of calculation: the points or values involved in each calculation. The dark spot represents the position where the equation is evaluated. This method provides a second order approximation for both deriv atives and also, very important, it is unconditionally stable. n+1 n i i+1i–1 n+1/2 Fig. 7.10 Crank-Nicolson scheme We can now write (7.43) for each value of i as in (7.38), considering the special cases at the ends of the rod and at t = 0, and write the corresponding matrix form. 
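As an aside before doing so, the explicit scheme (7.37)-(7.38) is very easy to program directly. The following minimal MATLAB sketch (with illustrative, not prescribed, values for M and the number of time steps) advances the rod temperatures in time, keeping the time step within the stability limit dt <= dx^2/2:

L=1; M=20; dx=L/M;
dt=0.4*dx^2;                 % satisfies the stability condition dt <= dx^2/2
c=dt/dx^2;  b=1-2*c;
u=zeros(M-1,1);              % interior values, u(x,0)=0
for n=1:500
  uL=[0; u(1:end-1)];        % left neighbours, with u=0 at x=0
  uR=[u(2:end); 1];          % right neighbours, with u=1 at x=L
  u=b*u + c*(uL+uR);         % equation (7.38)
end

Returning to the Crank-Nicolson equations (7.43) and their matrix form: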
In this case we will get: Au n+1 = Bun − 2v (7.44) where: A = −2d 1 ⋅ 1 −2d 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 −2d             , B = 2e −1 ⋅ −1 2e −1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ −1 2e             and v = 0 ⋅ ⋅ 1             Example Consider now a parabolic equation in 2 space dimensions and time, like the Schrödinger equation or the diffusion equation in 2D: 2 2 2 2 u u ua tx y ∂ ∂ ∂ + = ∂∂ ∂ (7.45) For example, this could represent the temperature distribution over a 2-dimensional plate. Let’s consider a square plate of length 1 in x and y and the following boundary conditions: a) u(0,y,t) = 1 b) u(1,y,t) = 0 c) u(x,0,t) = 1 d) u(x,1,t) = 0 e) u(x,y,0) = 0 for x > 0 and y > 0 That is, the sides x = 0 and y = 0 are kept at temperature 1 at all times while the sides x = 1 and y = 1 are kept at u = 0. The whole plate, except the sides x = 0 and y = 0 are at temperature 0 at t = 0. The following diagram shows the discretization for the x coordinate: ELEC 0030 Numerical Methods page 94 © University College London 2019 – F.A. Fernandez There are R+2 points with R unknown values (i = 1, …, R). The two extreme points: i = 0 and i = R+1 correspond to x = 0 and x = 1, respectively. The discretization of the y coordinate can be made in similar form. For convenience, we can also use R+2 points, where only R are unknown: (j = 1, …, R). Discretization of time is done by considering t = n∆ t, where ∆ t is the time step. Approximation of the time derivative (first order): As in the previous example, in order to use the central difference and only two time levels, we use the Crank-Nicolson method. That is, equation (7.45) will be evaluated half way between time grid points. This will give: 1 , , , , 1/2 n n i j i j i j n u uu t t + + −∂ = ∂ ∆ (7.46) We need to approximate the LHS at the same time using the average of the values at n and n+1: ( )2 ( 1/2) 2 ( ) 2 ( 1)12n n nu u u+ +∇ = ∇ + ∇ Applying this and (7.46) to (7.45) we get: ( )2 ( 1) 2 ( ) ( 1) ( )2n n n nau u u ut + +∇ + ∇ = − ∆ or re-writing: 2 ( 1) ( 1) 2 ( ) ( )2 2n n n na au u u u t t + +  ∇ − = −∇ + ∆ ∆  (7.47) where u is still a continuous function of position. Approximation of the space derivative (second order): Using the central difference for the second order derivatives and using ∆x = ∆y = h, we get: ( )2 2 1 4N S W E Ou u u u u uh ∇ = + + + − and using this in (7.47): 2 1 1 1 1 124n n n n nN S W E O h au u u u u t + + + + + + + + − + ∆  224n n n n nN S W E O h au u u u u t    = − + + + − −   ∆   (7.48) Defining now node numbering over the grid and a vector u(n) containing all the unknown values (for all i and j), (7.47) can be written as a matrix equation of the form: Au(n+1) = − Bu(n) (7.49) Exercise 7.5 A length L of transmission line is terminated at both ends with a short circuit. A unit impulse voltage is applied in the middle of the line at time t = 0. The voltage φ(x, t) along the line satisfies the following differential equation: page 95 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez 2 2 2 2 2 0px t φ φ∂ ∂ + = ∂ ∂ with the boundary and initial conditions: (0, ) ( , ) 0t L tφ φ= = , φ(x,t) = 0 for t < 0 and φ(L 2,0) =1 a) By discretizing both the coordinate x and time t as xi = (i – 1)∆x , i = 1, 2, .., R+1 and tm = m∆t , m = 0, 1, 2, ..., use finite differences to formulate the numerical solution of the equation above. 
Show that the problem reduces to a matrix problem of the form: Φm+1 = AΦm + BΦm−1 where A and B are matrices and Φm is a vector containing the voltages at each of the discretization points xi at time tm. b) Choosing R = 7, find the matrices A and B giving the values of their elements, taking special care at the edges of the grid. Show the discretized equation corresponding to points at the edge of the grid (for i = 1 or R+1) and consider the boundary condition φ(0) = φ(L) = 0. c) How would the matrices A and B change if the boundary condition was changed to 0 x φ∂ = ∂ at x = 0, L? (Corresponding in this case to an open circuit). Show the discretized equation corresponding to one of the edge points and propose a way to transform it so it contains only values corresponding to points in the defined grid. ELEC 0030 Numerical Methods page 96 © University College London 2019 – F.A. Fernandez 8. Solution of Boundary Value Problems A physical problem is often described by a differential equation of the form: ( ) ( )u x s x= L on a region Ω. (8.1) where L is a linear differential operator, ( )u x is a function to be found, ( )s x is a known function and x is the position vector of any point in Ω. We also need to impose boundary conditions on the values of u and/or its derivatives on Γ, the boundary of Ω, in order to have a unique solution to the problem. The boundary conditions can be written in general as: ( ) ( )u x t x= B on Γ. (8.2) with B, a linear differential operator and ( )t x a known function. So we will have for example: B = 1 : ( ) ( )u x t x=  : known values of u on Γ (Dirichlet condition) n ∂ = ∂ B : ( ) ( )u x t x n ∂ = ∂   : fixed values of the normal derivative (Neumann condition) k n ∂ = + ∂ B : ( )u ku t x n ∂ + = ∂  : mixed condition (Radiation condition) A problem described in this form is known as a boundary value problem (differential equation + boundary conditions). 8.1 Strong and Weak Solutions to Boundary Value Problems 8.1.1 Strong Solution: This is the direct approach to the solution of the problem, as specified above in (8.1)-(8.2); for example, the finite difference solution method studied earlier is of this type. 8.1.2 Weak Solution: This is an indirect approach. Instead of trying to solve the problem directly, we can re- formulate it as the search for a function that satisfies some conditions also satisfied by the solution to the original problem (8.1)-(8.2). With a proper definition of these conditions the search will lead to the correct and unique solution. Before expanding on the details of this search, we need to use the idea of inner product between two functions defined earlier. This will allow us to quantify (put a number to) how close a function is to another or how big or small is an error function. The commonest definition of inner product between two functions f and g is: f , g = fg dΩ∫ for real f and g or f , g = f g *dΩ∫ for complex functions f and g (8.3) In general, the inner product between two functions will be a real number obtained by global operations between the functions over the domain and satisfying some defining properties (for real functions): i) f , g = g, f ii) αf + βg,h = α f ,h + β g, h α and β scalars (8.4) page 97 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez iii) f , f = 0 if and only if f = 0 From this definition we can deduce that: If for a function r, we have that r,h = 0 for any choice of h, then 0r ≡ . We will try to use this property for testing an error residual r. 
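To make the residual-testing idea concrete, the inner product (8.3) can be evaluated numerically. The short MATLAB sketch below (the residual r and the weighting functions chosen here are arbitrary illustrations) shows that a nonzero residual produces nonzero inner products when tested against a few functions h:

x=linspace(0,1,1001);
r=x.^2 - x/3;                      % some (nonzero) residual function on [0,1]
for i=1:4
  w=sin(i*pi*x);                   % test functions h = w_i
  fprintf('<r,w_%d> = %10.3e\n', i, trapz(x, r.*w));   % inner product <r,w_i>
end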
We are now in position to formulate the weak solution of (8.1)-(8.2) as: Given the function ( )s x and an arbitrary function ( )h x in Ω; find the f unction ( )u x that satisfies: hshu ,, =L or 0, =− hsuL (8.5 for any choice of the function h. 8.2 Approximate Solutions to the Weak Formulation 8.2.1 The Rayleigh-Ritz Method In this case, instead of trying to find the exact solution ( )u x that satisfies (8.5), we only look for an approximation. We can choose a set of basis functions and assume that the wanted function, or rather, a suitable approximation to it, can be represented as an expansion in terms of the basis functions (as for example with Fourier expansions using sines and cosines). Now, once we have chosen the basis functions, the unknown is no longer the function u, but its expansion coefficients; that is: numbers! so the problem (8.5) is converted from “finding a function u among all possible functions defined on Ω” into the search for a set of numbers. We now need some method to find these numbers. We will see two methods to solve (84) using the Rayleigh-Ritz procedure above: Weighted Residuals Variational Method 8.2.2 Weighted Residuals For this approach the problem (8.5) is rewritten in the form: r = Lu – s = 0 (8.6) where the function r is the residual or error or the difference between the LHS and the RHS of (8.1) when we try any function in place of u. This residual will only be zero when the function we use is the correct solution to the problem. We can now look at the solution to (8.6) as an optimization (or minimization) problem; that is: “Find the function u such that r = 0” or in approximate way: “Find the function u such that r is as small as possible” and here we need to be able to measure how ‘small’ is the error (or residual). Using these ideas on the weak formulation, this is now transformed into: Given the function ( )s x and an arbitrary function ( )h x in Ω; find the function ( )u x that satisfies: r, h = 0 (where r = Lu – s) (8.7) for any choice of the function h. Applying the Rayleigh-Ritz procedure, we can put: 1 ( ) ( ) N j j j u x d b x = = ∑  (8.8) ELEC 0030 Numerical Methods page 98 © University College London 2019 – F.A. Fernandez where dj, j = 1, ... , N are scalar coefficients and ( )jb x  , j = 1, ... , N are a set of linearly independent basis functions (normally called trial functions – chosen) We now need to introduce the function h. For this, we can also use an expansion, in general using another set of functions ( )iw x  for example: 1 ( ) ( ) N i i i h x c w x = = ∑  (8.9) where ci, i = 1, ... , N are scalar coefficients and ( )iw x  , i = 1, ... , N a set of linearly independent expansion functions for h (normally called weighting functions – also chosen) Using now (8.9) into (8.7) we get: 1 1 , ( ), ( ) ( ), ( ) 0 N N i i i i i i r h r x c w x c r x w x = = = = =∑ ∑    (8.10) for any choice of the coefficients ci. Since (8.10) has to be satisfied for any choice of the coefficients ci (equivalent to say ‘any choice of h), we can deduce that: , 0ir w = for all i, i = 1, ... , N (8.11) which is simpler than (8.10). (It comes from the choice: ci = 1 and all others = 0, in sequence). So, we can conclude that we don’t need to test the residual against any (and all) possible functions h; we only need to test it against each of the chosen weighting functions wi, i = 1, ... , N. 
The weak formulation can now be expressed as: Given the function ( )s x and the weighting functions ( )iw x  in Ω; find the function ( )u x that satisfies: , 0ir w = for all i, i = 1, ... , N (where r = Lu – s) (8.11) Now, we can expand the function u in terms of the trial (basis) functions as in (8.8) and use this expansion in r: r = Lu – s. With this, (8.11) becomes: 0)( ,)()(,, 1 =−=−= ∑ = xwxsxbdwsuwr i N j jjii LL for all i or 1 , , , 0 N i j j i i j r w d b w s w = = − =∑ L for all i (8.12) Note that in this expression the only unknowns are the coefficients dj. We can rewrite this expression (8.12) as: aij dj j =1 N ∑ = si for all i: i = 1, N which can be put in matrix notation as: page 99 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez A d = s (8.13) where A = { aij} , d = { dj} and s = { si} with aij = ij wb ,L and , i is s w= Since the trial functions bj are known, the matrix elements aij are all known and the problem has been reduced to solve (8.13), finding the coefficients dj of the expansion of the function u. Different choices of the trial functions and the weighting functions define the different variations by which this method is known: i) wi = bi –––––––> method of Galerkin ii) ( ) ( )i iw x x xδ= −    –––––––> Point matching method (or collocation method) This is equivalent to asking for the residual to be zero at a fixed number of points on the domain. 8.2.3 Variational Method The central idea of this method is to find a functional* (or variational expression) associated to the boundary value problem and for which the solution of the BV problem leads to a stationary value of the functional. Combining this idea with the Rayleigh-Ritz procedure we can develop a systematic solution method. Before introducing numerical techniques to solve variational formulations, let’s examine an example to illustrate the nature of the variational approach: Example: The problem is to find the path of light travel between two points. We can start using a variational approach and for this we need a variational expression related to this problem. In particular, for this case we can use Fermat’s Principle that says that ‘light travels through the quickest path’ (Note that this is not necessarily the shortest). We can formulate this statement in the form: time = min ds v(s) P1 P2 ∫ (8.14) where v(s) is the velocity point by point along the path. We are not really interested in the actual time taken to go from P1 to P2, but on the conditions that this minimum imposes on the path s(x,y) itself. That is, we want to find the actual path between these points that minimize this time. We can write for the velocity: v = c/n, where n(x,y) is the refractive index. Let’s consider first a uniform medium, that is, one with n uniform (constant). In that case, (8.14) becomes: * A functional is simply a function of a function over a complete domain which gives as a result a number. For example: 2( ) dJ φ φ Ω = Ω∫ . This expression gives a numerical value for each function φ we use in J(φ). Note that J is not a function of x . ELEC 0030 Numerical Methods page 100 © University College London 2019 – F.A. Fernandez time = n c min ds P1 P2 ∫ = n c min( path length) (8.15) The integral in this case reduces to the actual length of the path, so the above statement asks for the path of ‘minimum length’ or the shortest path between P1 to P2. Obviously, the solution in this case is the straight line joining the points. 
However, you can see that (8.14) can also be applied to an inhomogeneous medium, and the minimization process should then lead to the actual trajectory. Extending the example a little further, let's consider an interface between two media with different refractive indices. Without loss of generality, we can consider a situation like that of Fig. 8.1. From the above, we know that in each medium the path will be a straight line, but what are the coordinates of the point on the y-axis (the interface) where the two straight lines meet? Applying (8.14) gives:

time = (1/c) min [ n_1 ∫_{P1}^{P0} ds + n_2 ∫_{P0}^{P2} ds ]            (8.16)

Both integrals correspond to the respective lengths of the two branches of the total path, but we don't know the coordinate y_0 of
the point P0. We can re-write (8.16) in the form: 2 2 2 21 1 1 0 2 2 2 0 1 min ( ) ( )time n x y y n x y y c  = + − + + −   (8.17) where the only variable (unknown) is y0. To find it we need to find the minimum, and for this we do: 1 0 2 01 22 2 2 2 0 1 1 0 2 2 0 ( ) ( )( ) 0 ( ) ( ) y y y yd time n n dy x y y x y y  − − − − = = +  + − + −  (8.18) Now, from the figure we can observe that the right hand side can be written in terms of the angles of incidence and refraction as: n1 sinα1 − n2 sinα2 = 0 (8.19) as the condition the point P0 must satisfy, and we know this is right because (8.19) is the familiar Snell’s law. So, the general idea of this variational approach is to formulate the problem in a form that we look for a stationary condition (maximum, minimum, inflexion point) on some parameter which depends on the desired solution. (Like in the above problem, we looked for the time, which depends on the actual path travelled). An important property of a variational approach is that precisely because the solution function produces a stationary value of the functional, this is rather insensitive to small perturbations of the solution (approximations). This is a very desirable property, particularly for the application of numerical methods where all solutions are only approximate. To illustrate this property, let’s analyse another example: Fig. 8.1 page 101 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez Example Consider the problem of finding the natural resonant frequencies of a vibrating string (a chord in a guitar). For this problem, an appropriate variational expression is (do not worry about where this comes from, but there is a proof in the Appendix): k 2 = s.v. dy dx       2 dx a b ∫ y2 dx a b ∫ (8.20) The above expression corresponds to the k–number or resonant frequencies of a string vibrating freely and attached at the ends at a and b. For simplicity, let’s change the limits to –a and a. The first mode of oscillation has the form: y = Acos πx 2a then sin 2 2 dy A x dx a a π π = − Using this exact solution in (99): we get for the first mode (prove this): k 2 = π 2 4a2 then k = π 2a ≈ 1.571 a Now, to show how a ‘bad’ approximation of the function y can still give a rather acceptable value for k, let’s try a simple triangular shape (instead of the correct sinusoidal shape of the vibrating string):       >      −      + = 0 1 0 1 x a xA x a xA y Fig. 8.2 then / 0 / 0 A a xdy A a xdx = − > and using these values in (99), gives (prove this): k 2 = 3 a2 then k ≈ 1.732 a which is not too bad considering how coarse is the approximation to y(x). If instead of the triangular shape we try a second order (parabolic) shape: ( )( )21y a x a= − then dydx = −2 A a2 x Substituting these values in (99) gives now: k 2 = 2.5 a2 then k = 1.581 a which is a rather good approximation. Now, how can we use this method systematically? As said before, the use of the Rayleigh- Ritz procedure permits to construct a systematic numerical method. In summary, we can specify ELEC 0030 Numerical Methods page 102 © University College London 2019 – F.A. 
Fernandez the necessary steps as: BV problem: L u = s → Find variational expression J(u) Use Rayleigh-Ritz: u = d j bj j=1 N ∑ Insert in J(u) Find stationary value of J(u) = J({ dj}), that is: find the coefficients dj Reconstruct u = d j bj j=1 N ∑ u is solution of BV problem ← then Many variational formulations are derived from fundamental principles: minimisation of energy, reciprocity, conservation laws, etc. They can also be formulated in general for certain types of boundary value problems. See appendix for expressions appropriate to solve problems of the type L u = s and L u = λu and their derivation. However, we will skip here the problem of actually finding the corresponding variational expression to a boundary value problem, simply saying that there are systematic methods to find them. We will be concerned here on how to solve a problem, once we already have a variational formulation. Example For the problem of the square coaxial (or square capacitor) seen earlier, the BV problem is defined by the Laplace equation: 2 0φ∇ = with some boundary conditions (L = ∇2 , u = φ, s = 0). An appropriate functional (variational expression) for this case is: ( )2( ) dJ φ φ Ω = ∇ Ω∫ (given) (8.21) Using Rayleigh-Ritz: ( ) 2 1 ( ) , N j j j d b x y dJ φ =Ω    = ∇ Ω       ∑∫ = dj ∇bj (x, y) j=1 N ∑         2 dΩ Ω ∫ 1 1 ( ) ( , ) ( , ) N N i j i j i j d d b x y b x y dJ φ = =Ω = ∇ ⋅∇ Ω∑∑∫ (8.22) This can be written as the matrix equation: J(d) = dTAd, where d = { dj} and aij = i jb b d Ω ∇ ⋅∇ Ω∫ Now, find stationary value: 0 id J∂ = ∂ for all i: i = 1, … , N page 103 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez so, applying it to (8.22): 1 0 N ij j ji a d d J = ∂ = = ∂ ∑ , for all i: i = 1, … , N And the problem reduces to the matrix equation: A d = 0 (8.23) Solving the system of equations (8.23) we obtain the coefficients dj and the unknown function can be obtained as: u(x,y) = dj bj (x, y) j=1 N ∑ We can see that both methods, the weighted residuals and the variational method transform the BV problem into an algebraic, matrix problem. One of the first steps in the implementation of either method is the choice of appropriate expansion functions to use in the Rayleigh-Ritz procedure: basis or trial functions and weighting functions. The finite element method provides a simple form to construct these functions and implementing these methods. page 104 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez 9. FINITE ELEMENTS As the name suggests, the finite element method is based on the division of the domain of interest Ω into ‘elements’, or small pieces Ei that cover Ω completely but without intersections: They constitute a tessellation (tiling) of Ω: ; i i j i E E E= Ω ∩ = ∅  Over the subdivided domain we apply the methods discussed earlier (either weighted residuals or variational). The basis functions are defined locally in each element and because each element is small, these functions can be very simple and still constitute overall a good approximation of the desired function. In this form inside each element Ee, the wanted function u( x ) in (8.1) is represented by a local approximation ( )eu x   valid only in the element number e: Ee. 
The complete function u( x  ) over the whole domain Ω is then simply approximated by the addition of all the local pieces: ( ) ( )e e u x u x≈ ∑  An important characteristic of this method is that it is ‘exact-in-the-limit’, that is, the degree of approximation can only improve when the number of elements increase, the solution gradually and monotonically converging to the exact value. In this form, a solution can always be obtained to any degree of approximation, provided the availability of computer resources. 9.1 One dimensional problems Let’s consider first a one dimensional case with Ω as the interval [a, b]. We first divide Ω into N subintervals, not necessarily equal in size, defined by N+1 nodes xi, as shown in Fig. 9.1. The wanted function u(x) is then locally approximated in each su
binterval by a simple function, for example, a straight line: a function of the type: ax + b, defined differently in each subinterval. The total approximation ˜ u (x) is then the superposition of all these locally defined functions, a piecewise linear approximation to u(x). The amount of error in this approximation will depend on the size of each element (and consequently on the total number of elements) and more importantly, on their size in relation to the local variation of the desired function. a x2 xix3 b ˜ u(x) u(x) Fig. 9.1 Fig. 9.2 The function ˜ u (x) , approximation to u(x), is the addition of the locally defined functions ˜ u e(x) which are only nonzero in the subinterval e. Now, these local functions ˜ u e(x) can be defined as the superposition of interpolation functions Ni(x) and Ni+1(x) as shown in the Fig. 9.2. From Fig. 9.2 we can see that the function ˜ u e(x), local approximation to u(x) in element e, can be written as: ELEC 0030 Numerical Methods page 105 © University College London 2019 – F.A. Fernandez ˜ u e(x) = ui Ni (x) + ui+1Ni+1(x) (9.1) Now, if we consider the neighbouring subintervals and the associated interpolation functions as in the figure below, we can extend the definition of these functions to form the triangular shapes below: Fig. 9.3 With this definition, we can write for the function ˜ u (x) in the complete domain Ω: u(x) ≈ ˜ u (x) = ˜ u e(x) e=1 Ne ∑ = uiNi (x) i =1 Np ∑ (9.2) (Np is the number of nodes, Ne is the number of elements). The functions Ni(x) are defined as the (double-sided) interpolation functions at node i, i = 1, … , Np so Ni(x) = 1 at node i and 0 at all other nodes. This form of expanding u(x) has a very useful property: — The trial functions Ni(x), known in the FE method as shape functions, are interpolation functions, and as a consequence, the coefficients of the expansion: ui in (104), are the nodal values, that is, the values of the wanted function at the nodal points. Solving the resultant matrix equation will give these values directly. Exercise 9.1: Examine Fig. 9.3 and Fig. 9.2 and show that indeed (9.1) is valid in the element (xi, xi+1) and so (9.2) is valid over the full domain. Example Consider the boundary value problem: d 2 y dx 2 + k2 y = 0 for x ∈ [a, b] with y(a) = y(b) = 0 This corresponds for example, to the equation describing the envelope of the transverse displacement of a vibrating string attached at both ends. A suitable variational expression (corresponding to resonant frequencies) is the following, as seen in (8.20) and in the Appendix: J = k2 = dy dx       2 dx a b ∫ y2 dx a b ∫ (given) (9.3) Using now the expansion: y = yj N j (x) j =1 N ∑ (9.4) where N is the number of nodes, yj are the nodal values (unknown) and Nj(x) are the shape functions. page 106 ELEC 0030 Numerical Methods © University College London 2019 – F.A. 
Fernandez From (9.3) we have:k 2 = k2 (yj ) = Q R where: Q = d dx y j Nj (x) j=1 N ∑                 2 dx a b ∫ = yj dN j dxj=1 N ∑         2 dx a b ∫ = y j dNj dx yk dNk dxk=1 N ∑ j =1 N ∑ dx a b ∫ (9.5) and R = yj N j (x) j =1 N ∑         2 dx a b ∫ = y j Nj (x) yk Nk (x) k=1 N ∑ j=1 N ∑ dx a b ∫ (9.6) To find the stationary value, we do: dk 2 dyi = 0 for each yi, i = 1, … , N (9.7) But since k 2 = Q R , then dk 2 dyi = Q’ R − QR ‘ R2 so dk 2 dyi = 0 ⇒ Q ‘ R = QR ‘ and Q’ = Q R R ‘= k2 R ‘ so finally: dQ dyi = k2 dR dyi for all yi, i = 1, … , N (9.8) We now have to evaluate these derivatives: dQ dyi = dNi dx yk dNk dxk=1 N ∑       dx a b ∫ + yj dN j dxj=1 N ∑         dNi dx dx a b ∫ or dQ dyi = 2 dNi dx yj dNj dxj =1 N ∑         dx a b ∫ = 2 yj dNi dx dNj dx dx a b ∫         j =1 N ∑ which can be written as: dQ dyi = 2 aij yj j =1 N ∑ where aij = dNi dx dNj dx dx a b ∫ (9.9) For the second term: dR dyi = Ni(x) yj N j (x) j =1 N ∑         dx a b ∫ + yk Nk(x) k=1 N ∑       Ni(x) dx a b ∫ or dR dyi = 2 yj Ni (x)Nj (x)dx a b ∫ j =1 N ∑ = 2 bij yj j =1 N ∑ with bij = Ni (x)Nj (x)dx a b ∫ (9.10) Replacing (9.9) and (9.10) in (9.8), we can write the matrix equation: A y = k2 B y (9.11) where A = { aij} , B = { bij} and y = { yj} is the vector of nodal values. ELEC 0030 Numerical Methods page 107 © University College London 2019 – F.A. Fernandez Equation (9.11) is a matrix eigenvalue problem. The solution will give the eigenvalue k2 and the corresponding solution vector y (list of nodal values of the function y(x)). The matrix elements: aij = dNi dx dNj dx dx a b ∫ and bij = Ni Nj dx a b ∫ (9.12) The shape functions Ni and Nj are only nonzero in the vicinity of nodes i and j respectively (see Fig. 9.4). So, aij = bij = 0 if Ni and Nj do not overlap; that is, if j ≠ i–1, i, i+1. Then, the matrices A and B are tridiagonal: They will have not more than 3 elements per row. Fig. 9.4 Exercise 9.2: Define generically the triangular functions Ni, integrate (9.12) and calculate the value of the matrix elements aij and bij. 9.2 Two Dimensional Finite Elements In this case, a two dimensional region Ω is subdivided into smaller pieces or ‘elements’ and the subdivision satisfies the same properties as in the one-dimensional case; that is, the elements cover completely the region of interest and there is no intersections (no overlapping). The most common form of subdividing a 2D region is by using triangles. Quadrilateral elements are also used and they have some useful properties but by far the most common, versatile and easier to use are triangles with straight sides. There are well-developed methods to produce appropriate meshing (or subdivision) of a 2D region into triangles and they have maximum flexibility to accommodate to intricate shapes of the region of interest Ω. The process of calculation follows the same route as in the 1D case. Now, a function of two variables u(x,y) is approximated by shape functions Nj(x,y) defined as interpolation functions over one element (in this case a triangle). This approximation is given by: u(x,y) ≈ uj N j (x, y) j =1 N ∑ (9.13) where N is the number of nodes in the mesh and the coefficients uj are the nodal values (unknown). Nj(x,y) are the shape functions defined for every node of the mesh. page 108 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez Fig. 9.5 Fig. 
9.5 shows a rectangular region in the xy-plane subdivided into triangles, with the corresponding piecewise planar approximation to a function u(x,y) plotted along the vertical axis. Note that the approximation is composed of flat 'tiles' that fit exactly along the edges, so the approximation is continuous over the entire region but its derivatives are not. The approximation shown in the figure uses first order functions, that is, pieces of planes (flat tiles). Other types are also possible but require defining more nodes in each triangle. (While a plane is totally defined by 3 points, e.g. the nodal values, a second order surface will need 6 points, for example the values at the 3 vertices and at 3 mid-side points.) For a first order approximation, the function u(x,y) is approximated in each triangle by a function of the form (first order in x and y):

ũ(x, y) = p + q x + r y = (1  x  y)(p  q  r)^T            (9.14)

where p, q and r are constants with different values in each triangle. Similarly to the one dimens
ional case, this function can be written in terms of shape functions (interpolation polynomials): u(x,y) ≈ ˜ u (x,y) = u1N1(x, y) + u2 N2 (x, y) + u3 N3(x, y) (9.15) for a triangle with nodes numbered 1, 2 and 3, with coordinates (x1 , y1), (x2 , y2) and (x3 , y3). The shape functions Ni are such that Ni = 1 at node i and 0 at all the others: It can be shown that the function Ni satisfying this property is: Ni (x, y) = 1 2A ai + bix + ciy( ) (9.16) where A is the area of the triangle and a1 = x2y3 – x3y2 b1 = y2 – y3 c1 = x3 – x2 a2 = x3y1 – x1y3 b2 = y3 – y1 c2 = x1 – x3 a3 = x1y2 – x2y1 b3 = y1 – y2 c3 = x2 – x1 ELEC 0030 Numerical Methods page 109 © University College London 2019 – F.A. Fernandez (See demonstration in the Appendix.) The function N1 defined in (9.16) and shown below corresponds to the shape function (interpolation function) for the node numbered 1 in the triangle shown. Now, the same node can be a vertex of other neighbouring triangles in which we will also define a corresponding shape function for node 1 (with expressions like (9.16) but with different values of the constants a, b and c – different orientation of the plane), building up the complete shape function for node 1: N1. This is shown in the next figure for a node number i, which belongs to five triangles. Fig. 9.6 Joining all the facets of Ni, for each of the triangles that contain node i, we can refer to this function as Ni(x,y) and then, considering all the nodes of the mesh, the function u can be written as: u(x,y) = uj Nj (x, y) j=1 N ∑ for j = 1, …, N (9.17) We can now use this expansion for substitution in a variational expression or in the weighted residuals expression, to obtain the corresponding matrix problem for the expansion coefficients. An important advantage of this method is that these coefficients are precisely the nodal values of the wanted function u, so the result is obtained immediately when solving the matrix problem. An additional advantage is the sparsity of the resultant matrices. page 110 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez 9.3 Matrix Sparsity Considering the global shape function for node i, shown in the Fig. 9.6, we can see that it is zero in all other nodes of the mesh. If we now consider the global shape function corresponding to another node, say, j, we can see that products of the form Ni Nj or of derivatives of these functions, which will appear in the definition of the matrix elements (as seen in the 1–D case), will almost always be zero, except when the nodes i and j are either the same node or immediate neighbours so there is an overlap (see figure above). This implies that the corresponding matrices will be very sparse, which is very convenient in terms of computer requirements. Example: If we consider a simple mesh: The matrix sparsity pattern results: Example: For the problem of finding the potential distribution in the square coaxial, or equivalently, the temperature distribution (in steady state) between the two square section surfaces, an appropriate variational expression is: J = ∇φ( )2 dΩ Ω ∫ (9.18) and substituting (9.17): J = ∇ φ j Nj j =1 N ∑         2 dx dy Ω ∫ = φ j ∇N j j =1 N ∑         2 dx dy Ω ∫ which can be re-written as: J = φi ∇Ni i=1 N ∑ ⋅ φ j ∇Nj j=1 N ∑ dx dy Ω ∫ or ELEC 0030 Numerical Methods page 111 © University College London 2019 – F.A. 
Fernandez J = φiφ j j=1 N ∑ i =1 N ∑ ∇Ni ⋅∇Nj dx dy Ω ∫ = ΦT AΦ (9.19) where A = { aij} , aij = ∇Ni ⋅∇Nj dx dy Ω ∫ and Φ = { φj} Note that again, the coefficients aij can be calculated and the only unknowns are the nodal values φj. We now have to find the stationary value; that is: put: ∂J ∂φi = 0 for i = 1, … , N ∂J ∂φi = 0 = 2 i=1 N ∑ aij φj for i = 1, … , N then: AΦ = 0 (9.20) Then, the resultant equation is (9.20): AΦ = 0. We need to evaluate first the elements of the matrix A. For this, we can consider the integral over the complete domain Ω as the sum of the integrals over each element Ω k, k = 1, … , Ne (Ne elements in the mesh). aij = ∇Ni ⋅∇Nj dx dy Ω ∫ = ∇Ni ⋅∇Nj dx dy Ωk ∫ k=1 Ne ∑ (9.21) Before calculating these values, let’s consider the matrix sparsity; that is, let’s see which elements of A are actually nonzero. As discussed earlier (page 45), aij will only be nonzero if the nodes i and j are both in the same triangle. In this way, the sum in (9.21) will only extend to at most two triangles for each combination of i and j. Inside one triangle, the shape function Ni(x,y), defined for the node i is: Ni (x, y) = 1 2A (ai + bi x + ciy); Then: ∇Ni = ∂Ni ∂x ˆ x + ∂Ni ∂y ˆ y = 1 2A (bi ˆ x + ci ˆ y ) And then: aij = 1 4Ak 2 (bibj + cicj ) dx dy Ωk ∫ k=1 Ne ∑ = 14Ak (bibj + cicj ) k=1 Ne ∑ (9.2) where the sum will only have a few terms (for those values of k corresponding to the triangles containing nodes i and j. The values of Ak, the area of the triangle and bi, bj, ci and cj will be different for each triangle concerned. In particular, considering for example the element a4 7 corresponding to the mesh in the previous figure, the sum extends over the triangles containing nodes 4 and 7; that is triangles number 5 and 6: a4 7 = 1 4A5 b4 (5)b7 (5) + c4 (5)c7 (5)( )+ 14A6 b4 (6)b7 (6) + c4 (6)c7 (6)( ) In this case, the integral in (9.22) reduces simply to the area of the triangle and the calculations are easy. However, in other cases the integration can be complicated because they have to be done over the triangle, which is at any position and orientation in the x–y plane. For example, solving a variational expression like: page 112 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez J = k2 = (∇φ)2 dΩ∫ φ2 dΩ∫ (corresponding to (9.3) for 1D problems) (9.23) This results in an eigenvalue problem AΦ = k2BΦ where the matrix B has the elements: bij = Ni Nj dΩ Ω ∫ (9.24) The elements of A are the same as in (9.24). Exercise 9.3: Apply the Rayleigh-Ritz procedure to expression (9.23) and show that indeed the matrix elements aij and bij have the values given in (9.21) and (9.24). For the case of integrals like those in (126): bij = Ni Nj dΩ Ω ∫ = Ni Nj dΩ Ω k ∫ k=1 Ne ∑ The sum is over all triangles containing nodes i and j. For example for the element 5,6 in the previous mesh, these will be triangles 2 and 7 only. If we take triangle 7, its contribution to this element is the term: b56 = 1 4A7 (a5 + b5x + c5y)(a6 + b6 x + c6 y) dx dy Ω 7 ∫ or b56 = 1 4A7 [a5a6 + (a5b6 + a6b5)x + (a5c6 + a6c5)y + b5b6x 2 Ω 7 ∫ +(b5c6 + b6c5)xy + c5c6 y 2 ] dx dy Integrals like these need to be calculated for every pair of nodes in every triangle of the mesh. These calculations can be cumbersome is attempted directly. However, it is much simpler to use a transformation of coordinates, into a local system. This has the advantage that the integrals can be calculated for just one model triangle and then the result converted back to the x and y coordinates. 
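The gradient integrals (9.22), on the other hand, need nothing more than the coefficients b_i, c_i and the triangle areas, so that part of the assembly can be coded directly. A minimal MATLAB sketch of the assembly loop is shown below; the two-triangle mesh defined at the top is only a trivial illustration, and any node and triangle lists in the same format would do:

nodes=[0 0; 1 0; 1 1; 0 1];          % node coordinates (illustrative mesh)
tri  =[1 2 3; 1 3 4];                % triangles given by their three node numbers
Np=size(nodes,1); Ne=size(tri,1);
A=sparse(Np,Np);
for k=1:Ne
  v =tri(k,:);                       % node numbers of triangle k
  x =nodes(v,1); y=nodes(v,2);
  bk=[y(2)-y(3); y(3)-y(1); y(1)-y(2)];      % coefficients b_i of (9.16)
  ck=[x(3)-x(2); x(1)-x(3); x(2)-x(1)];      % coefficients c_i of (9.16)
  Ak=0.5*abs(bk(1)*ck(2)-bk(2)*ck(1));       % area of triangle k
  A(v,v)=A(v,v) + (bk*bk' + ck*ck')/(4*Ak);  % contributions of (9.22)
end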
For integrals that involve the shape functions themselves, such as those in (9.24), the most common system of local coordinates used is the triangle area coordinates.

9.4 Triangle Area Coordinates
For a triangle as in Fig. 9.7, any point inside can be specified by the coordinates x and y or, for example, by the coordinates ξ1 and ξ2. These coordinates are defined to have value 1 at one node and zero at the opposite side, varying linearly between these limits. For the sake of symmetry, we can define three coordinates (ξ1, ξ2, ξ3), where obviously only two are independent. A formal definition of these coordinates can be made in terms of the areas of the triangles shown in Fig. 9.7.

Fig. 9.7 Triangle area coordinates

The advantage of using this local coordinate system is that the actual shape of the triangle is not important. Any point inside is defined by its proportional distance to the sides, irrespective of the triangle shape. In this way, calculations can be made in a model triangle using this system and then mapped back to the global coordinates. If A1, A2 and A3 are the areas of the triangles formed by P and two of the vertices of the triangle, the area coordinates are defined as the ratios:

ξ_i = A_i / A

where A is the area of the triangle. Note that the triangle marked with the dotted line has the same area as A1 (same base, same height), so the line of constant ξ1 is the one marked in Fig. 9.8.

Fig. 9.8

Since A1 + A2 + A3 = A we have that ξ1 + ξ2 + ξ3 = 1, which shows their linear dependence. The area of each of these triangles, for example A1, can be calculated using (see Appendix):

A1 = (1/2) det [ 1  x   y
                 1  x2  y2
                 1  x3  y3 ]

where x and y are the coordinates of the point P. Evaluating this determinant gives:

A1 = (1/2) [ (x2 y3 - x3 y2) + (y2 - y3) x + (x3 - x2) y ]

and, using the definitions in the Appendix, A1 = (1/2)(a1 + b1 x + c1 y). Then,

ξ1 = A1/A = (1/2A)(a1 + b1 x + c1 y)   or   ξ1 = N1(x, y)            (9.25)

We have then that these coordinates vary in the same form as the shape functions, which is quite convenient for calculations. Expression (9.25) also gives us the required relationship between the local coordinates (ξ1, ξ2, ξ3) and the global coordinates (x, y); that is, the expression we need to convert (x, y) into (ξ1, ξ2, ξ3). We can also find the inverse relation, that is, (x, y) in terms of (ξ1, ξ2, ξ3). This is all we need to convert from one system of coordinates to the other. For the inverse relationship, we have from the Appendix:

(N1  N2  N3) = (ξ1  ξ2  ξ3) = (1  x  y) [ 1  x1  y1
                                           1  x2  y2
                                           1  x3  y3 ]^(-1)

from where:
9.9 So that the integral of (9.28) results: NiN j dxdy Ωk ∫ = 2 A ξiξ j dξ2dξ1 0 1−ξ1 ∫ 0 1 ∫ (9.29) Note that the above conclusion about integration limits is valid for any integral over the triangle, not only the one used above. We can now calculate these integrals. Taking first the case where i ≠ j: ELEC 0030 Numerical Methods page 115 © University College London 2019 – F.A. Fernandez a) choosing i = 1 and j = 2 (This is an arbitrary choice – you can check that any other choice, e.g. 1,3 will give the same result). Iij = 2A ξ1 ξ2 dξ2dξ1 0 1−ξ1 ∫ 0 1 ∫ = A ξ1(1− ξ1)2 dξ1 0 1 ∫ = A 1 2 − 2 3 + 1 4       = A 12 b) For i = j and choosing i = 1: Iii = 2 A ξ1 2 dξ2dξ1 0 1−ξ1 ∫ 0 1 ∫ = 2A ξ12 (1 − ξ1)dξ1 0 1 ∫ = 2A 1 3 − 1 4       = A 6 Once calculated in this form, the result can be used for any triangle irrespective of the shape and position. We can see that for this integral, only the area A will change when applied to different triangles. 9.5 Higher Order Shape Functions In most cases involving second order differential equations, the resultant weighted residuals expression or the corresponding variational expression can be written without involving second order derivatives. In these cases, first order shape functions will be fine. However, if this is not possible, other type of elements/shape functions must be used. (Note that second order derivatives of a first order function will be zero everywhere.) We can also choose to use different shape functions even if we could use first order polynomials, for example, to get higher accuracy with less elements. 9.6 Second Order Shape Functions: To define a first order polynomial in x and y (planar shape functions) we needed 3 degrees of freedom (the nodal values of u will define completely the approximation. If we use second order shape functions, we need to fit this type of surface over each triangle. To do this uniquely, we need six degrees of freedom: either the 3 nodal values and the derivatives of u at those points or the values at six points on the triangle. (This is analogous to trying to fit a second order curve to an interval – while a straight line is completely defined by 2 points, a second order curve will need 3). Choosing the easier option, we form triangles with 6 points: the 3 vertices and 3 midside points: Fig. 9.10 shows a triangle with 6 points identified by their triangle coordinates. We need to specify now the shape functions. If we take first the function N1(ξ1, ξ2, ξ3), corresponding to node 1, we know that it should be 1 there and zero at every other node. That is, it must be 0 at ξ1 = 0 (nodes 2, 3 and 5) and also at ξ1 = 1/2 (nodes 4 and 6) Then, we can simply write: N1 = ξ1 2ξ1 −1( ) (9.30) Fig. 9.10 page 116 ELEC 0030 Numerical Methods © University College London 2019 – F.A. Fernandez In the same form, we can write the corresponding shape functions for the other vertices (nodes 2 and 3). For the mid-side nodes, for example node 4, we have that N4 should be zero at all nodes except 4 where its value is 1. We can see that all other nodes are either on the side 2–3 (where ξ1 = 0) or on the side 3–1 (where ξ2 = 0). So the function N4 should be: N4 = 4ξ1ξ2 (9.31) Fig. 9.11 shows the two types of second order shape functions, one for vertices and one for mid-side points. Fig. 9.11a Fig. 9.11b Exercise 9.4: For the mesh of second order triangles of the figure below, find the corresponding sparsity pattern. 
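The integration results above, and the interpolation property of the second order shape functions (9.30)-(9.31), are easy to check numerically. The MATLAB sketch below does both; the triangle area A and the node ordering of Fig. 9.10 (vertices 1 to 3, mid-side nodes 4 to 6) are the only assumptions made here:

A=2.5;                                                    % any triangle area will do
Iij=2*A*integral2(@(x1,x2) x1.*x2, 0,1, 0,@(x1) 1-x1);    % should equal A/12
Iii=2*A*integral2(@(x1,x2) x1.^2,  0,1, 0,@(x1) 1-x1);    % should equal A/6
fprintf('Iij = %.6f  (A/12 = %.6f)\n', Iij, A/12);
fprintf('Iii = %.6f  (A/6  = %.6f)\n', Iii, A/6);
xi1=[1 0 0 1/2 0 1/2];                                    % xi_1 at the six nodes of Fig. 9.10
disp(xi1.*(2*xi1-1))                                      % N1 of (9.30): 1 at node 1, 0 elsewhere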
Exercise 9.5: Use a similar reasoning to that used to define the second order shape functions (9.30)–(9.31) to find the third order shape functions in terms of triangle area coordinates. Note that in this case there are 3 different types of functions.

APPENDIX

1. Taylor theorem

For a continuous function we have $\int_a^x f'(t)\,dt = f(x) - f(a)$, so we can write:

$$f(x) = f(a) + \int_a^x f'(t)\,dt \quad\text{or}\quad f(x) = f(a) + R_0(x) \qquad (A1.1)$$

where the remainder is $R_0(x) = \int_a^x f'(t)\,dt$. We can now integrate R0 by parts using:

$$u = f'(t) \;\Rightarrow\; du = f^{(2)}(t)\,dt, \qquad dv = dt \;\Rightarrow\; v = -(x - t)$$

giving:

$$R_0(x) = \int_a^x f'(t)\,dt = \left[-(x-t)\,f'(t)\right]_a^x + \int_a^x (x-t)\,f^{(2)}(t)\,dt$$

which gives, after solving and substituting in (A1.1):

$$f(x) = f(a) + f'(a)(x-a) + \int_a^x (x-t)\,f^{(2)}(t)\,dt \quad\text{or}\quad f(x) = f(a) + f'(a)(x-a) + R_1(x) \qquad (A1.2)$$

We can also integrate R1 by parts, using this time:

$$u = f^{(2)}(t) \;\Rightarrow\; du = f^{(3)}(t)\,dt, \qquad dv = (x-t)\,dt \;\Rightarrow\; v = -\frac{(x-t)^2}{2}$$

which gives:

$$R_1(x) = \int_a^x (x-t)\,f^{(2)}(t)\,dt = \left[-f^{(2)}(t)\,\frac{(x-t)^2}{2}\right]_a^x + \int_a^x \frac{(x-t)^2}{2}\,f^{(3)}(t)\,dt$$

and again, after substituting in (A1.2), gives:

$$f(x) = f(a) + f'(a)(x-a) + \frac{f^{(2)}(a)}{2}(x-a)^2 + \int_a^x \frac{(x-t)^2}{2}\,f^{(3)}(t)\,dt$$

Proceeding in this way we get the expansion:

$$f(x) = f(a) + f'(a)(x-a) + \frac{f^{(2)}(a)}{2}(x-a)^2 + \frac{f^{(3)}(a)}{3!}(x-a)^3 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n + R_n \qquad (A1.3)$$

where the remainder can be written as:

$$R_n(x) = \int_a^x \frac{(x-t)^n}{n!}\,f^{(n+1)}(t)\,dt \qquad (A1.4)$$

To find a more useful form for the remainder we need to invoke some general mathematical theorems.

First Theorem for the Mean Value of Integrals
If the function g(t) is continuous and integrable in the interval [a, x], then there exists a point ξ between a and x such that:

$$\int_a^x g(t)\,dt = g(\xi)\,(x-a)$$

This says simply that the integral can be represented by an average value of the function, g(ξ), times the length of the interval. This average value must lie between the minimum and the maximum values of the function in that interval, and if the function is continuous there will be a point ξ for which the function takes this value.

And in a more complex form:

Second Theorem for the Mean Value of Integrals
Now, if the functions g and h are continuous and integrable in the interval and h does not change sign in the interval, there exists a point ξ between a and x such that:

$$\int_a^x g(t)\,h(t)\,dt = g(\xi)\int_a^x h(t)\,dt$$

If we now use this second theorem on expression (A1.4), with $g(t) = f^{(n+1)}(t)$ and $h(t) = \dfrac{(x-t)^n}{n!}$, we get:

$$R_n(x) = f^{(n+1)}(\xi)\int_a^x \frac{(x-t)^n}{n!}\,dt$$

which can be integrated, giving:

$$R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x-a)^{n+1} \qquad (A1.5)$$

for the remainder of the Taylor expansion, where ξ is a point between a and x.

2. Implementation of Gaussian Elimination

Referring to the steps (5.14)–(5.16) to eliminate the non-zero entries in the lower triangle of the matrix, we can see that (in step (5.15)), to eliminate the unknown x1 from equation i (i = 2, …, N) – that is, to eliminate the matrix elements ai1 – we need to subtract from equation i equation 1 times ai1/a11. With this, all the elements of column 1 except a11 will be zero. After that, to remove x2 from equations 3 to N (or elements ai2, i = 3, …, N, as in (5.16)), we need to scale the new equation 2 by the factor ai2/a22 and subtract it from row i, i = 3, …, N. The pattern now repeats in the same form, so the steps for triangularizing matrix A (keeping just the upper triangle) can be implemented by the following code (written here in Matlab)¹:

for k=1:n-1
  v(k+1:n)=A(k+1:n,k)/A(k,k);    % find multipliers
  for i=k+1:n
    A(i,k+1:n)=A(i,k+1:n)-v(i)*A(k,k+1:n);
  end
end

In fact, this can be simplified by eliminating the second loop, noting that all operations on rows can be performed simultaneously.
The simpler version of the code is then:

for k=1:n-1
  v(k+1:n)=A(k+1:n,k)/A(k,k);    % find multipliers
  A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-v(k+1:n)*A(k,k+1:n);
end
U=triu(A);    % this function simply puts zeros in the lower triangle

The factorization is completed by calculating the lower triangular matrix L. The complete procedure can be implemented as follows:

function [L,U] = GE(A)
%
% Computes the LU factorization of matrix A
% input:  matrix A
% output: matrices L and U
%
[n,n]=size(A);
for k=1:n-1
  A(k+1:n,k)=A(k+1:n,k)/A(k,k);    % find multipliers
  A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-A(k+1:n,k)*A(k,k+1:n);
end
L=eye(n,n) + tril(A,-1);
U=triu(A);

This function can be used to find the LU factors of a matrix A using dense storage. The function eye(n,n) returns the identity matrix of order n, and tril(A,-1) gives a lower triangular matrix with the elements of A in the lower triangle, excluding the diagonal, and setting all others to zero (lij = aij if j ≤ i–1, and 0 otherwise).

¹ Note that if the problem is actually solved in Matlab there is no need to perform all the steps described here. The command x=A\b will give the solution of Ax=b using LU factorization if there is not a simpler method.

The solution of a linear system of equations Ax = b is completed (after the LU factorization of the matrix A is performed) with a forward substitution using matrix L, followed by a backward substitution using the matrix U. Forward substitution consists of finding the first unknown x1 directly as b1/l11, substituting this value in the second equation to find x2, and so on. This can be implemented by:

function x = LTriSol(L,b)
%
% Solves the triangular system Lx = b by forward substitution
%
n=length(b);
x=zeros(n,1);    % a vector of zeros to start
for j=1:n-1
  x(j)=b(j)/L(j,j);
  b(j+1:n)=b(j+1:n)-x(j)*L(j+1:n,j);
end
x(n)=b(n)/L(n,n);

Backward substitution can be implemented in a similar form, this time finding the unknowns from the end upwards:

function x = UTriSol(U,b)
%
% Solves the triangular system Ux = b by backward substitution
%
n=length(b);
x=zeros(n,1);
for j=n:-1:2    % from n down to 2, one by one
  x(j)=b(j)/U(j,j);
  b(1:j-1)=b(1:j-1)-x(j)*U(1:j-1,j);
end
x(1)=b(1)/U(1,1);

With these functions the solution of the system of equations Ax = b can be performed in three steps by the code:

[L,U] = GE(A);
y = LTriSol(L,b);
x = UTriSol(U,y);

Exercise
Use the functions GE, LTriSol and UTriSol to solve the system of equations generated by the finite difference modelling of the square coaxial structure, given by equation (7.30). You will have to complete first the matrix A, given only schematically in (7.30), after applying the boundary conditions. Note that because of the geometry of the structure, not all rows will have the same pattern. To input the matrix it might be useful to start with the Matlab command:

A = triu(tril(ones(n,n),1),-1)-5*eye(n,n)

This will generate a tridiagonal matrix of order n with –4 in the main diagonal and 1 in the two subdiagonals. After that you will have to adjust any differences between this matrix and A. Compare the results with those obtained by Gauss-Seidel.

3. Solution of Finite Difference Equations by Gauss-Seidel

Equation (7.30) is the matrix equation derived by applying the finite differences method to the Laplace equation.
Solving this system of equations by the method of Gauss-Seidel, as shown in (5.18) and (5.19), consists of taking the original simultaneous equations, putting all diagonal matrix terms on the left-hand side as in (5.19), and effectively putting them in a DO loop (after initializing the 'starting vector'). Equation (7.25) is the typical equation, and putting the diagonal term on the left-hand side gives:

$$\phi_O = \frac{1}{4}\left(\phi_N + \phi_S + \phi_E + \phi_W\right)$$

We could write 56 lines of code (one per equation) or, even simpler, use subscripted variables inside a loop. In this case the elements of A are all either 0, +1 or –4, and they are easily "generated during the algorithm" rather than actually stored in an array. This simplifies the computer program: instead of A, the only array needed holds the current value of the vector elements, x^T = (φ1, φ2, …, φ56).

The program can be simplified further by keeping the vector of 56 unknowns x in a 2-D array z(11,11), to be identified spatially with the 2-D Cartesian coordinates of the physical problem (see figure). For example, z(3,2) stores the value of φ10. There is obviously scope to improve efficiency, since with this arrangement we store values corresponding to nodes with known, fixed voltage values, including all those nodes inside the inner conductor. None of these actually need to be stored, but doing so makes the program simpler. In this case, the program (in old Fortran 77) can be as simple as:

c Gauss-Seidel to solve Laplace between square inner & outer
c conductors. Note that this applies to any potential problem,
c including potential (voltage) distribution or steady state
c current flow or heat flow.
c
      dimension z(11,11)
      data z/121*0./
      do i=4,8
        do j=4,8
          z(i,j)=1.
        enddo
      enddo
c
      do n=1,30
        do i=2,10
          do j=2,10
            if(z(i,j).lt.1.) z(i,j)=.25*(z(i-1,j)+z(i+1,j)+
     $        z(i,j-1)+z(i,j+1))
          enddo
        enddo
c       in each iteration print values in the 3rd row
        write(*,6) n,(z(3,j),j=1,11)
      enddo
c
      write(*,7) z
    6 format(1x,i2,11f7.4)
    7 format('FINAL RESULTS=',//(/1x,11f7.4))
      stop
      end

In the program, z(i,j) is first initialized with zeros, then the values corresponding to the inner conductor are set to one (1 V). After this, the iterations start (to a maximum of 30) and the Gauss-Seidel equations are solved. In order to check the convergence, the values of the potentials in one intermediate row (the third) are printed after every iteration. We can see that after about 25 iterations there are no further changes (within 4 decimals). Naturally, a more efficient monitoring of convergence can be implemented, whereby the changes are monitored either on a point-by-point basis or as the norm of the difference, and the iterations are stopped when this value is within a prefixed precision.
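A minimal Matlab sketch of such a convergence test (an illustration, not part of the original program): the grid, boundary values and update are the same as in the Fortran code above, and the tolerance value is an arbitrary choice.

% Gauss-Seidel for the square coaxial problem with a norm-based stopping test.
z = zeros(11,11);
z(4:8,4:8) = 1;                      % inner conductor fixed at 1 V
tol = 1e-6;                          % assumed precision
for n = 1:500                        % safety limit on the number of iterations
    zold = z;
    for i = 2:10
        for j = 2:10
            if z(i,j) < 1            % skip the fixed (inner conductor) nodes
                z(i,j) = 0.25*(z(i-1,j) + z(i+1,j) + z(i,j-1) + z(i,j+1));
            end
        end
    end
    if norm(z - zold,'fro') < tol    % stop when the change is small enough
        fprintf('Converged after %d iterations\n', n);
        break
    end
end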
The results are: 1 0.0000 0.0000 0.0000 0.2500 0.3125 0.3281 0.3320 0.3330 0.0833 0.0208 0.0000 2 0.0000 0.0000 0.1250 0.3750 0.4492 0.4717 0.4785 0.4181 0.1895 0.0699 0.0000 3 0.0000 0.0469 0.2109 0.4473 0.5225 0.5472 0.5399 0.4737 0.2579 0.1098 0.0000 4 0.0000 0.0908 0.2690 0.4922 0.5653 0.5865 0.5742 0.5082 0.3016 0.1369 0.0000 5 0.0000 0.1229 0.3076 0.5205 0.5902 0.6084 0.5944 0.5299 0.3293 0.1542 0.0000 6 0.0000 0.1446 0.3326 0.5381 0.6047 0.6211 0.6067 0.5435 0.3465 0.1650 0.0000 7 0.0000 0.1586 0.3484 0.5488 0.6133 0.6288 0.6143 0.5519 0.3572 0.1716 0.0000 8 0.0000 0.1675 0.3582 0.5553 0.6184 0.6334 0.6190 0.5572 0.3637 0.1756 0.0000 9 0.0000 0.1729 0.3641 0.5592 0.6215 0.6362 0.6218 0.5604 0.3676 0.1780 0.0000 10 0.0000 0.1763 0.3678 0.5616 0.6234 0.6379 0.6236 0.5624 0.3700 0.1794 0.0000 11 0.0000 0.1783 0.3700 0.5631 0.6245 0.6390 0.6247 0.5636 0.3713 0.1802 0.0000 12 0.0000 0.1795 0.3713 0.5639 0.6253 0.6396 0.6254 0.5643 0.3722 0.1807 0.0000 13 0.0000 0.1802 0.3721 0.5645 0.6257 0.6400 0.6258 0.5647 0.3727 0.1810 0.0000 14 0.0000 0.1807 0.3726 0.5648 0.6259 0.6403 0.6260 0.5649 0.3729 0.1812 0.0000 15 0.0000 0.1810 0.3729 0.5650 0.6261 0.6404 0.6261 0.5651 0.3731 0.1813 0.0000 16 0.0000 0.1811 0.3731 0.5651 0.6262 0.6405 0.6262 0.5652 0.3732 0.1813 0.0000 17 0.0000 0.1812 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1813 0.0000 18 0.0000 0.1813 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1814 0.0000 19 0.0000 0.1813 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000 20 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000 21 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000 22 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000 23 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 24 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 25 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 26 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 27 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 28 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 29 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 30 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 FINAL RESULTS= 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000 page A8 ELEC 3030 Numerical Methods Appendix © University College London 2019 – A. Fernandez 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000 0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000 0.0000 0.3099 0.6406 1.0000 1.0000 1.0000 1.0000 1.0000 0.6406 0.3099 0.0000 0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000 0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000 0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ELEC 3030 Numerical Methods Appendix page A9 © University College London 2019 – A. Fernandez 4. 
Matrix–vector products

Some expressions that often need to be evaluated are the products between a matrix and vectors. For example, if the matrix A is defined with elements aij, that is A = {aij}, and the vector b = {bj}, the product Ab = c is the vector c = {ci} where:

$$c_i = \sum_{j=1}^{N} a_{ij}\,b_j$$

that is,

$$\begin{pmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}\begin{pmatrix} b_1 \\ \vdots \\ b_N \end{pmatrix} = \begin{pmatrix} \sum_j a_{1j} b_j \\ \vdots \\ \sum_j a_{Nj} b_j \end{pmatrix} = \begin{pmatrix} c_1 \\ \vdots \\ c_N \end{pmatrix}$$

The quadratic form (scalar) $s = \mathbf{x}^T\mathbf{A}\mathbf{x}$, for A = {aij} and x = {xj}, also appears frequently. We can call the product Ax by c, calculated as before, so:

$$\mathbf{c} = \{c_i\} = \left\{\sum_{j=1}^{N} a_{ij}\,x_j\right\}$$

Pre-multiplying the vector c by x^T gives:

$$s = \mathbf{x}^T\mathbf{c} = \sum_{i=1}^{N} x_i\,c_i \qquad\text{(equivalent to the dot product)}$$

and replacing the expression for c:

$$s = \sum_{i=1}^{N} x_i \sum_{j=1}^{N} a_{ij}\,x_j = \sum_{i=1}^{N}\sum_{j=1}^{N} x_i\,a_{ij}\,x_j$$

5. Error residual in gradient methods

In the solution of matrix problems using gradient methods it is convenient to minimise the functional (5.24):

$$h^2 = \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{x}^T\mathbf{y} \qquad (A5.1)$$

instead of the error residual. Here we demonstrate that minimising (A5.1) is equivalent to solving the system of equations Ax = y if A is symmetric and positive definite.

Consider the change of h² when x varies along the line x + αp, for a real α and a vector p of fixed direction. First, we expand h²(x + αp):

$$h^2(\mathbf{x}+\alpha\mathbf{p}) = \tfrac{1}{2}(\mathbf{x}+\alpha\mathbf{p})^T\mathbf{A}(\mathbf{x}+\alpha\mathbf{p}) - (\mathbf{x}+\alpha\mathbf{p})^T\mathbf{y}$$

$$h^2(\mathbf{x}+\alpha\mathbf{p}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \tfrac{\alpha}{2}\mathbf{x}^T\mathbf{A}\mathbf{p} + \tfrac{\alpha}{2}\mathbf{p}^T\mathbf{A}\mathbf{x} + \tfrac{\alpha^2}{2}\mathbf{p}^T\mathbf{A}\mathbf{p} - \mathbf{x}^T\mathbf{y} - \alpha\,\mathbf{p}^T\mathbf{y}$$

If A is symmetric ($\mathbf{A}^T = \mathbf{A}$), then $\mathbf{x}^T\mathbf{A}\mathbf{p} = \mathbf{p}^T\mathbf{A}\mathbf{x}$ and we can write:

$$h^2(\mathbf{x}+\alpha\mathbf{p}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{x}^T\mathbf{y} + \alpha\,\mathbf{p}^T(\mathbf{A}\mathbf{x}-\mathbf{y}) + \tfrac{\alpha^2}{2}\mathbf{p}^T\mathbf{A}\mathbf{p}$$

The first terms on the right hand side correspond to h²(x):

$$h^2(\mathbf{x}+\alpha\mathbf{p}) = h^2(\mathbf{x}) + \alpha\,\mathbf{p}^T(\mathbf{A}\mathbf{x}-\mathbf{y}) + \tfrac{\alpha^2}{2}\mathbf{p}^T\mathbf{A}\mathbf{p}$$

where the last term is always positive. We can see that h² is a quadratic function of α with a positive leading coefficient and, as such, it will have a minimum along the line x + αp. The value of α that gives the minimum is found from:

$$\frac{d}{d\alpha}\,h^2(\mathbf{x}+\alpha\mathbf{p}) = 0$$

and is:

$$\hat{\alpha} = \frac{\mathbf{p}^T(\mathbf{y}-\mathbf{A}\mathbf{x})}{\mathbf{p}^T\mathbf{A}\mathbf{p}} \qquad\text{(see Exercise 5.1)}$$

The minimum value of h² is then:

$$h^2(\mathbf{x}+\hat{\alpha}\mathbf{p}) = h^2(\mathbf{x}) - \frac{\left[\mathbf{p}^T(\mathbf{y}-\mathbf{A}\mathbf{x})\right]^2}{2\,\mathbf{p}^T\mathbf{A}\mathbf{p}}$$

where the last term is positive. This means that h²(x + α̂p) < h²(x) provided that term is not zero; that is, provided p is not orthogonal to the residual r = y − Ax. Let's consider the two possibilities:

i) x is such that Ax = y. Then h²(x + αp) ≥ h²(x) for any p and α, and h²(x) is the minimum value.

ii) x is such that Ax ≠ y. Then there is some direction p for which p^T(y − Ax) ≠ 0, so h²(x + α̂p) < h²(x) and h²(x) is not the minimum.

In consequence, minimising h²(x) is equivalent to finding the solution of Ax = y.

6. Derivation of Second Order Runge-Kutta

The second order Taylor expansion:

$$y_{i+1} = y_i + h\,f(t_i,y_i) + \frac{h^2}{2}\,f'(t_i,y_i) + O(h^3) \qquad (A6.1)$$

can be written as:

$$y_{i+1} = y_i + h\,f(t_i,y_i) + \frac{h^2}{2}\left[\frac{\partial f}{\partial t} + f(t_i,y_i)\,\frac{\partial f}{\partial y}\right]_{t_i,y_i} + O(h^3) \qquad (A6.2)$$

Rewriting this in the form:

$$y_{i+1} = y_i + h\,(c_0 k_1 + c_1 k_2) \qquad (A6.3)$$

where $k_1 = f(t_i,y_i)$ and $k_2 = f\big(t_i + ph,\; y_i + qh\,f(t_i,y_i)\big)$, gives:

$$y_{i+1} = y_i + h\,c_0\,f(t_i,y_i) + h\,c_1\,f\big(t_i + ph,\; y_i + qh\,f(t_i,y_i)\big) \qquad (A6.4)$$

Now, the function f with variations in t and y can be represented as a first order Taylor expansion in the two variables:

$$f(t_i + \Delta t,\; y_i + \Delta y) = f(t_i,y_i) + \Delta t\,\frac{\partial f}{\partial t}\bigg|_{t_i,y_i} + \Delta y\,\frac{\partial f}{\partial y}\bigg|_{t_i,y_i} + O(h^2) \qquad (A6.5)$$

Here the variations are $\Delta t = ph$ and $\Delta y = qh\,f(t_i,y_i) = q\,k_1\,h$, so (A6.5) becomes:

$$f(t_i + ph,\; y_i + q k_1 h) = f(t_i,y_i) + ph\,\frac{\partial f}{\partial t}\bigg|_{t_i,y_i} + q k_1 h\,\frac{\partial f}{\partial y}\bigg|_{t_i,y_i} + O(h^2) \qquad (A6.6)$$

Replacing (A6.6) in (A6.4) leads to:

$$y_{i+1} = y_i + h\,c_0\,f(t_i,y_i) + h\,c_1\left[f(t_i,y_i) + ph\,\frac{\partial f}{\partial t}\bigg|_{t_i,y_i} + q k_1 h\,\frac{\partial f}{\partial y}\bigg|_{t_i,y_i}\right] + O(h^3) \qquad (A6.7)$$

which has the same form as (A6.2). Equating terms between (A6.2) and (A6.7) gives the equations for the unknown coefficients in (A6.3):

$$c_0 + c_1 = 1, \qquad c_1\,p = \tfrac{1}{2}, \qquad c_1\,q = \tfrac{1}{2} \qquad (A6.8)$$

There are three equations and four coefficients, so one of them can be chosen arbitrarily.

7. Derivation of a variational formulation

For a boundary value problem specified by the equation:

$$\mathcal{L}u = s \qquad (A7.1)$$

where the operator $\mathcal{L}$ is self-adjoint†, the following expression is a corresponding variational functional:

$$J = \langle \mathcal{L}u,\,u\rangle - 2\,\langle s,\,u\rangle \qquad (A7.2)$$

Demonstration: Allowing a variation or perturbation δu of the function u, we get a corresponding variation of J, leading to:

$$J + \delta J = \langle \mathcal{L}(u+\delta u),\,u+\delta u\rangle - 2\,\langle s,\,u+\delta u\rangle$$

Expanding the right hand side and keeping only first order perturbations (discarding terms of higher order in δu):

$$J + \delta J = \langle \mathcal{L}u,\,u\rangle + \langle \mathcal{L}u,\,\delta u\rangle + \langle \mathcal{L}\,\delta u,\,u\rangle - 2\,\langle s,\,u\rangle - 2\,\langle s,\,\delta u\rangle$$

and using (A7.2) leads to:

$$\delta J = \langle \mathcal{L}u,\,\delta u\rangle + \langle \mathcal{L}\,\delta u,\,u\rangle - 2\,\langle s,\,\delta u\rangle$$

Now, since $\mathcal{L}$ is self-adjoint, $\langle \mathcal{L}\,\delta u,\,u\rangle = \langle \mathcal{L}u,\,\delta u\rangle$, so finally:

$$\delta J = 2\,\langle \mathcal{L}u,\,\delta u\rangle - 2\,\langle s,\,\delta u\rangle = 2\,\langle \mathcal{L}u - s,\,\delta u\rangle \qquad (A7.3)$$

Seeking the stationary condition:

$$\delta J = 2\,\langle \mathcal{L}u - s,\,\delta u\rangle = 0$$

and since δu is arbitrary, this implies that equation (A7.1) is satisfied. That is, finding the function u that renders the functional J stationary is equivalent to finding the solution of (A7.1).
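As a short worked illustration (not part of the original notes), assume the operator $\mathcal{L}u = -\,d^2u/dx^2$ on an interval $[a,b]$, with the inner product $\langle u, v\rangle = \int_a^b u\,v\,dx$ and with u = 0 imposed at both ends. Then the functional (A7.2) becomes, after integrating the first term by parts,

$$J = \int_a^b \left(-\frac{d^2u}{dx^2}\right) u\,dx - 2\int_a^b s\,u\,dx = \int_a^b \left[\left(\frac{du}{dx}\right)^2 - 2\,s\,u\right] dx$$

and making this J stationary reproduces the equation $-\,d^2u/dx^2 = s$, as expected. The integration by parts generates a boundary term $\left[-u\,\dfrac{du}{dx}\right]_a^b$, which is zero here because u vanishes at the ends.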
If the condition for self-adjointness of the operator $\mathcal{L}$ includes some boundary terms, these will also appear in (A7.3) and will imply that, in addition to solving (A7.1), the function u will satisfy some boundary condition related to these boundary terms.

†An operator $\mathcal{L}$ is said to be self-adjoint if it satisfies the condition: $\langle \mathcal{L}u,\,v\rangle = \langle u,\,\mathcal{L}v\rangle + \text{boundary terms}$.

8. A variational formulation for the wave equation

Equation (8.20) shows the use of the variational method to find a solution to the one-dimensional wave equation. In this section it will be shown that finding the stationary value of the functional (8.20) is equivalent to solving the differential equation. Equation (8.20) states:

$$k^2 = \text{s.v.}\;\frac{\displaystyle\int_a^b \left(\frac{dy}{dx}\right)^2 dx}{\displaystyle\int_a^b y^2\,dx} \qquad (A8.1)$$

Let's examine the variation of k² produced by a small perturbation δy of the solution y:

$$k^2(y+\delta y) = k^2 + \delta k^2 = \frac{\displaystyle\int_a^b \left(\frac{d(y+\delta y)}{dx}\right)^2 dx}{\displaystyle\int_a^b (y+\delta y)^2\,dx}$$

and re-writing:

$$\left(k^2 + \delta k^2\right)\int_a^b (y+\delta y)^2\,dx = \int_a^b \left(\frac{d(y+\delta y)}{dx}\right)^2 dx \qquad (A8.2)$$

Expanding and neglecting higher order variations:

$$\left(k^2 + \delta k^2\right)\int_a^b \left(y^2 + 2y\,\delta y\right)dx = \int_a^b \left[\left(\frac{dy}{dx}\right)^2 + 2\,\frac{dy}{dx}\,\frac{d\,\delta y}{dx}\right] dx$$

or

$$k^2\int_a^b y^2\,dx + 2k^2\int_a^b y\,\delta y\,dx + \delta k^2\int_a^b y^2\,dx = \int_a^b \left(\frac{dy}{dx}\right)^2 dx + 2\int_a^b \frac{dy}{dx}\,\frac{d\,\delta y}{dx}\,dx \qquad (A8.3)$$

and now using (A8.1):

$$2k^2\int_a^b y\,\delta y\,dx + \delta k^2\int_a^b y^2\,dx = 2\int_a^b \frac{dy}{dx}\,\frac{d\,\delta y}{dx}\,dx \qquad (A8.4)$$

Now, since we want k² to be stationary about the solution function y, we set δk² = 0 and examine what conditions this imposes on the function y. δk² = 0 implies:

$$k^2\int_a^b y\,\delta y\,dx = \int_a^b \frac{dy}{dx}\,\frac{d\,\delta y}{dx}\,dx \qquad (A8.5)$$

Integrating the RHS by parts:

$$k^2\int_a^b y\,\delta y\,dx = \left[\frac{dy}{dx}\,\delta y\right]_a^b - \int_a^b \delta y\,\frac{d^2 y}{dx^2}\,dx$$

or, re-arranging:

$$\int_a^b \delta y\left(\frac{d^2 y}{dx^2} + k^2 y\right) dx = \left[\frac{dy}{dx}\,\delta y\right]_a^b \qquad (A8.6)$$

Since δy is arbitrary, (A8.6) can only be valid if both sides are zero. That means that the function y that makes the functional stationary must satisfy the differential equation:

$$\frac{d^2 y}{dx^2} + k^2 y = 0 \qquad (A8.7)$$

Similarly, the boundary condition

$$\frac{dy}{dx} = 0 \quad\text{at } a \text{ and } b \qquad (A8.8)$$

is satisfied, since δy is arbitrary. The equation (A8.7), resulting from forcing the functional to be stationary, is called the Euler equation of this functional, and the boundary condition (A8.8) is the natural boundary condition.

Summarizing, we can see that imposing the stationary condition on (A8.1) with respect to small variations of the function y leads to y satisfying the differential equation (A8.7), which is the wave equation, and the boundary conditions (A8.8); that is, zero normal derivative (Neumann B.C.). If a different boundary condition is desired, it has to be imposed directly. For the Dirichlet boundary condition, that is, fixed values of y at the ends, fixing these values on the boundary implies that y is not allowed to vary there, and then δy = 0 on the boundary.
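A quick numerical illustration of (A8.1) (a sketch, not part of the original notes): for $y = \cos(n\pi x/L)$ on the interval $[0, L]$, which already satisfies the natural boundary conditions (A8.8), the quotient should return $k^2 = (n\pi/L)^2$. The values of L and n below are arbitrary.

% Evaluate the Rayleigh quotient (A8.1) for a trial function that satisfies
% the Neumann conditions y'(0) = y'(L) = 0.  L and n are arbitrary choices.
L  = 2.0;  n = 3;
y  = @(x) cos(n*pi*x/L);
dy = @(x) -(n*pi/L)*sin(n*pi*x/L);
k2 = integral(@(x) dy(x).^2, 0, L) / integral(@(x) y(x).^2, 0, L);
fprintf('k^2 = %.6f, exact (n*pi/L)^2 = %.6f\n', k2, (n*pi/L)^2);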
9. Area of a Triangle

For a triangle with nodes 1, 2 and 3 with coordinates (x1, y1), (x2, y2) and (x3, y3), the area of the triangle is:

A = Area of rectangle − area(A) − area(B) − area(C)

$$\text{Area of rectangle} = (x_2 - x_1)(y_3 - y_2) = x_2 y_3 + x_1 y_2 - x_1 y_3 - x_2 y_2$$
$$\text{Area (A)} = \tfrac{1}{2}(x_2 - x_3)(y_3 - y_2) = \tfrac{1}{2}\left(x_2 y_3 + x_3 y_2 - x_2 y_2 - x_3 y_3\right)$$
$$\text{Area (B)} = \tfrac{1}{2}(x_3 - x_1)(y_3 - y_1) = \tfrac{1}{2}\left(x_3 y_3 - x_3 y_1 - x_1 y_3 + x_1 y_1\right)$$
$$\text{Area (C)} = \tfrac{1}{2}(x_2 - x_1)(y_1 - y_2) = \tfrac{1}{2}\left(x_2 y_1 - x_2 y_2 - x_1 y_1 + x_1 y_2\right)$$

Then, the area of the triangle is:

$$A = \tfrac{1}{2}\left[(x_2 y_3 - x_3 y_2) + (x_3 y_1 - x_1 y_3) + (x_1 y_2 - x_2 y_1)\right] \qquad (A9.1)$$

which can be written as:

$$A = \tfrac{1}{2}\det\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix} \qquad (A9.2)$$

10. Shape Functions (Interpolation Functions)

The function u is approximated in each triangle by a first order function (a plane). This is given by an expression of the form p + qx + ry, which can also be written as a vector product (equation 9.14):

$$\tilde{u}(x,y) = p + qx + ry = (1\;\; x\;\; y)\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (A10.1)$$

Evaluating this expression at each node of a triangle (with nodes numbered 1, 2 and 3):

$$\begin{aligned} u_1 &= p + qx_1 + ry_1 \\ u_2 &= p + qx_2 + ry_2 \\ u_3 &= p + qx_3 + ry_3 \end{aligned} \quad\Rightarrow\quad \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (A10.2)$$

And from here we can calculate the values of the constants p, q and r in terms of the nodal values and the coordinates of the nodes:

$$\begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A10.3)$$

Replacing (A10.3) in (A10.1):

$$\tilde{u}(x,y) = (1\;\; x\;\; y)\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A10.4)$$

and comparing with (9.15), written as a vector product:

$$u(x,y) \approx \tilde{u}(x,y) = u_1 N_1(x,y) + u_2 N_2(x,y) + u_3 N_3(x,y) = (N_1\;\; N_2\;\; N_3)\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A10.5)$$

we have finally:

$$(N_1\;\; N_2\;\; N_3) = (1\;\; x\;\; y)\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1} \qquad (A10.6)$$

Solving the right hand side (inverting the matrix and multiplying) gives the expression for each shape function (9.16):

$$N_i(x,y) = \frac{1}{2A}\left(a_i + b_i x + c_i y\right) \qquad (A10.7)$$

where A is the area of the triangle and

a1 = x2y3 – x3y2   b1 = y2 – y3   c1 = x3 – x2
a2 = x3y1 – x1y3   b2 = y3 – y1   c2 = x1 – x3        (A10.8)
a3 = x1y2 – x2y1   b3 = y1 – y2   c3 = x2 – x1

Note that from (A9.1) and the definitions (A10.8), the area of the triangle can be written as:

$$A = \tfrac{1}{2}\left(a_1 + a_2 + a_3\right) \qquad (A10.9)$$
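Finally, a short Matlab sketch (an illustration, not part of the original notes) that evaluates (A10.6) directly for an arbitrary triangle. The vertex coordinates and the evaluation point are made-up values; the three shape functions should add up to 1 at any point, in agreement with ξ1 + ξ2 + ξ3 = 1, and each Ni should equal 1 at its own vertex and 0 at the other two.

% First order shape functions of a triangle, evaluated from (A10.6)-(A10.7).
P  = [0 0; 2 0; 0.5 1.5];        % rows: (x1,y1), (x2,y2), (x3,y3) - arbitrary example
C  = [ones(3,1) P];              % the matrix with rows (1 xi yi)
A  = 0.5*det(C);                 % triangle area, eq. (A9.2)
xy = [0.7 0.4];                  % point where the functions are evaluated
N  = [1 xy] / C;                 % (N1 N2 N3) = (1 x y)*inv(C), eq. (A10.6)
fprintf('A = %.4f, N = [%.4f %.4f %.4f], sum(N) = %.4f\n', A, N, sum(N));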
