ECM3420 (with Answers)

UNIVERSITY OF EXETER
COLLEGE OF ENGINEERING, MATHEMATICS AND PHYSICAL SCIENCES
COMPUTER SCIENCE

Examination, May 2019
Learning from Data
Module Leader: Dr. Lorenzo Livi
Duration: TWO HOURS

Answer question 1, and two out of the other four questions. Question 1 is worth 40 marks; the other questions are worth 30 marks each. This assessment will contribute 60% to the final mark for this module. No electronic calculators of any sort are to be used during the course of this examination. This is a CLOSED BOOK examination.

SECTION A (Compulsory Question)

Question 1

(a) What is the difference between supervised and unsupervised learning? Provide an example of each. What is meant by the term semi-supervised learning? (8 marks)

In supervised learning, information about the target/desired output is available and exploited during training; in unsupervised learning this information is not available. [3] Examples: linear regression (supervised) and clustering (unsupervised). [2] In semi-supervised learning a small amount of labelled training data is available, together with (usually) a large quantity of unlabelled data. Learning can take advantage of the structure implicit in the unlabelled data as well as of the labels. [3]

(b) Briefly explain what is meant by the kernel trick and how it may be used to obtain a nonlinear classifier from a linear one. Name and provide expressions for at least two (positive definite) kernel functions. (10 marks)

The kernel trick refers to the fact that inner products $\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$ [2] between vectors $\mathbf{x}$ and $\mathbf{y}$ mapped to a (usually high-dimensional) feature space by $\phi(\cdot)$ can be computed via a kernel $k(\mathbf{x}, \mathbf{y})$ in the original space, provided the kernel is a Mercer kernel [3]. If a linear classifier can be computed entirely in terms of inner products, then classification can be performed after nonlinearly mapping the data to a high-dimensional space via $\phi$, while all computations remain in the original space. This is the basis of support vector machines and kernel logistic regression. [3]

Two well-known (positive definite) kernel functions are [2]:

• Polynomial: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + v)^p$, with $v \ge 0$, $p \ge 0$;
• RBF: $k(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \, \|\mathbf{x} - \mathbf{z}\|_2^2)$, known as the Gaussian RBF when $\gamma = \frac{1}{2\sigma^2}$.
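As an editorial illustration (not part of the model answer), the two kernels above may be implemented as follows; the parameter values in the usage example are arbitrary:

```python
import numpy as np

def polynomial_kernel(x, z, v=1.0, p=2):
    """Polynomial kernel: k(x, z) = (x . z + v)^p, with v >= 0, p >= 0."""
    return (np.dot(x, z) + v) ** p

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: k(x, z) = exp(-gamma * ||x - z||_2^2)."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(diff, diff))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z))        # (x . z + 1)^2
print(rbf_kernel(x, z, gamma=0.5))    # Gaussian RBF with sigma = 1
```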
(c) Explain the difference between covariance and correlation. The covariance matrix for two variables is given by

$$S = \begin{bmatrix} 10.2 & -45.6 \\ -45.6 & 8.1 \end{bmatrix}$$

Calculate the correlation between them and comment on the outcome. Briefly explain why two decorrelated variables need not be independent. (8 marks)

Correlation is a standardised measure of covariance [2]. The two are related by $\rho = \mathrm{cov}_{xy}/(\sigma_x \sigma_y)$, where $\sigma_x$ and $\sigma_y$ denote the standard deviations of $x$ and $y$, respectively. So $\rho = -45.6/(10.2 \times 8.1) = -0.55$. [2] (Identification of the correct elements in the covariance matrix, and calculation.) The two variables are therefore moderately anti-correlated. [2] Correlation is a linear measure, so the variables might be positively related in one region and negatively related in another, summing to zero; independence requires $p(x, y) = p(x)\,p(y)$, which is a stronger condition. [2] Standard material, but not all appreciate the differences.

(d) Explain what is meant by the terms generalisation error and over-fitting. How can over-fitting be combatted? (6 marks)

Generalisation error: the average error made on unseen test data after training. [2] Over-fitting: the phenomenon of learning the details of the training data rather than the trends. [2] It is combatted by controlling the complexity of the learning machines. [2]

(e) With the aid of a carefully labelled diagram, explain what is meant by a convex function. Why is it desirable for an error function to be convex? (8 marks)

Several definitions are possible, in terms of the secant line or the second derivative for functions of one variable; for multivariate functions, the eigenvalues of the Hessian are all positive, or the epigraph of the function is a convex set. [4] for a multivariate definition; for a univariate definition. Diagram: [2]. A convex error function is desirable because convex functions have only one minimum; therefore, if a minimum is located, it is the global minimum. [2]

(Total 40 marks)

SECTION B (Optional Questions)

Question 2

(a) Write the equation and draw a diagram describing the perceptron model. Specify all variables involved in the model (using formal notation) and describe their role. (10 marks)

The equation describing a perceptron reads:

$$y = \varphi(v) = \varphi\left(\sum_{i=1}^{d} w_i x_i + b\right). \quad [4]$$

The elements involved in the above expression are: [6]

• $\mathbf{x} = [x_1, x_2, \ldots, x_d]^T$ is the input vector (feature vector);
• $\mathbf{w} = [w_1, w_2, \ldots, w_d]^T$ are the synaptic weights;
• $b \in \mathbb{R}$ is called the bias;
• $v = \sum_{i=1}^{d} w_i x_i + b$ is called the local field, or pre-activation;
• $\varphi : \mathbb{R} \to \mathbb{R}$ is called the activation (or transfer) function;
• $y = \varphi(v)$ is the neuron output.

(b) Write the equations describing a three-layer MLP. Specify all variables and functions involved in the model and describe their role. Assume one hidden layer with L neurons and one output neuron. (10 marks)

We assume the bias is included in the network weights. Starting from the output neuron:

$$y = \varphi(v), \qquad v = \sum_{i=0}^{L} x_i \alpha_i. \quad (1)$$

[3] Each component $x_i$ in Eq. (1) is the output $y_i^{(H)}$ of the activation function of a neuron in the hidden layer, i.e., $x_i \equiv y_i^{(H)}$ [3]. Therefore, a three-layer MLP can be described as follows:

$$y = \varphi\Bigg(\sum_{i=0}^{L} \alpha_i \underbrace{\varphi\bigg(\sum_{j=0}^{d} w_{ji} x_j\bigg)}_{y_i^{(H)}}\Bigg), \quad (2)$$

where $d$ is the input dimensionality. [4]

(c) Write the update rule and describe the Least-Mean-Square (LMS) algorithm. (10 marks)

Least-Mean-Square (LMS) is an online learning algorithm used for finding the weights of a linear model. LMS updates the solution after the presentation of each input datum. [5] The LMS update rule reads:

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\, \mathbf{x}(k)\, e(k), \quad (3)$$

where $\mathbf{w}(k+1)$ denotes the updated weights at time-step $k+1$, $\mathbf{w}(k)$ are the current weights, $\mu > 0$ is the learning rate, $\mathbf{x}(k)$ is the current input, and $e(k) = d(k) - y(k)$ is the instantaneous error, given by the difference between the desired and computed outputs. [5]

(Total 30 marks)
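An editorial sketch (not part of the model answer) of the LMS rule in Eq. (3); the learning rate, epoch count and toy data are arbitrary choices:

```python
import numpy as np

def lms_fit(X, d, mu=0.01, epochs=20):
    """Learn the weights of a linear model y = w . x with the LMS update rule."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for k in range(X.shape[0]):
            e = d[k] - np.dot(w, X[k])   # instantaneous error e(k) = d(k) - y(k)
            w = w + mu * X[k] * e        # online update, Eq. (3)
    return w

# Toy usage: recover w = [2, -1] from noiseless data. A bias can be handled
# by appending a constant-1 column to X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
d = X @ np.array([2.0, -1.0])
print(lms_fit(X, d))
```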
Question 3

A start-up wishes to use machine learning to predict the polluting emissions in car exhaust from the characteristics of the car, such as engine size, fuel, driving speed, etc. They have collected a large data set of measured emissions and the corresponding features for each measurement.

(a) In supervised learning, an error function, $E(\mathbf{w})$, may be defined as:

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \log p(t_n \,|\, \mathbf{x}_n; \mathbf{w}).$$

State the meaning of each of the symbols $\mathbf{w}$, $N$, $\mathbf{x}_n$ and $t_n$ in this definition and carefully explain how the error function may be derived from the principle of maximum likelihood. (11 marks)

• $\mathbf{w}$: the parameters/weights of the model. [1]
• $N$: the number of training data pairs. [1]
• $\mathbf{x}_n$: the training features. [1]
• $t_n$: the training targets. [1]

The likelihood is $\prod_{n=1}^{N} p(t_n, \mathbf{x}_n \,|\, \mathbf{w})$. [1] Assuming that the targets are independent [2] and writing $p(t_n, \mathbf{x}_n \,|\, \mathbf{w}) = p(t_n \,|\, \mathbf{x}_n, \mathbf{w})\, p(\mathbf{x}_n)$ [2], taking negative logarithms yields the error function to be minimised (because log is monotonic [1]), once the constant due to $p(\mathbf{x}_n)$ is dropped. [1] Straightforward, but many find it difficult.

(b) The start-up uses a radial basis function neural network with a mean squared error function to learn the relationship between features and emissions. Explain what the use of the mean squared error function implies about the noise or measurement errors in the data. (4 marks)

The mean squared error function is derived from the assumption that the errors between the systematic model and the measurements are Gaussian distributed. [4] Discussed in lectures.

(c) Explain how the parameters of the radial basis function network may be determined with the mean squared error function. (7 marks)

Learning proceeds by selecting a number of data points as RBF centres and then performing linear regression on the basis-function activations. [3] This requires constructing a design/data matrix $X$ whose $(n, j)$-th entry is the activation of basis function $j$ by feature $\mathbf{x}_n$. The weights are found as $\mathbf{w} = X^{\dagger}\mathbf{y}$, where $X^{\dagger}$ is the pseudo-inverse and $\mathbf{y}$ is the vector of targets. [2] The regularisation constant is found by cross-validation. [2] Discussed in lectures and illustrated in a workshop.
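An editorial sketch of the procedure in (c), assuming Gaussian basis functions and randomly chosen centres; the width gamma and the number of centres are illustrative (in practice they, and any regularisation constant, would be set by cross-validation):

```python
import numpy as np

def design_matrix(X, centres, gamma=1.0):
    """(n, j)-th entry is the activation of basis function j by feature x_n."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def fit_rbf(X, y, n_centres=15, gamma=2.0, seed=0):
    """Choose centres among the data, then solve the least-squares problem."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=n_centres, replace=False)]
    Phi = design_matrix(X, centres, gamma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # pseudo-inverse solution w = Phi^+ y
    return centres, w

# Toy usage: fit a noisy sine curve under the mean squared error criterion.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
centres, w = fit_rbf(X, y)
pred = design_matrix(X, centres, gamma=2.0) @ w
print("RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```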
(d) In order to cope with occasional outliers the start-up decides to use a sum of absolute errors function. Explain why the sum of absolute errors function may help in this context. (4 marks)

The sum of absolute errors function corresponds to Laplacian-distributed errors [2], $p(t \,|\, \mathbf{x}; \mathbf{w}) \propto \exp(-|t - y(\mathbf{w})|)$, under which large deviations from the prediction are more likely and therefore do not have as large an influence on the weights as in the Gaussian case. [2]

(e) Suggest how the parameters of the radial basis function network could be learned using the sum of absolute errors. (4 marks)

Numerical optimisation, rather than linear algebra, must be used [2]: either the Nelder-Mead simplex method or a gradient-based method. [2] This was illustrated in a workshop for ordinary linear regression, but most candidates will not put the RBF training and optimisation ideas together.

(Total 30 marks)

Question 4

(a) Explain what is meant by the term recurrent neural network and what recurrent neural networks are used for. (4 marks)

Recurrent neural networks feed the output signal back to the inputs. [2] They are used for modelling and predicting time series. [2] Bookwork.

(b) Illustrating your answer with a diagram, explain what is meant by an echo state network. Your answer should include equations giving the update rule and output for an echo state network. (11 marks)

Diagram showing input layer, reservoir, output layer and feedback. [4] Discussed in lectures. The update rule is:

$$\mathbf{x}_t = \varphi(W^r \mathbf{x}_{t-1} + W^i \mathbf{u}_t + \boldsymbol{\epsilon}), \qquad \mathbf{z}_t = W^o \mathbf{x}_t, \quad [2]$$

where $\varphi(\cdot)$ is the activation function (typically, but not necessarily, a sigmoidal function), $\mathbf{x}_t$ is the ESN state at time $t$, $\mathbf{u}_t$ and $\mathbf{z}_t$ are the input and output at time $t$, respectively, and $\boldsymbol{\epsilon}$ is an additive white-noise term. $W^r$ are the weights of the recurrent layer (connections between hidden neurons) and $W^i$ are the weights between the input and the recurrent layer. [2] The $W^o$ weights connect the ESN state with the output and form the so-called read-out layer. [3] Standard material.

(c) Describe in detail how the read-out weights of echo state networks are obtained. You should include in your answer the expressions describing the related optimization problem and how it is solved. (11 marks)

The read-out weights $W^o$ are obtained by solving the following regularised [1] least-squares problem [2]:

$$\operatorname*{arg\,min}_{W \in \mathbb{R}^{N_r \times N_o}} \; \frac{1}{2}\,\|XW - \mathbf{t}\|^2 + \frac{\lambda}{2}\,\|W\|^2, \quad [3]$$

where $\lambda \ge 0$ is the regularisation parameter [1], $X \in \mathbb{R}^{N \times N_r}$ contains the ESN states produced by the action of an input of size $N$, and $\mathbf{t} \in \mathbb{R}^N$ is a vector containing the target values. [2] The closed-form solution is obtained as:

$$W^o = \left(X^{\top} X + \lambda^2 I\right)^{-1} X^{\top} \mathbf{t}, \quad (4)$$

where $I$ is the $N_r \times N_r$ identity matrix. [2]
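An editorial sketch of the ESN update rule and the closed-form read-out of Eq. (4). The reservoir construction (random weights rescaled to spectral radius 0.9), reservoir size and lambda are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

def esn_states(u, n_reservoir=100, seed=0):
    """Drive a random reservoir with a 1-D input sequence; collect the states."""
    rng = np.random.default_rng(seed)
    Wr = rng.normal(size=(n_reservoir, n_reservoir))
    Wr *= 0.9 / np.max(np.abs(np.linalg.eigvals(Wr)))  # assumed spectral radius 0.9
    Wi = rng.normal(size=n_reservoir)
    states = np.zeros((len(u), n_reservoir))
    x = np.zeros(n_reservoir)
    for t, ut in enumerate(u):
        x = np.tanh(Wr @ x + Wi * ut)  # update rule (noise term omitted here)
        states[t] = x
    return states

def readout(X, t, lam=1e-2):
    """Closed-form read-out weights, Eq. (4): (X^T X + lam^2 I)^(-1) X^T t."""
    return np.linalg.solve(X.T @ X + lam**2 * np.eye(X.shape[1]), X.T @ t)

# Toy usage: one-step-ahead prediction of a sine wave.
u = np.sin(0.1 * np.arange(500))
X = esn_states(u[:-1])
Wo = readout(X, u[1:])
print("train RMSE:", np.sqrt(np.mean((X @ Wo - u[1:]) ** 2)))
```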
(d) Explain why regularisation is used in finding the read-out weights. (4 marks)

Regularisation improves the numerical stability of the matrix-inversion operation and reduces the magnitudes of the entries in $W^o$, thus mitigating sensitivity to noise and overfitting. [4]

(Total 30 marks)

Question 5

(a) Explain the difference between a hierarchical clustering algorithm and a clustering algorithm that yields a partition. Explain when you might prefer to use a hierarchical algorithm. (5 marks)

A hierarchical algorithm gives clusters at different resolutions, while a partitional algorithm simply divides the data into a fixed number of clusters. [3] Hierarchical algorithms are useful for examining the structure of the data at different resolutions and when the number of clusters is unknown. [2]

(b) Explain how agglomerative clustering is performed. (6 marks)

Agglomerative clustering begins with each data point as a single cluster. [1] At each stage the two closest clusters are merged to form a new cluster. [2] The algorithm terminates when a single cluster remains. [1] Distances between clusters can be measured in a variety of ways, for example as the minimum or maximum distance between any two points, one from each cluster. [2] Standard, but many will not appreciate the different distance measures.

(c) Carefully defining the symbols you use, give an expression for the objective function minimised by the k-means algorithm. (6 marks)

Objective function:

$$\sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|_2^2, \quad [2]$$

where $\|\cdot\|_2^2$ is the squared Euclidean norm, $C_j$ is the $j$-th cluster in the partition $\mathcal{P}$, $k = |\mathcal{P}|$, and $\boldsymbol{\mu}_j$, $j = 1, \ldots, k$, are the $k$ cluster representatives. [2] If clusters are represented by centroids, then $\boldsymbol{\mu}_j$ is defined as:

$$\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i. \quad [2]$$

(d) Giving pseudo-code, explain how the k-means clustering is achieved. (7 marks)

Here is detailed pseudo-code, but full marks for a less formal presentation that includes the initialisation [2], assignment [2], update [2] and termination [1] steps. (A runnable version is sketched after this question.)

Input: data $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, partition order $k$, a distance function $d : X \times X \to \mathbb{R}_0^+$, maximum number of iterations MAX
Output: a partition $\mathcal{P} = \{C_1, \ldots, C_k\}$

1: $t = 0$, $\mathcal{P} = \emptyset$
2: Initialisation: assign the $k$ cluster representatives as randomly chosen $\mathbf{x}_i$
3: loop
4:   $t = t + 1$
5:   Assignment step: assign each sample $\mathbf{x}_j$ to the cluster with the closest representative:
6:     $C_i^{(t)} = \{\mathbf{x}_j : d(\mathbf{x}_j, \boldsymbol{\mu}_i) \le d(\mathbf{x}_j, \boldsymbol{\mu}_h) \text{ for all } h = 1, \ldots, k\}$
7:   Update step: update the representatives:
8:     $\boldsymbol{\mu}_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{\mathbf{x}_j \in C_i^{(t)}} \mathbf{x}_j$
9:   Update the partition with the modified clusters: $\mathcal{P}^t = \{C_1^{(t)}, \ldots, C_k^{(t)}\}$
10:  if $t \ge$ MAX or $\mathcal{P}^t = \mathcal{P}^{t-1}$ then
11:    return $\mathcal{P}^t$
12:  end if
13: end loop

(e) Two users run the k-means algorithm on the same data set, but find that it gives two different partitions. Explain how this may occur and what may be done in practice to cope with this phenomenon. (6 marks)

The clusters are initialised at random, so the algorithm may converge to different local minima. [3] In practice the algorithm is run several times from different initialisations and the partition with the minimum objective function is returned. [3] Standard, but not appreciated by many.

(Total 30 marks)
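As flagged in (d), here is an editorial, runnable sketch combining the k-means steps above with the random-restart strategy of (e); the toy data and parameter values are illustrative only:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Initialisation, assignment, update and termination steps of k-means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()  # random data points
    labels = None
    for _ in range(max_iter):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)                     # assignment step
        if labels is not None and np.array_equal(new_labels, labels):
            break                                             # partition unchanged
        labels = new_labels
        for j in range(k):                                    # update step
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(axis=0)
    sse = ((X - mu[labels]) ** 2).sum()                       # k-means objective
    return labels, mu, sse

def kmeans_restarts(X, k, n_restarts=10):
    """Run k-means from several initialisations; keep the lowest objective (part e)."""
    return min((kmeans(X, k, seed=s) for s in range(n_restarts)), key=lambda r: r[2])

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, mu, sse = kmeans_restarts(X, k=2)
print("objective:", sse)
```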
End of Paper