
UNIVERSITY OF SOUTHAMPTON

COMP3206W1

SEMESTER 1 EXAMINATION 2017-2018

MACHINE LEARNING

DURATION 120 MINS (2 Hours)

All answers must be written within the designated spaces in this booklet. No credit will be given for answers presented elsewhere. You are advised to write using a soft pencil so that you may readily correct mistakes with an eraser.

This paper contains 5 questions. Answer ANY FOUR, each worth 25% of the total marks of the examination. An outline marking scheme is shown in brackets to the right of each question. This examination provides 80% of the module's marks.

University approved calculators MAY be used. A foreign language translation dictionary (paper version) is permitted provided it contains no notes, additions or annotations.

Question 1

(a) You are given a dataset $\{(x_i, y_i)\}_{i=1}^N$ where $x_i$ is a $p$-dimensional vector of real-valued features associated with an individual $i$ and $y_i$ is a real-valued "output" variable for $i$. Show how you would set up a linear regression model by expressing $y_i$ as a function of $x_i$. [4 marks]

(b) Express the learning problem in terms of a design matrix $A$, so that $Aw = y$. You must outline what the size of the design matrix is, and what each of its rows represents. The vector of residuals is orthogonal to the column space of $A$. Show how this leads to the solution, which is written out in terms of the pseudo-inverse
$$A^+ = (A^T A)^{-1} A^T.$$
[6 marks]

(c) If the design matrix has a singular value decomposition
$$A = U \Sigma V^T = \sum_k u_k \sigma_k v_k^T,$$
where $U = (u_1, \dots, u_r)$ is a matrix composed of $N \times 1$ column vectors and $V = (v_1, \dots, v_r)$ is a matrix of $p$-dimensional column vectors, express $A^+$ in terms of $U$, $V$ and $\Sigma$. Rewrite $y$, $w$ as linear combinations of the columns $u_i$, $v_i$ of the matrices $U$, $V$ respectively:
$$w = \sum_i \alpha_i v_i, \qquad y = \sum_i \beta_i u_i$$
to express $w$ in terms of the elements in the SVD.
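As an illustrative sketch of the two solution routes above (assuming NumPy; the dimensions, data, and random seed are hypothetical), the pseudo-inverse and SVD formulas give the same weight vector:

```python
import numpy as np

# A hypothetical small problem: N = 50 individuals, p = 3 features.
rng = np.random.default_rng(0)
N, p = 50, 3
A = rng.normal(size=(N, p))             # design matrix: one row per individual
w_true = np.array([1.0, -2.0, 0.5])
y = A @ w_true + 0.01 * rng.normal(size=N)

# Pseudo-inverse solution A+ = (A^T A)^{-1} A^T.
w_pinv = np.linalg.inv(A.T @ A) @ A.T @ y

# The same solution from the SVD A = U Sigma V^T:
# A+ = V Sigma^{-1} U^T, i.e. w = sum_k (u_k^T y / sigma_k) v_k,
# so tiny singular values sigma_k amplify the noise in u_k^T y.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
w_svd = Vt.T @ ((U.T @ y) / s)
```

The division by each $\sigma_k$ in the last line is where small singular values make the learned $w$ sensitive to noise.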
Comment on how the singular values in $\Sigma$ might affect the outcome of the learning. [10 marks]

(d) Show how the solution to the linear regression problem can be obtained from an optimisation approach where a weight vector $w$ is chosen to minimise the loss function
$$l(w) = (Aw - y)^T (Aw - y) + \lambda w^T w.$$
Why is the 'regularization' term involving $\lambda$, which was not indicated in the dataset $(x_i, y_i)$, introduced? [5 marks]

Question 2

(a) For a probability distribution $p$ over an event space $\mathcal{X}$, $p := \{p_i \mid i \in \mathcal{X}\}$, explain why the entropy $H(p)$,
$$H(p) = \sum_i p_i \log\left(\frac{1}{p_i}\right),$$
is often interpreted as the average degree of 'surprise'. [3 marks]

(b) For 2 probability distributions $p = \{p_i \mid i \in \mathcal{X}\}$ and $q = \{q_i \mid i \in \mathcal{X}\}$ over the same event space $\mathcal{X}$, the cross-entropy $H(p, q)$ and the Kullback-Leibler (KL) divergence $KL(p \| q)$ are defined as:
$$H(p, q) = \sum_i p_i \log\left(\frac{1}{q_i}\right) \quad \text{and} \quad KL(p \| q) = H(p, q) - H(p) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right).$$
Provide some intuition for what each of the two metrics over pairs of probability distributions captures. You may treat the distribution $p$ as the one representing 'ground truth'. [7 marks]

(c) You are given a sequence $x := \{x^{(1)}, \dots, x^{(N)}\}$ of heads ($x^{(i)} = H$) and tails ($x^{(i)} = T$) which are the outcomes of $N$ tosses of a (potentially biased) coin. All possible outcomes of $N$ coin tosses would constitute the event space $\mathcal{X}$. A binomial distribution $B(N, \theta)$ sets the probability of occurrence of $n_1$ events of type 1, and $n_2 = N - n_1$ of type 2, as
$$P(n_1, n_2 \mid N, \theta) = \frac{N!}{n_1! \, n_2!} \theta^{n_1} (1 - \theta)^{n_2}.$$
Describe how you would fit the data $x$ to a binomial distribution using maximum likelihood estimation and find the result to be the same as the empirical distribution $\tilde{p} = (\tilde{p}_H, \tilde{p}_T)$:
$$\theta^* = \frac{n_H}{N} =: \tilde{p}_H.$$
[8 marks]

(d) Use Bayes' theorem to write the posterior probability $P(\theta \mid x)$ of the parameters $\theta$ upon observing some data $x$.
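As an illustrative sketch of the maximum-likelihood estimate $\theta^* = n_H/N$ and of how a prior's pseudo-counts shift the posterior estimates (assuming NumPy; the coin data, the prior parameters $\alpha = \beta = 2$, and the random seed are all hypothetical):

```python
import numpy as np

# Hypothetical coin data: N tosses with n_H heads.
rng = np.random.default_rng(5)
theta_true = 0.7
N = 10
x = rng.random(N) < theta_true          # True = heads
n_H = int(x.sum())

# Maximum-likelihood estimate: theta* = n_H / N (the empirical frequency).
theta_mle = n_H / N

# A Beta(alpha, beta) prior contributes alpha - 1 "pseudo-heads" and
# beta - 1 "pseudo-tails" to the MAP estimate, pulling it away from the
# extremes 0 and 1, which matters most for small N.
alpha, beta = 2.0, 2.0
theta_map = (n_H + alpha - 1) / (N + alpha + beta - 2)

# The posterior mean uses the pseudo-counts alpha, beta directly.
theta_post_mean = (n_H + alpha) / (N + alpha + beta)
```

With a symmetric prior the MAP estimate always lies at least as close to $1/2$ as the MLE does, illustrating how pseudo-counts temper estimates from small samples.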
Discuss how a conjugate Beta prior
$$P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\, d\theta} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1}$$
introduces "pseudo-counts" to affect the estimation of parameters of the binomial distribution as above, and combats the problem of overfitting, particularly for data sets of small size $N$. (You won't need this information, but the definition of $\Gamma(x)$ is given below.)
$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt, \quad \text{and} \quad \Gamma(n) = (n-1)! \text{ for integer } n.$$
[7 marks]

Question 3

(a) You are given a dataset $X = \{x_i\}_{i=1}^N$ where $X$ is a $(N \times p)$ matrix, each row of which, $x_i$, is a $p$-dimensional vector of real-valued features associated with an individual $i$. Explain in detail how you would use Principal Components Analysis (PCA) to reduce the dimensionality $p$ of the feature set. You have to (i) introduce the covariance matrix, (ii) discuss what role its eigenvalues play, (iii) explain how to perform the projection onto a low-dimensional subspace and (iv) comment on the magnitudes $p$ and $N$. [12 marks]

(b) For classification problems where there is a training set of labelled data, explain why dimensionality reduction using PCA may give rise to poor classification outcomes. You may draw a figure to illustrate your arguments. [4 marks]

(c) A uniform random variable $U(a, b)$ has mean $\frac{a+b}{2}$ and variance $\frac{(b-a)^2}{12}$. For any sample $x$ of $N$ numbers $x = \{x_1, x_2, \dots, x_N\}$ drawn from a uniform distribution $U(0, 2)$, its average $\langle x \rangle$ is a random variable. What are the mean and variance of this random variable? What is the relevance to machine learning implied by the dependence of the result on the sample size $N$? [4 marks]

(d) A 2-dimensional Gaussian distribution is defined by the probability density function of $X = (X_1, X_2)^T$, parameterised by its mean $\mu = (\mu_1, \mu_2)^T$ and variance-covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}:$$
$$p(X \mid \mu, \Sigma) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)\right\}.$$
Draw 3 contour plots of equiprobable values of $(X_1, X_2)$ for the cases (a) $\rho = 0$, (b) $\rho < 0$ and (c) $\rho > 0$, providing reasons for doing so.
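As an illustrative sketch of why the sign of $\rho$ determines the tilt of the equiprobable contours: the ellipse axes are the eigenvectors of $\Sigma$ (assuming NumPy; the values of $\sigma_1$, $\sigma_2$ and $\rho$ are hypothetical):

```python
import numpy as np

# The equiprobable contours of a 2-D Gaussian are ellipses aligned with
# the eigenvectors of Sigma, so the sign of rho controls their tilt.
s1, s2 = 1.5, 1.0   # hypothetical standard deviations

def principal_axis(rho):
    """Eigenvector of Sigma with the largest eigenvalue (major ellipse axis)."""
    Sigma = np.array([[s1**2, rho * s1 * s2],
                      [rho * s1 * s2, s2**2]])
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs[:, np.argmax(vals)]

v0 = principal_axis(0.0)      # rho = 0: axis-aligned ellipses
vp = principal_axis(0.8)      # rho > 0: major axis tilted along x1 = x2
vn = principal_axis(-0.8)     # rho < 0: major axis tilted along x1 = -x2
```

For $\rho = 0$ one component of the major axis is zero (axis-aligned contours); for $\rho > 0$ its two components share a sign, and for $\rho < 0$ they have opposite signs.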
[5 marks]

Question 4

(a) Describe the k-means clustering algorithm. [7 marks]

(b) In Fisher's Linear Discriminant Analysis (LDA), training data, $x_i^A$ and $x_j^B$ from 2 classes $A$ and $B$, is projected along some vector $w$ so that $x_i^A$, $x_j^B$ are mapped into
$$y_i^A := w^T x_i^A, \; i = 1, \dots, N_A, \quad \text{and} \quad y_j^B := w^T x_j^B, \; j = 1, \dots, N_B$$
respectively. Mean vectors and variance-covariance matrices are computed from the training data. These empirical means are denoted $m_A$ and $m_B$, and the variance-covariance matrices are called $S_A$ and $S_B$ respectively. Let $\mu_A$ and $\mu_B$ be the (scalar) means for the values $\{y_i^A\}$ and $\{y_j^B\}$, and $\sigma_A$, $\sigma_B$ the corresponding variances. These will, of course, depend on the choice of vector $w$. What is the quantity that LDA seeks to maximise in order to achieve classification accuracy? Explain why that is a good choice. [5 marks]

(c) Express this quantity as a ratio of the form
$$R(w) = \frac{w^T U w}{w^T V w}$$
by finding the suitable $U$, $V$. [4 marks]

(d) Show by using
$$\frac{\partial R(w)}{\partial w} = \frac{2}{(w^T V w)^2}\left[(w^T V w)\, U w - (w^T U w)\, V w\right]$$
that the optimal vector $w$ is in the direction of $V^{-1}(m_A - m_B)$. If the data from each class $A$ and $B$ were scattered isotropically, $V$ would be proportional to the identity matrix. Hence, interpret the result. [4 marks]

(e) You are given a scatter plot of 2-dimensional labelled data (labels "x" and "o") as shown in the figure, and equiprobable contour plots of their best-fit Gaussian distributions.

[Figure: scatter plot of the "x" and "o" classes on axes running from -6 to 6, with overlaid equiprobable Gaussian contours.]

Explain how you would find an optimal direction for projecting the data onto, to linearly separate the data into two classes. You may use the answers to the previous parts to comment on your choices. [5 marks]

Question 5

Suppose you are given the 2-class dataset in the figure below.

[Figure: scatter plot of "+" (Class 1) and "*" (Class 2) points.]

(a) Would a perceptron be a suitable classifier for this dataset?
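For reference, the standard perceptron learning rule can be sketched as follows; the property at issue for this dataset is that the rule is only guaranteed to converge when the classes are linearly separable (a minimal sketch assuming NumPy; the toy data and seed are hypothetical and deliberately separable):

```python
import numpy as np

# Perceptron learning rule on a hypothetical linearly separable
# 2-class set (labels +1 / -1); the bias is folded into the weights.
rng = np.random.default_rng(4)
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))    # Class 1
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))  # Class 2
X = np.vstack([X1, X2])
X = np.hstack([X, np.ones((40, 1))])        # append constant bias input
t = np.array([1.0] * 20 + [-1.0] * 20)

w = np.zeros(3)
for epoch in range(100):
    errors = 0
    for xi, ti in zip(X, t):
        if ti * (w @ xi) <= 0:              # misclassified (or on boundary)
            w += ti * xi                    # perceptron update
            errors += 1
    if errors == 0:                         # converged: every point correct
        break

train_errors = int(np.sum(np.sign(X @ w) != t))
```

On separable data like this the loop terminates with zero training errors; on overlapping classes the updates never stop, which is the relevant consideration for the figure above.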
Briefly explain your answer. [3 marks]

(b) Would a neural network with 1 hidden layer and no bias be a suitable classifier for this dataset? Briefly explain your answer. [3 marks]

(c) Assume you now have a fully functional neural network architecture in place and you are training your model. How can you check to see if you are overfitting? Explain exactly what you could plot from your output and what on this plot may indicate overfitting. [7 marks]

(d) Now assume you are working with a different dataset of 1000 images. These images are from the MNIST dataset and consist of handwritten digits from 0 to 9. Some samples are given below.

[Figure: sample MNIST handwritten-digit images.]

Each one of the images consists of 28 x 28 pixels, each with a value of 0 or 1, and you want to build a neural network to classify the digits. Note, for any of the answers below, if you do not have a calculator and cannot work out the final number you will not be penalised; simply write out the simplest form of the equation to the answer of the question. How many inputs would you need in your neural network? [3 marks]

(e) How many neurons would you need in your output layer? [3 marks]

(f) Explain how you would know if your model has converged. [6 marks]

END OF PAPER
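As an illustrative check of the network sizing in parts (d) and (e): flattening each $28 \times 28$ image gives $784$ inputs, and one output neuron per digit class gives $10$ outputs. A minimal sketch assuming NumPy; the hidden width, the random weights, and the batch of binary images are hypothetical placeholders:

```python
import numpy as np

# Each 28 x 28 binary image is flattened into one input vector,
# and there is one output neuron per digit class 0-9.
n_inputs = 28 * 28        # 784
n_outputs = 10
n_hidden = 32             # hypothetical hidden-layer width

rng = np.random.default_rng(3)
batch = rng.integers(0, 2, size=(1000, n_inputs)).astype(float)

# One-hidden-layer forward pass with random placeholder weights.
W1 = rng.normal(scale=0.01, size=(n_inputs, n_hidden))
W2 = rng.normal(scale=0.01, size=(n_hidden, n_outputs))
h = np.tanh(batch @ W1)
logits = h @ W2
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```

The softmax output has shape (1000, 10): one probability per digit class for each of the 1000 images.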