University of Waterloo CS480/680 2021 Winter CS480/680: Introduction to Machine Learning Homework 1 Due: 11:59 pm, January 28, 2021, submit on Crowdmark (yet to be set up, stay tuned). Last Updated: January 17, 2021 Include your name and student number! Submit your writeup in pdf and all source code in a zip file (with proper documentation). Write a script for each programming exercise so that the TAs can easily run and verify your results. Make sure your code runs! [Text in square brackets are hints that can be ignored.] Exercise 1: Perceptron Implementation (5 pts) Convention: All algebraic operations, when applied to a vector or matrix, are understood to be element-wise (unless otherwise stated). Algorithm 1: The perceptron algorithm. Input: X ∈ Rn×d, y ∈ {−1, 1}n, w = 0d, b = 0, max pass ∈ N Output: w, b,mistake 1 for t = 1, 2, . . . ,max pass do 2 mistake(t)← 0 3 for i = 1, 2, . . . , n do 4 if yi(〈xi,w〉+ b) ≤ 0 then 5 w← w + yixi // xi is the i-th row of X 6 b← b+ yi 7 mistake(t)← mistake(t) + 1 Implement the perceptron in Algorithm 1. Your implementation should take input asX = [x>1 , . . . ,x > n ] > ∈ Rn×d, y ∈ {−1, 1}n, an initialization of the hyperplane parameters w ∈ Rd and b ∈ R, and the maximum num- ber of passes of the training set [suggested max pass = 500]. Run your perceptron algorithm on the spambase dataset (use the version on the course website), and plot the number of mistakes (y-axis) w.r.t. the number of passes (x-axis). Ans: Exercise 2: Perceptron Questions (5 pts) 1. (3 pts) The perceptron algorithm makes an update every time it witnesses a mistake. What if it makes an update on every point, regardless of whether or not that point is correctly classified? For simplicity, consider a setting where where b is fixed to 0. Give an example of an infinite sequence of points (xi, yi) with the following properties: • The sequence is strictly linearly separable with b = 0 (i.e., the margin is some constant γ > 0), • The sequence has max ‖xi‖2 ≤ 1, • The modified perceptron algorithm makes an infinite number of mistakes on this sequence. Prove that it has these properties. Note that the perceptron convergence theorem and the first two conditions imply that, at some point, the unmodified perceptron algorithm would only make a finite number of mistakes. Ans: 2. (1 pt) Give examples of where the perceptron algorithm converges to a 0 margin halfspace, and a maxi- mum margin halfspace. Ans: 3. (1 pt) Suppose that in each iteration of the perceptron algorithm, a point is chosen uniformly at random from the dataset. Show how the perceptron algorithm can be viewed as an instantiation of stochastic gradient descent (SGD). In particular, you must provide a loss function and a learning rate such that SGD with these parameters and perceptron are identical. Gautam Kamath ([email protected]) ©2021 University of Waterloo CS480/680 2021 Winter Ans: Exercise 3: Regression Implementation (8 pts) Recall that ridge regression refers to min w∈Rd,b∈R loss︷ ︸︸ ︷ 1 2n‖Xw + b1− y‖22︸ ︷︷ ︸ error +λ‖w‖22, (1) where X ∈ Rn×d and y ∈ Rn are the given dataset and λ > 0 is the regularization hyperparameter. 1. (1 pt) Show that the derivatives are ∂ ∂w = 1nX >(Xw + b1− y) + 2λw (2) ∂ ∂b = 1n1 >(Xw + b1− y). (3) Ans: 2. (2 pts) Implement the gradient descent algorithm for solving ridge regression. The following incomplete pseudo-code may of help. Test your implementation on the Boston housing dataset (to predict the median house price, i.e., y). Use the train and test splits provided on course website. Try λ ∈ {0, 10} and report your training error, training loss and test error. [Your training loss should monotonically decrease during iteration; if not try to tune your step size η, e.g. make it smaller.] Algorithm 2: Gradient descent for ridge regression. Input: X ∈ Rn×d, y ∈ Rn, w0 = 0d, b0 = 0, max pass ∈ N, η > 0, tol > 0 Output: w, b 1 for t = 1, 2, . . . ,max pass do 2 wt ← 3 bt ← 4 if ‖wt −wt−1‖ ≤ tol then // can use other stopping criteria 5 break 6 w← wt, b← bt Ans: For the next part, you may use the Python package scikit-learn. 3. (5 pts) Train (unregularized) linear regression, ridge regression, and lasso on the mystery datasets A, B, and C on the course website (using X train and Y train for each dataset). For ridge regression and lasso, use parameters 1 and 10 (note that you will have to divide these by n for lasso in scikit-learn – why?). Report the average mean squared error on the test set for each method. Which approach performs best in each case? Plot the five parameter vectors obtained for each dataset on the same histogram, so they’re all visible at once (change the opacity setting for the bars if necessary): specifically, for each parameter vector, plot a histogram of its value in each coordinate. Given which approach performs best, and how its parameter histogram looks, how do you suspect the true parameter vector and responses might have been generated? Ans: Exercise 4: An Alternative to Least Squares? (3 pts) Suppose we are in a setting with a dataset X ∈ Rn×d, and labels are generated according to Y = XW+ε, where W ∈ Rd, Y ∈ Rn, and ε ∈ Rn is a random vector, where all entries are independent random variables with 0 Gautam Kamath ([email protected]) ©2021 University of Waterloo CS480/680 2021 Winter mean and variance σ2 (you can imagine Gaussian, but it isn’t necessary). As we saw in class, the least-squares solution Wˆ can be written as Wˆ = (XTX)−1XTY – this is a linear transformation of the response vector Y . Consider some different linear transformation ( (XTX)−1XT +N ) Y , where N ∈ Rd×n is a non-zero matrix. 1. (1 pt) Show that the expected value of this linear transformation is (Id + NX)W . Conclude that its expected value is W if and only if NX = 0. (1 pt) Ans: 2. (2 pts) Compute the covariance matrix of this linear transformation when NX = 0, and show that it is equal to σ2(XTX)−1 + σ2NNT . Since the former term is the covariance of the least squares solutiona and the latter matrix is positive semi-definite, this implies that this alternative estimator only increased the variance of our estimate. Ans: aVerify this for yourself, but no need to submit it. Exercise 5: Sample Statistics (2 pts) 1. (1 pt) Suppose there is a dataset x1, . . . , xn sampled from a distribution with mean µ and variance σ 2. Compute the expected value of the sample mean: x¯ = 1n ∑n i=1 xi. Describe any modifications that might be required to make the expected value µ (recall that µ and σ2 are unknown). Ans: 2. (1 pt) Suppose there is a dataset x1, . . . , xn sampled from a distribution with mean µ and variance σ 2. Compute the expected value of the sample variance: 1n ∑n i=1 (xi − x¯)2, where x¯ is the sample mean from the previous part. Describe any modifications that might be required to make the expected value σ2 (recall that µ and σ2 are unknown). Ans: Gautam Kamath ([email protected]) ©2021 欢迎咨询51作业君