Tutoring Case: GR5241

May 15, 2020

STAT GR5241 Spring 2020 Final

Name: UNI:

The final exam is open notes, open book, and open internet. Students are not allowed to communicate with anyone about the final, with the exception of their respective TA or the instructor. Please solve the exam on your own paper and upload the completed final as a single PDF document on Canvas by Saturday, May 9th at 1:00pm (NYC time). See the final exam submission guidelines for more information. Late exams will not be accepted, and students who don't follow the submission guidelines will be significantly penalized. Cheating will result in a score of zero and lead to the letter grade F.

Every student must write the following phrase on their cover page, filling in their printed name, signature, and date:

"I will not engage in academically dishonest activities for the STAT GR5241 final exam. Signature: Date:"

Problem 1: AdaBoost [20 points]

In this problem we perform the AdaBoost classification algorithm described in the lecture notes and homework 5. The dataset of interest consists of n = 100 training cases, p = 2 continuous features x_1, x_2, and a dichotomous response Y labeled Y = -1 and Y = 1. The training data is displayed in Figure 1, which shows x_2 versus x_1 split by Y. Figure 1 also displays the location of a single test case x_test and training case x_17.

The weak learner is a decision stump, i.e., for the b-th boosting iteration:

\[
c_b(x \mid j, \theta) =
\begin{cases}
+1 & \text{if } x_j > \theta \\
-1 & \text{if } x_j \le \theta
\end{cases}
\]

The decision stump is trained by minimizing weighted 0-1 loss:

\[
(j^{(b)}, \theta^{(b)}) := \arg\min_{j,\theta} \frac{\sum_{i=1}^{n} w_i I\{y_i \neq c_b(x_i \mid j, \theta)\}}{\sum_{i=1}^{n} w_i}
\]

[Figure 1: Problem 1 training data]

Our AdaBoost classifier is constructed from B = 8 weak learners. For boosting iterations b = 1, 2, ..., 8, the trained decision stumps c_1, c_2, ..., c_8 are:

(j^{(1)}, \theta^{(1)}) = (2, 0.5016)
(j^{(2)}, \theta^{(2)}) = (1, 0.6803)
(j^{(3)}, \theta^{(3)}) = (2, 0.9128)
(j^{(4)}, \theta^{(4)}) = (2, 0.5016)
(j^{(5)}, \theta^{(5)}) = (1, 0.4986)
(j^{(6)}, \theta^{(6)}) = (1, 0.9278)
(j^{(7)}, \theta^{(7)}) = (2, 0.5016)
(j^{(8)}, \theta^{(8)}) = (1, 0.4986)

The weighted errors \epsilon_b are:

\epsilon_1 = 0.1400, \epsilon_2 = 0.1221, \epsilon_3 = 0.2018, \epsilon_4 = 0.1784, \epsilon_5 = 0.1475, \epsilon_6 = 0.1670, \epsilon_7 = 0.1760, \epsilon_8 = 0.1821

Solve problems (1.i)-(1.ii):

1.i [10 points] Classify the test case x_test = (0.600, 0.395)^T based on the trained AdaBoost model. Use all B = 8 weak learners to estimate \hat{Y}_test. Show the details of your calculation for full credit.

1.ii [10 points] Does the structure of the training data (Figure 1) provide an advantage or disadvantage when using AdaBoost with decision stumps? Describe your answer in a few sentences.

Problem 2: Neural Networks [30 points]

In this problem we fit a neural network to perform classification on the same n = 100 training data from Problem 1. The dichotomous response Y is labeled Y = 0 and Y = 1, instead of -1 and 1. Our neural network consists of a single hidden layer with d = 2 derived features, sigmoid activation function, and softmax output function.

[Schematic: inputs x_1, x_2 feed hidden units h_1, h_2 (biases b^{[1]}_1, b^{[1]}_2, weights w^{[1]}_{11}, w^{[1]}_{12}, w^{[1]}_{21}, w^{[1]}_{22}), which feed outputs y_a, y_b (biases b^{[2]}_1, b^{[2]}_2, weights w^{[2]}_{11}, w^{[2]}_{12}, w^{[2]}_{21}, w^{[2]}_{22}).]

The objective or total cost Q(\theta) is cross-entropy:

\[
Q(\theta) = -\sum_{i=1}^{n} \Big( y_i \log f_1(x_i) + (1 - y_i) \log f_2(x_i) \Big) = \sum_{i=1}^{n} Q_i(\theta)
\]

Please read: regarding the above notation, f_1(x_i) represents the first row and i-th column of matrix P from the lecture notes (NN Backpropagation.pdf). Similarly, f_2(x_i) represents the second row and i-th column of matrix P. I should have been more careful when defining cross-entropy in that set of notes. However, the forward pass and backpropagation algorithms provided are still correct.

The neural network is trained by minimizing Q(\theta) with respect to \theta, where the parameter \theta is the collection of weights and biases W^{[1]}, b^{[1]}, W^{[2]}, and b^{[2]}.
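For intuition, the forward pass of the single-hidden-layer network described above (sigmoid activation, softmax output) can be sketched in Python as follows. This is the editor's illustration of the standard computation, not part of the exam; the function names are the editor's, and the exam asks for handwritten steps rather than code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Forward pass: input x (2,) -> hidden h (2,) -> output probabilities f (2,)."""
    h = sigmoid(W1 @ x + b1)   # derived features h1, h2
    f = softmax(W2 @ h + b2)   # class probabilities (f1, f2)
    return f
```

Applied with the trained weights, the two entries of `forward(x, W1, b1, W2, b2)` are the softmax class probabilities, and the predicted class is the larger of the two.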
The quantity Q_i(\theta) represents the i-th training case's contribution to the total cross-entropy Q(\theta). The neural network is trained via gradient descent, yielding the estimated weights:

\[
\hat{b}^{[1]} = \begin{pmatrix} \hat{b}^{[1]}_1 \\ \hat{b}^{[1]}_2 \end{pmatrix}_{(2 \times 1)} = \begin{pmatrix} -5.7840 \\ 3.4478 \end{pmatrix}, \quad
\hat{W}^{[1]} = \begin{pmatrix} \hat{w}^{[1]}_{11} & \hat{w}^{[1]}_{12} \\ \hat{w}^{[1]}_{21} & \hat{w}^{[1]}_{22} \end{pmatrix}_{(2 \times 2)} = \begin{pmatrix} -17.7838 & 43.4752 \\ -7.5506 & -0.5264 \end{pmatrix}
\]

\[
\hat{b}^{[2]} = \begin{pmatrix} \hat{b}^{[2]}_1 \\ \hat{b}^{[2]}_2 \end{pmatrix}_{(2 \times 1)} = \begin{pmatrix} -2.0010 \\ 2.0030 \end{pmatrix}, \quad
\hat{W}^{[2]} = \begin{pmatrix} \hat{w}^{[2]}_{11} & \hat{w}^{[2]}_{12} \\ \hat{w}^{[2]}_{21} & \hat{w}^{[2]}_{22} \end{pmatrix}_{(2 \times 2)} = \begin{pmatrix} 25.6367 & -66.6665 \\ -26.3790 & 68.6225 \end{pmatrix}
\]

Solve problems (2.i)-(2.iv): The following problems can be simplified using matrix multiplication. If you solve these questions in R or Python, please include all relevant steps in your handwritten solutions but do not include the code.

2.i [10 points] Classify the test case x_test = (0.600, 0.395)^T based on the trained neural network. Show the details of your calculation for full credit.

2.ii [5 points] The 17th case of the training data is x_17 = (0.4474, 0.8764)^T with label y_17 = 0. Compute the cost of case 17, i.e., compute Q_17(\hat{\theta}).

2.iii [10 points] Note that the gradient of Q(\theta) can be expressed as

\[
\nabla Q(\theta) = \sum_{i=1}^{n} \nabla Q_i(\theta).
\]

Use the backpropagation algorithm to compute \nabla Q_17(\hat{\theta}). This quantity represents the 17th case's contribution to the full gradient of Q(\theta), evaluated at the point \theta = \hat{\theta}. Show the details of your calculation for full credit.

2.iv [5 points] What should \nabla Q(\hat{\theta}) approximately equal? Justify your answer in one or two sentences.

Problem 3: Smoothing Splines [10 points]

Consider two curves, \hat{g}_1 and \hat{g}_2, defined by

\[
\hat{g}_1 = \arg\min_{g} \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(x)]^2 \, dx \right),
\]
\[
\hat{g}_2 = \arg\min_{g} \left( \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(x)]^2 \, dx \right),
\]

where g^{(m)} represents the m-th derivative of g and \lambda > 0 is the tuning parameter. Similar to linear regression, the training and testing RSS are respectively defined as

\[
RSS_{train} = \sum_{i=1}^{n} (y_i - \hat{g}(x_i))^2, \quad
RSS_{test} = \sum_{i=1}^{n^*} (y^*_i - \hat{g}(x^*_i))^2,
\]

where \hat{g} is fit on the training data.
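The training and test RSS defined above are both sums of squared residuals of the same fitted curve, evaluated on different points. A minimal helper makes this concrete; the names and the stand-in fitted curve below are the editor's illustration, not part of the exam.

```python
import numpy as np

def rss(y, g_hat, x):
    """Residual sum of squares of fitted curve g_hat evaluated at points (x, y)."""
    return float(np.sum((y - g_hat(x)) ** 2))

# Hypothetical example: a stand-in for a fitted smoother, plus a few points.
g_hat = lambda x: 2.0 * x
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 2.2, 3.9])
# rss(y, g_hat, x) -> 0.01 + 0.04 + 0.01 = 0.06 (up to floating-point error)
```

The same `rss` call computes RSS_train when given the training points and RSS_test when given the held-out points.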
Solve problems (3.i)-(3.ii):

3.i [5 points] As \lambda \to \infty, will \hat{g}_1 or \hat{g}_2 have the smaller training RSS, or is there no definite answer? Describe your answer in a few sentences.

3.ii [5 points] As \lambda \to \infty, will \hat{g}_1 or \hat{g}_2 have the smaller test RSS, or is there no definite answer? Describe your answer in a few sentences.

Problem 4: Optimization [10 points]

Consider the L2-penalized logistic regression model having p features. For simplicity, we exclude the intercept and do not standardize the features. The objective or cost function is

\[
Q(\beta) = Q(\beta_1, \beta_2, \cdots, \beta_p) = \frac{1}{n} L(\beta_1, \beta_2, \cdots, \beta_p) + \lambda \sum_{j=1}^{p} \beta_j^2,
\]

where L(\beta_1, \beta_2, \cdots, \beta_p) is the negative log-likelihood and \lambda > 0 is the tuning parameter. For fixed \lambda > 0, our goal is to estimate \beta = (\beta_1, \beta_2, \cdots, \beta_p)^T.

Solve the following problem: Derive the update step for Newton's method:

\[
\beta^{(t+1)} := \beta^{(t)} - [H_Q(\beta^{(t)})]^{-1} \cdot \nabla Q(\beta^{(t)})
\]

Note: students are welcome to use relevant results and simplifications directly from the lecture notes.

Problem 5: EM Algorithm and Clustering [10 points]

For fixed K > 0, let x_1, x_2, \ldots, x_n be iid cases from the mixture distribution

\[
\pi(x) = \sum_{k=1}^{K} c_k \, p(x \mid \mu_k),
\]

where \sum_k c_k = 1, c_k \ge 0, and p(x \mid \mu_k) is the exponential density

\[
p(x \mid \mu_k) = \frac{1}{\mu_k} \exp\left( -\frac{x}{\mu_k} \right), \quad x > 0.
\]

Here we are clustering the cases based on an exponential mixture model.

Solve the following problem: Write down the EM algorithm for estimating the parameters c_1, c_2, \ldots, c_K and \mu_1, \mu_2, \ldots, \mu_K. Note: this problem is made easier by exploiting properties of exponential families.

Problem 6: True or False [20 points]

For this section, please clearly circle either TRUE or FALSE for each question. In this case, please write down TRUE or FALSE on your submitted exam. Ambiguous choices will be marked incorrect. No additional work is required to justify each answer.
6.i In logistic regression, the negative log-likelihood Q(\beta) is a convex function of \beta: TRUE / FALSE

6.ii Consider a neural network with one hidden layer, sigmoid activation, and softmax output function. The total cost Q(\theta) (cross-entropy) is a convex function of \theta: TRUE / FALSE

6.iii Consider a Gaussian mixture model with negative log-likelihood

\[
Q(\theta) = -\sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} c_k \, p(x_i \mid \mu_k, \Sigma_k) \right),
\]

where \theta is the collection of means \mu_1, \mu_2, \ldots, \mu_K, covariances \Sigma_1, \Sigma_2, \ldots, \Sigma_K, and mixture parameters c_1, c_2, \ldots, c_K (see the EM/clustering lecture notes for details). The negative log-likelihood is a convex function of \theta: TRUE / FALSE

6.iv Let x, x' \in R^d and assume that k_1(x, x') and k_2(x, x') are valid kernels. The function k(x, x') = k_1(x, x') - k_2(x, x') is a valid kernel: TRUE / FALSE

6.v Let x, x' \in R^d. The function k(x, x') = 1 + x^T x' is a valid kernel: TRUE / FALSE

6.vi Consider L1-penalized linear or logistic regression (LASSO). As \lambda > 0 increases, the L1 penalty forces some coefficients \beta_j to be exactly 0, and the remaining non-zero coefficients are the same estimates produced by unpenalized regression: TRUE / FALSE

6.vii Consider L2-penalized linear or logistic regression (Ridge). Larger values of \lambda > 0 will typically decrease the bias: TRUE / FALSE

6.viii Consider linear regression (OLS) with p > 0 features. Larger values of p will typically decrease the bias: TRUE / FALSE

6.ix Consider k-NN classification or regression. Larger values of k > 0 will typically decrease the bias: TRUE / FALSE

6.x Consider a model with tuning parameter \alpha. The k-fold cross-validation error

\[
CV(\hat{f}, \alpha) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{(x,y) \in B_k} L\big(y, \hat{f}^{-k}(x, \alpha)\big)
\]

is an unbiased estimator of the true generalization error of \hat{f}: TRUE / FALSE
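As background for Problem 5's exponential mixture, the general shape of an EM iteration (E-step responsibilities, M-step weighted updates) can be sketched in NumPy. This is the editor's illustrative sketch of the generic algorithm, not a graded solution writeup; all names and the initialization scheme are the editor's assumptions.

```python
import numpy as np

def em_exponential_mixture(x, K, n_iter=200, seed=0):
    """EM for a K-component exponential mixture pi(x) = sum_k c_k * (1/mu_k) * exp(-x/mu_k)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    c = np.full(K, 1.0 / K)                                    # mixture weights, sum to 1
    mu = rng.uniform(0.5 * x.mean(), 1.5 * x.mean(), size=K)   # component means, positive
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] proportional to c_k * p(x_i | mu_k)
        dens = (1.0 / mu) * np.exp(-x[:, None] / mu)           # shape (n, K)
        gamma = c * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form updates (a benefit of the exponential family)
        Nk = gamma.sum(axis=0)                                 # effective counts per component
        c = Nk / n                                             # updated mixture weights
        mu = (gamma * x[:, None]).sum(axis=0) / Nk             # responsibility-weighted means
    return c, mu
```

One useful sanity check of the M-step: after each update, sum_k c_k * mu_k equals the sample mean of x exactly, since the responsibilities sum to one across components for every case.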
