- June 13, 2020

The University of Sydney

STAT5003 Week 13: Review and Final Exam
Presented by Dr. Justin Wishart

Exam format
– Two hour written exam
– 20 Multiple Choice questions
  – Questions can have one or two correct answers. You need to select the exact correct answer(s) to get a mark
– Some short answer questions
– Two longer answer questions

Topics covered
– Everything in the lectures/tutorials from Weeks 1 to 12 (except any topic that was marked as not examinable)
– Writing R code is not tested, but there could be questions on interpreting R outputs
– You should understand how the algorithms work and be able to sketch out the key steps in pseudo code

Methods we have learnt
– Regression
  – Multivariate linear regression
– Clustering
  – Hierarchical clustering
  – K-means clustering
– Classification
  – Logistic regression
  – LDA
  – KNN
  – SVM
  – Random Forest
  – Decision trees
  – Boosted trees (AdaBoost, XGBoost, GBM)

Multiple Regression
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$$
– Find coefficients to minimise the total sum of squares of the residuals

Local regression (smoothing)
A typical model in this case is
$$Y = f(X) + \epsilon$$
– The function f is some smooth function (differentiable).

Density estimation
– Maximum Likelihood approach
– Reformulate as $f(x_1, x_2, \dots, x_n \mid \theta)$: probability of observing $x_1, x_2, \dots, x_n$ given parameter(s) $\theta$
$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \;\rightarrow\; \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)$$

Kernel density estimation
– Smooths the data with a chosen hyperparameter (bandwidth) to estimate the density.
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

Hierarchical Clustering
– Bottom-up clustering approach.
– Each point is its own cluster
– Clustering tuned by merging close values

K-means algorithm
– 1. Data randomly allocated.
– 2. Centres computed.
– 3. Data matched to closest centre.
– 4. Repeat.
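
The four K-means steps above can be sketched in code. This is a minimal illustration in Python rather than R (writing code is not examinable, so treat it as pseudo code that happens to run). The function name `kmeans`, its parameters, the use of squared Euclidean distance, and the rule for re-seeding an empty cluster are assumptions made for this example and are not taken from the slides.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative K-means following the four steps on the slide."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Step 1: randomly allocate each data point to one of k clusters.
    labels = [rng.randrange(k) for _ in points]
    centres = []
    for _ in range(n_iter):
        # Step 2: compute each centre as the mean of the points in that cluster.
        centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded at a random data point.
                members = [rng.choice(points)]
            centres.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        # Step 3: match each point to its closest centre (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((p[d] - centres[c][d]) ** 2 for d in range(dim)),
            )
            for p in points
        ]
        # Step 4: repeat until the allocation stops changing.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centres
```

Calling `kmeans(points, k=2)` on a list of coordinate tuples returns one cluster label per point together with the final centres. Which cluster receives label 0 depends on the random initial allocation, so only the grouping is meaningful, not the label values.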

Principal Components Analysis (PCA)
– Find linear combinations of variables that maximise the variability.

PCA and t-SNE
[Figure: two panels, "PCA" and "t-SNE"]

Logistic Regression
Logistic regression model:
$$z = \log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
$$p(X) = \Pr(Y = 1 \mid X) = h(z) = \frac{1}{1 + e^{-z}}$$
[Figure: probability (0 to 1) plotted against X (-5 to 5)]

Linear Discriminant Analysis (LDA)
– $\pi_k$: Probability of coming from class k (prior probability)
– $f_k(x)$: Density function for X given that X is an observation from class k

Cross validation
– Fitting a model to the entire dataset can overfit the data and not perform well on new data
– Split data into training and test sets to alleviate this and find the right bias/variance trade-off.

Bootstrap
– Simulate related data (sampling with replacement) and examine statistical performance on all the re-sampled data.

Support Vector Machines (SVM)
– Find the best hyperplane or boundary to separate data into classes.
– Image taken from https://en.wikipedia.org/wiki/Support_vector_machine

Missing Data
– Remove missing data (complete cases)
– Single imputation
– Multiple imputation
– Expert knowledge of reasons for missing data.

Basic decision trees
– Partition space into rectangular regions that minimise outcome deviation.

Bagging trees and random forests
– Use bootstrap technique to create resampled trees and average the result.
$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
– Random forests do further sampling to improve the model.

Boosting
– Fit tree to residuals and learn slowly
– Slowly improve the fit in areas where the model doesn’t perform well.
– Some boosting algorithms discussed
  – AdaBoost
  – Stochastic gradient boosting
  – XGBoost

Feature Selection
– Filter selection via fold changes.
– Best subset selection.
– Forward selection.
– Backward selection.
– Choose model that minimises test error
  – Directly via test set
  – Indirectly via penalised criterion.

Ridge Regression and Lasso
– Constrained optimisation techniques that minimise the residual sum of squares subject to different constraints.
– Lasso has the extra benefit of feature selection as a free bonus.

Monte Carlo Methods
– Repeated simulation to estimate the full distribution and summary values.
– Exploits law of large numbers.
$$E[g(X)] = \int g(x) \cdot f(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} g(x_i)$$
– Can sample from f if the inverse of its distribution function F exists; then we can generate X as $X = F^{-1}(U)$, where U is uniform on (0, 1).
– Acceptance rejection method to handle more difficult distributions.

Markov Chain Monte Carlo
– Widely used in Bayesian modelling.
– Simulates a process (random variable that changes over time)
– Simulate new point based off the current point.
– Can estimate even more complex distributions than plain Monte Carlo methods can.

Methods and metrics to evaluate models
– Sensitivity and specificity
– Accuracy
– Residual sum of squares (for regression)
– ROC curves and AUC
– K-fold cross-validation

Example multiple choice question
Which of the following method(s) is/are unsupervised learning methods?
A. K-means clustering
B. Logistic regression
C. Random forest
D. Support vector machines

Example short answer question
a. Explain how the parameters are estimated in simple least squares regression.
b. Explain a scenario where simple linear regression is not appropriate.
c. Compute the predicted weight for a person who is 160 cm tall and compute the residual of the first person in the table below.

Sample   X: Height (cm)   Y: Weight (kg)
1        160              60
2        170.2            77
3        172              62

$$\hat{Y} = 50.412 + 0.0634 X$$

Example long answer question
– Describe the Markov Chain Monte Carlo procedure. You may use pseudo code as part of your answer.
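
As an illustration of the kind of pseudo code such an answer might contain, here is a minimal sketch of one common MCMC algorithm, random-walk Metropolis-Hastings, written in Python. The slides do not say which MCMC variant is expected, so the choice of algorithm, the Gaussian proposal, the function names `metropolis_hastings` and `log_std_normal`, and the standard normal target are all assumptions made for this example.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a one-dimensional target (sketch)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Simulate a new point based on the current point
        # (symmetric Gaussian proposal centred at x).
        proposal = x + rng.gauss(0.0, step)
        # Acceptance probability min(1, target(proposal) / target(x)),
        # computed on the log scale for numerical stability.
        accept_prob = math.exp(min(0.0, log_target(proposal) - log_target(x)))
        if rng.random() < accept_prob:
            x = proposal  # accept the move
        # On rejection the chain stays at x; the current state is recorded either way.
        samples.append(x)
    return samples


def log_std_normal(x):
    # Log-density of N(0, 1) up to an additive constant;
    # the constant cancels in the acceptance ratio.
    return -0.5 * x * x
```

For example, `metropolis_hastings(log_std_normal, x0=0.0, n_samples=20000)` returns a chain whose sample mean and variance should be close to 0 and 1. Because the proposal is symmetric, the proposal densities cancel and only the ratio of target densities is needed, which is why the target only has to be known up to a normalising constant.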