- July 20, 2020
UNSW SYDNEY
SCHOOL OF MATHEMATICS AND STATISTICS
Midterm test 2020
MATH3821 Statistical Modelling and Computing

(1) TIME ALLOWED – 2 HOURS
(2) TOTAL NUMBER OF QUESTIONS – 1
(3) ANSWER ALL QUESTIONS
(4) THE QUESTIONS ARE NOT OF EQUAL VALUE
(5) THIS PAPER MAY BE RETAINED BY THE CANDIDATE

Instructions:
• Download and open (click on) the file mid-2020.Rmd.
• Fill in (between the " ") your familyname, othername and studentnumber (top of the file).
• Click on Knit. This should create and open the resulting PDF file.
• Save your work (the Rmd file) very regularly.
• Click on Knit each time you have completed a chunk, and check the output in the PDF file.
• Submit your PDF and Rmd files via the submission link on Moodle prior to the deadline.

1. [33 marks] The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15 and 64 from three rural areas in South Africa (Rousseauw et al., 1983). The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease. There are nine covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. We will use the data available in the file coris.txt.

a) [1 mark] Read the data file coris.txt into a dataframe called coris.df using the read.table() function. You will need to use the argument sep = ",". Then use the str() function to gain some understanding of the data set.

b) [2 marks] Find the proportion of males (prop.chd) in the study that have coronary heart disease. Find the odds (odds) of coronary heart disease. Find the log odds (logodds) of coronary heart disease.

c) [4 marks] Men with a family history of coronary heart disease are more likely to have coronary heart disease than those without one.
Estimate the proportions with coronary heart disease among those with a family history (prop.chd.famhist) and among those without a family history (prop.chd.oth). Estimate the odds ratio directly from the variables prop.chd.famhist and prop.chd.oth. Find this same value using the glm() function. Test its significance (what is the p-value?).

d) [3 marks] Fit an appropriate regression model including an intercept term, with the presence of coronary heart disease (Y) as the response. We will use the predictors in the following order: systolic blood pressure (sbp), tobacco (tobacco), age (age), obesity (obesity), alcohol (alcohol) and family history of coronary heart disease (famhist). Do not forget to encode categorical or binary predictors as factors. Produce output that shows which explanatory variables have a significant effect at the five percent level and comment on the results.

e) [2 marks] Are you surprised by the fact that systolic blood pressure is not significant, or by the minus sign for the obesity and alcohol coefficients? Explain why or why not.

f) [2 marks] Compute and interpret carefully the odds ratio for family history of coronary heart disease (famhist) based on the regression model in part (d).

g) [4 marks] Test the significance of famhist using a deviance approach based on the regression model in part (d). You will need to provide the decrease in deviance (famhist.deviance) when the variable famhist is removed from the model. What is the associated p-value, as output by the R function you used? You will also use the pchisq() function on the famhist.deviance variable to confirm this finding. What is your conclusion based on the reduction in deviance?

h) [2 marks] For each individual, predict the probability that they will NOT have coronary heart disease and compute the average of these values. Compare this average value with the observed proportion.
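
As an illustration of the quantities named in parts (c), (f) and (g), the following is a minimal sketch on simulated data, not on coris.txt. The variable names (exposed, x, or.direct, or.glm, dev.drop) and the simulated effect sizes are hypothetical and chosen only for this example. It shows an odds ratio computed directly from two proportions, the same value recovered from glm(), and a deviance comparison checked with pchisq().

```r
# Simulated data (hypothetical; NOT the CORIS data)
set.seed(42)
n <- 500
exposed <- factor(rbinom(n, 1, 0.4), labels = c("no", "yes"))  # binary predictor
x <- rnorm(n)                                                   # continuous predictor
y <- rbinom(n, 1, plogis(-1 + 0.8 * (exposed == "yes") + 0.5 * x))

# Odds ratio computed directly from the two group proportions
p1 <- mean(y[exposed == "yes"])
p0 <- mean(y[exposed == "no"])
or.direct <- (p1 / (1 - p1)) / (p0 / (1 - p0))

# Same odds ratio from a logistic regression with the binary predictor only
fit0 <- glm(y ~ exposed, family = binomial)
or.glm <- exp(coef(fit0)[["exposedyes"]])

# Deviance test: drop the binary predictor from a larger model
full <- glm(y ~ x + exposed, family = binomial)
reduced <- update(full, . ~ . - exposed)
dev.drop <- deviance(reduced) - deviance(full)           # decrease in deviance
p.value <- pchisq(dev.drop, df = 1, lower.tail = FALSE)  # chi-squared, 1 df

# anova() reports the same deviance difference and p-value
print(anova(reduced, full, test = "Chisq"))
```

With a single binary predictor the logistic model is saturated in the two groups, which is why exp() of the fitted coefficient agrees with the directly computed odds ratio.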
i) [2 marks] Suppose we are interested in predicting a male's systolic blood pressure (sbp) based on the individual's obesity (obesity) level. Estimate the regression function r(·) by cubic smoothing spline regression. Let's call this estimate $\hat{r}$. You will use the value 0.1 for the lambda argument. You will store the results of your estimation in a variable called res.smooth. Display the content of res.smooth.

j) [1 mark] Produce a scatterplot of sbp against obesity and then add the smoother to the scatterplot.

k) [5 marks] Create and then plot the Generalised Cross-Validation score GCV versus lambda, for values $\lambda = 0.008 + i \times 0.000001$, $i = 0, \ldots, 1000$. Note that the formula for GCV is given by

$$\mathrm{GCV} = \frac{\mathrm{RSS}}{n\,\bigl(1 - \operatorname{tr}(S_\lambda)/n\bigr)^2},$$

where $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $S_\lambda$ is the smoothing matrix with $\operatorname{tr}(S_\lambda) = \mathrm{df}$.

l) [1 mark] What value of lambda do you recommend choosing now, instead of the one used in (i)?

m) [4 marks] Compute the density estimate for the variable sbp. Produce a variability plot with a 94% confidence interval and add it to the plot. For the variability plot, generate 1000 bootstrap resamples and evaluate the density function at 100 equally spaced points over the range of the variable sbp. Label the plot appropriately.
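
To make the GCV formula in part (k) concrete, here is a minimal sketch on simulated data, not on coris.txt. The helper gcv_score(), the simulated x and y, and the lambda grid are all hypothetical and chosen only for illustration; the grid is not the one specified in the question. The sketch uses smooth.spline() and takes tr(S_lambda) from the df component of the fitted object.

```r
# Simulated data (hypothetical; NOT the CORIS data)
set.seed(1)
n <- 100
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)

# GCV score for one lambda: GCV = RSS / (n * (1 - tr(S_lambda)/n)^2)
gcv_score <- function(lambda, x, y) {
  fit  <- smooth.spline(x, y, lambda = lambda)
  yhat <- predict(fit, x)$y          # fitted values at the observed x
  rss  <- sum((y - yhat)^2)          # residual sum of squares
  n    <- length(y)
  rss / (n * (1 - fit$df / n)^2)     # fit$df plays the role of tr(S_lambda)
}

# Illustrative grid of smoothing parameters (an assumption of this sketch)
lambdas <- 10^seq(-6, 0, length.out = 25)
gcv <- sapply(lambdas, gcv_score, x = x, y = y)
lambda.best <- lambdas[which.min(gcv)]

# GCV curve with the minimiser marked
plot(log10(lambdas), gcv, type = "b",
     xlab = "log10(lambda)", ylab = "GCV")
abline(v = log10(lambda.best), lty = 2)
```

A log-spaced grid is used here only because the simulated example has no prior guess for lambda; a fine linear grid around a known region works the same way.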