Tutoring case: MAST90084 Assignment 1

MAST90084: Statistical Modelling
Assignment 1, Semester 1, 2020
Due: 11 PM, Wednesday May 6.

DO NOT FORGET TO COMPLETE THE PLAGIARISM DECLARATION ON THE SUBJECT'S LMS BEFORE SUBMITTING YOUR FIRST ASSIGNMENT.

1. Data in the following $2 \times 2 \times 3$ table were used to study the effect of passive smoking on lung cancer. The table summarizes the results of case-control studies from 3 countries for nonsmoking women married to smokers. (Source: Blot and Fraumeni, J. Nat. Cancer Inst., 77:993-1000 (1986) and Agresti (1996).) [15]

    Country   Spouse Smoked   Cases   Controls
    Japan     No                 21         82
              Yes                73        188
    UK        No                  5         16
              Yes                19         38
    USA       No                 71        249
              Yes               137        363

(a) A log-linear model mod1 can be fitted to the data, with the results given in the following R output. Give the mathematical formula (of the form $\ln(\lambda) = \cdots$) for model mod1. Explain why this model is called a homogeneous association model.

    > pasSmoking.dat=data.frame(freq=c(21,73,5,19,71,137,82,188,16,38,249,363))
    > pasSmoking.dat$Cnt=factor(rep(c("Japan","UK","USA"), times=2, each=2))
    > pasSmoking.dat$Smo=factor(rep(c("No","Yes"), times=6))
    > pasSmoking.dat$Can=factor(rep(c("Case","Control"), each=6))
    > pasSmoking.dat
       freq   Cnt Smo     Can
    1    21 Japan  No    Case
    2    73 Japan Yes    Case
    3     5    UK  No    Case
    4    19    UK Yes    Case
    5    71   USA  No    Case
    6   137   USA Yes    Case
    7    82 Japan  No Control
    8   188 Japan Yes Control
    9    16    UK  No Control
    10   38    UK Yes Control
    11  249   USA  No Control
    12  363   USA Yes Control
    > mod1=glm(freq~Cnt+Smo+Can+Cnt:Smo+Cnt:Can+Smo:Can, family=poisson, data=pasSmoking.dat)
    > anova(mod1, test="Chisq")
    Analysis of Deviance Table
    Model: poisson; Link: log; Response: freq
    Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
    NULL                       11    1168.85
    Cnt      2   726.43         9     442.42  < 2.2e-16
    Smo      1   112.52         8     329.90  < 2.2e-16
    Can      1   307.56         7      22.34  < 2.2e-16
    Cnt:Smo  2    15.50         5       6.84  0.0004316
    Cnt:Can  2     1.05         3       5.80  0.5919109
    Smo:Can  1     5.56         2       0.24  0.0184215
    > 1-pchisq(0.24,2)
    [1] 0.8869204
    > 1-pchisq(5.80,3)
    [1] 0.1217566

(b) Based on mod1, test the significance of the interaction effect Smo:Can, eliminating the effects of all other terms in the model. Comment on the implication of your result.

(c) Test the adequacy of the model Cnt+Smo+Can+Cnt:Smo+Cnt:Can at significance level 0.05, using the R output in (a). Comment on the implication of your result and how it relates to the result of (b).

2. This question refers to the quasi-likelihood method for GLMs given in the lecture notes. Show that the following results hold. [15]

(a) Based on the definition of quasi-likelihood, the quasi-score function is given by

$$s(\beta) = \sum_{i=1}^{n} z_i D_i(\beta) [\sigma_i^2(\beta)]^{-1} [y_i - \mu_i(\beta)] = Z^T D(\beta) \Sigma^{-1}(\beta) [y - \mu(\beta)].$$

(b) The expected quasi-information is

$$F(\beta) = \sum_{i=1}^{n} z_i z_i^T w_i(\beta) = Z^T W(\beta) Z.$$

(c) The variance matrix of $s(\beta)$ is

$$V(\beta) = \mathrm{Cov}(s(\beta)) = \sum_{i=1}^{n} z_i z_i^T D_i^2(\beta) \cdot \frac{\sigma_{0i}^2}{\sigma_i^4(\beta)}.$$

3. Let $y_i = (y_{i1}, \cdots, y_{iq})^T$ be a $q \times 1$ random vector following a probability distribution from the multi-parameter exponential family. Namely, the pdf of $y_i$ is

$$f(y_i \mid \theta_i, \phi, w_i) = \exp\left\{ \frac{y_i^T \theta_i - b(\theta_i)}{\phi}\, w_i + c(y_i, \phi, w_i) \right\},$$

where $\theta_i = (\theta_{i1}, \cdots, \theta_{iq})^T$ is a $q \times 1$ natural parameter vector, $\phi$ is a dispersion parameter and $w_i$ is a weight. It is known that

$$E\left[ \frac{\partial \ln f}{\partial \theta_{ij}} \right] = 0, \quad j = 1, \cdots, q;$$

and

$$E\left[ \frac{\partial^2 \ln f}{\partial \theta_{ij} \partial \theta_{ij'}} \right] + E\left[ \left( \frac{\partial \ln f}{\partial \theta_{ij}} \right) \cdot \left( \frac{\partial \ln f}{\partial \theta_{ij'}} \right) \right] = 0, \quad j, j' = 1, \cdots, q.$$

Using these properties, show that

$$E(y_{ij}) = \frac{\partial b(\theta_i)}{\partial \theta_{ij}}, \quad \mathrm{Var}(y_{ij}) = \frac{\phi}{w_i} \cdot \frac{\partial^2 b(\theta_i)}{\partial \theta_{ij}^2} \quad \text{and} \quad \mathrm{Cov}(y_{ij}, y_{ij'}) = \frac{\phi}{w_i} \cdot \frac{\partial^2 b(\theta_i)}{\partial \theta_{ij} \partial \theta_{ij'}}, \quad j, j' = 1, \cdots, q.$$

[15]

4. Let $Y$ be a response variable having $k$ nominal categories. Let $U_1, \cdots, U_k$ be $k$ independent latent utility variables satisfying

$$U_r = u_r + \varepsilon_r, \quad r = 1, \cdots, k,$$

with the $u_r$ being fixed and the $\varepsilon_r$ being i.i.d. with cdf $F$ and pdf $f = F'$. Following the principle of maximum random utility, it has been shown that $Y = r$ if and only if $U_r = \max\{U_1, \cdots, U_k\}$, $r = 1, \cdots, k$. Moreover, it has been shown that

$$P(Y = r) = \int_{-\infty}^{\infty} \prod_{s \neq r} F(u_r - u_s + \varepsilon) f(\varepsilon)\, d\varepsilon, \quad r = 1, \cdots, k.$$

Using these results, find a closed-form expression for $P(Y = r)$ if the extreme maximal-value cdf $F(x) = \exp(-\exp(-x))$ is chosen. [15]

5. For a response variable $Y$ having $k$ ordered categories, the cumulative model (based on a given explanatory variable $x$, thresholds $\theta_1, \cdots, \theta_q$ and a cdf $F$) is specified as

$$P(Y \le r \mid x) = F(\theta_r + x^T \gamma).$$

Find the link function for this model when $F$ is chosen as the extreme maximal-value cdf. [15]

6. You need to install the R package faraway to do this question. The hsb data was collected from the High School and Beyond Study. Type help(hsb) to see the description of the dataset. We want to see how the relevant variables in the data are related to the choice of the type of program (academic, vocational, or general) that the students pursue in high school. The response is multinomial with three levels. [25]

(a) Fit a trinomial response model with the other variables as predictors (untransformed).

(b) For the student with id 99, compute the predicted probabilities of the three possible choices.

Total marks = 100
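For readers who want to reproduce the R output quoted in Question 1, the following is a minimal sketch, assuming only base R (the stats package), that re-enters the data using plain ASCII quotes so it can be pasted into an R session. The names tab, p_df2 and p_df3 are introduced here for illustration and do not appear in the assignment. It cross-tabulates the counts for comparison with the printed table, refits mod1, and recomputes the two chi-squared tail probabilities; it does not answer any part of the question.

```r
# Rebuild the Question 1 data frame (same values and factor layout as the listing).
pasSmoking.dat <- data.frame(
  freq = c(21, 73, 5, 19, 71, 137, 82, 188, 16, 38, 249, 363)
)
pasSmoking.dat$Cnt <- factor(rep(c("Japan", "UK", "USA"), times = 2, each = 2))
pasSmoking.dat$Smo <- factor(rep(c("No", "Yes"), times = 6))
pasSmoking.dat$Can <- factor(rep(c("Case", "Control"), each = 6))

# Cross-tabulate the counts so the data entry can be compared with the table.
# 'tab' is an illustrative name, not part of the assignment.
tab <- xtabs(freq ~ Smo + Can + Cnt, data = pasSmoking.dat)
print(ftable(tab, row.vars = c("Cnt", "Smo"), col.vars = "Can"))

# Refit mod1 and print the sequential analysis of deviance table.
mod1 <- glm(freq ~ Cnt + Smo + Can + Cnt:Smo + Cnt:Can + Smo:Can,
            family = poisson, data = pasSmoking.dat)
print(anova(mod1, test = "Chisq"))

# Upper-tail chi-squared probabilities; equivalent to 1 - pchisq(x, df).
p_df2 <- pchisq(0.24, df = 2, lower.tail = FALSE)
p_df3 <- pchisq(5.80, df = 3, lower.tail = FALSE)
print(c(p_df2, p_df3))
```

Using lower.tail = FALSE instead of 1 - pchisq(...) is a stylistic choice that avoids loss of precision for very small tail probabilities; for the values here both forms agree.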
