Final Exam
Stat 131A, Spring 2020

NAME:

Directions – PLEASE READ CAREFULLY

• The exam is due on Gradescope Thursday night at 11:59pm.
• You must show your work (where appropriate).
• You may refer to your notes or online notes from this class. (It is technically allowed to read other materials online, but I very much doubt that will help you at all.)
• You may NOT discuss the exam questions with anyone else, whether they are in the class or out of the class. You may not post questions on message boards relating to problems on this exam. Any violations of this policy will result in a zero for the exam.
• If you have any clarification questions, please send an email to Will and George, NOT Piazza.
• DO NOT post on Piazza about exam questions.

1. Answer the following TRUE/FALSE questions. Briefly explain your reasoning.

(a) (5 points) Because a regression coefficient measures the effect of a predictor variable on the response variable when we hold everything else constant, we can interpret the coefficient as a causal effect of that predictor on the response.

(b) (5 points) Suppose we have a data set with p variables, x^(1) up to x^(p), and we calculate the loadings a_1, ..., a_p for the first principal component. Then, suppose we center and scale each of the variables to get new y variables with mean 0 and standard deviation 1. That is, the value of variable y^(j) for the ith observation is

    y_i^(j) = ( x_i^(j) − mean(x^(j)) ) / sd(x^(j))

True or false: If we calculate new first principal component loadings b_1, ..., b_p on the new data set with the y variables, then the new loadings might be different from the old loadings. (An R sketch of this standardization appears after this question.)

(c) (5 points) If somehow we knew the true values of β in a linear regression, then those values would give us an even smaller RSS than the estimated coefficients β̂ that we get from lm.

(d) (5 points) Suppose we fit a regression of response y on variables x^(1) and x^(2), using the lm function. We get β̂_2 = 5 with a p-value in the regression table equal to 0.04. But after we add a third variable x^(3) to the regression, the estimate changes to β̂_2 = −15, and the p-value in the new regression table is now 10^−12. True or false: This means that the first estimate almost certainly had the wrong sign, and the p-value must have been small the first time just by chance.

(e) (5 points) In part (d), it is impossible for the first model (with predictor variables x^(1) and x^(2)) to have a strictly lower RSS than the second model (with predictor variables x^(1), x^(2), and x^(3)).

(f) (5 points) In agglomerative hierarchical clustering, if we use the "single linkage" method to measure distances between clusters, then we will sometimes see a "rich get richer" effect in intermediate steps of the algorithm, where there are a few very large clusters and many much smaller clusters.
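The standardization in part (b) is exactly what R's scale() or prcomp(..., scale. = TRUE) performs, so the setup is easy to experiment with. A minimal sketch (the example matrix X below is hypothetical, not exam data):

# Sketch: first-PC loadings before and after centering and scaling.
# X is a hypothetical numeric data set with p = 4 variables.
X <- as.data.frame(matrix(rnorm(200), ncol = 4))

pc_raw    <- prcomp(X, center = TRUE, scale. = FALSE)
pc_scaled <- prcomp(X, center = TRUE, scale. = TRUE)  # applies the y_i^(j) formula

pc_raw$rotation[, 1]     # loadings a_1, ..., a_p on the original variables
pc_scaled$rotation[, 1]  # loadings b_1, ..., b_p on the standardized variables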
2. Bar plot for two categorical variables

The bar plot below shows the joint distribution of two categorical variables measured in the Wellbeing survey that we discussed in class. The participants in the survey were asked to report their General Satisfaction and their Job Satisfaction, and their responses were recorded. There are four levels of Job Satisfaction (Very satisfied, Moderately satisfied, A little dissatisfied, and Very dissatisfied), and three levels of General Satisfaction (Very happy, Pretty happy, and Not too happy).

[Figure: grouped bar plot of Frequency (y-axis, 0 to 2500) for the four Job Satisfaction levels, with separate bars for each General Satisfaction level (Very happy, Pretty happy, Not too happy) within each Job Satisfaction group.]

Note: to make the question simpler, I have removed categories like "Don't know" and "Not applicable" from the data set; you should answer the question as if these categories never existed and the data shown here represent the entire data set.

Answer the following questions and explain how you can tell. You can assume the data set is large enough so that the observed probabilities in the figure are very close to the probabilities in the overall population.

(a) (5 points) Overall, what is the most common response for General Satisfaction?

(b) (5 points) Which is higher: (A) the conditional probability of being Pretty happy in general given that someone is Very satisfied with their job, or (B) the conditional probability of being Pretty happy in general given that someone is Moderately satisfied with their job?

(c) (5 points) Given that someone is Not too happy in general, which response were they most likely to give about their Job Satisfaction?

3. Hierarchical clustering

The plot below shows a data set with p = 2 variables and n = 7 data points. The seven data points are labeled A, B, C, D, E, F, G.

[Figure: scatter plot of the seven labeled points A–G in the (x^(1), x^(2)) plane.]

The Euclidean distances between pairs of points are shown in the matrix below. For example, the value corresponding to row "C" and column "D" gives the Euclidean distance between the points C and D.

##     A   B   C   D   E   F   G
## A 0.0 0.2 0.9 2.3 2.1 2.6 4.1
## B 0.2 0.0 0.7 2.1 2.1 2.5 3.9
## C 0.9 0.7 0.0 1.4 1.7 2.2 3.2
## D 2.3 2.1 1.4 0.0 2.0 2.3 1.8
## E 2.1 2.1 1.7 2.0 0.0 0.4 3.2
## F 2.6 2.5 2.2 2.3 0.4 0.0 3.3
## G 4.1 3.9 3.2 1.8 3.2 3.3 0.0

For the following questions, assume (except where expressly stated otherwise) that we are using agglomerative hierarchical clustering with Euclidean distances and complete linkage.

(a) (10 points) What two "clusters" will the algorithm join in each of the first four steps, and what will the algorithm use as the distance between each pair of clusters? (You only need to give the distance between the two "clusters" that were joined in each step.)

(b) (10 points) Would any part of your answer to part (a) be different if we were using single linkage clustering instead?

(c) (5 points) The dendrogram for the complete linkage clustering is shown below, with the seven "leaves" (dangling ends) of the tree unlabeled:

[Figure: cluster dendrogram produced by hclust (*, "complete"); the vertical axis "Height" runs from 0 to 4; the seven leaves are unlabeled.]

Write the letters A to G below the leaves to give a valid labeling that corresponds to the complete linkage clustering (note there is more than one possible labeling; you only need to give one of them).
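Question 3 can be reproduced directly with R's built-in clustering tools. A minimal sketch, assuming the 7×7 matrix of pairwise distances shown above is stored in a matrix d with row and column names A through G (the name d is an assumption, not from the exam):

# Sketch: agglomerative hierarchical clustering on the distance matrix above.
hc_complete <- hclust(as.dist(d), method = "complete")
hc_single   <- hclust(as.dist(d), method = "single")

hc_complete$merge   # which two clusters were joined at each step
hc_complete$height  # the between-cluster distance used at each join
plot(hc_complete)   # a dendrogram like the one shown in part (c)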
4. Consider the Ames Housing Dataset that was introduced in class. This dataset contains information on sales of houses in Ames, Iowa. The original dataset has been made smaller to make the analysis easier. The variables in the dataset are:

• Lot.Area: Lot Size (Land Area) in square feet.
• Total.Bsmt.SF: Total square feet of basement area.
• Gr.Liv.Area: Total living area in square feet.
• Garage.Cars: Size of garage in terms of car capacity.
• Fireplaces.YN: a yes (Y) or no (N) variable that indicates whether the house has a fireplace or not.
• Year.Built: the year in which the house was built.
• SalePrice: the sale price of the house in dollars.

There are n = 1314 observations in the dataset, and some observations are listed below:

head(ames)
##   Lot.Area Total.Bsmt.SF Gr.Liv.Area Garage.Cars Fireplaces.YN Year.Built SalePrice
## 1    11622           882         896           1             N       1961    105000
## 2    14267          1329        1329           1             N       1958    172000
## 3     4920          1338        1338           2             N       2001    213500
## 4     5005          1280        1280           2             N       1992    191500
## 5     7980          1168        1187           2             N       1992    185000
## 6     8402           789        1465           2             Y       1998    180400

(a) (5 points) I fit a regression equation with SalePrice as the response variable in terms of all the other variables (note that Fireplaces.YN is treated as a factor). This gave me the following:

m1 <- lm(SalePrice ~ ., data = ames)
m1
##
## Call:
## lm(formula = SalePrice ~ ., data = ames)
##
## Coefficients:
##   (Intercept)       Lot.Area  Total.Bsmt.SF    Gr.Liv.Area
##    -9.795e+05      8.489e-01      3.230e+01      5.130e+01
##   Garage.Cars  Fireplaces.YNY     Year.Built
##     9.694e+03      8.768e+03      5.125e+02

What does this regression equation say about the change in our prediction for SalePrice for a 100 square ft. increase in Gr.Liv.Area, provided the other variables remain unchanged?
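For part (a) and the prediction questions that follow, the relevant mechanics in R are coef() and predict(). A minimal sketch, assuming m1 has been fit as above (the house below is made up for illustration; it is not the house in part (b)):

# Sketch: reading a slope out of m1 and applying the fitted equation.
coef(m1)["Gr.Liv.Area"]  # change in predicted SalePrice per 1 sq. ft. of living area

example_house <- data.frame(Lot.Area = 9000, Total.Bsmt.SF = 1200,
                            Gr.Liv.Area = 1500, Garage.Cars = 2,
                            Fireplaces.YN = "N", Year.Built = 1985)
predict(m1, newdata = example_house)  # predicted SalePrice for this house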
(b) (5 points) According to the regression equation m1, find the predicted SalePrice for a house with a lot area of 10000 square feet, total basement square footage of 1000 square feet, 1000 square feet of total living area, having a fireplace and a garage that can hold two cars, and which was built in the year 2000.

(c) (5 points) My friend, who has a lot of experience with the Ames real estate market, suggests to me that I should add an interaction between Fireplaces.YN and Year.Built, because she thinks that fireplaces used to be more common and now they are a luxury. I fit a new regression equation with an interaction term added:

m2 <- lm(SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames)
m2
##
## Call:
## lm(formula = SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames)
##
## Coefficients:
##               (Intercept)                   Lot.Area
##                -9.009e+05                  8.571e-01
##             Total.Bsmt.SF                Gr.Liv.Area
##                 3.248e+01                  5.078e+01
##               Garage.Cars             Fireplaces.YNY
##                 9.579e+03                 -2.351e+05
##                Year.Built  Fireplaces.YNY:Year.Built
##                 4.727e+02                  1.240e+02

The estimate of the interaction term is 124. What is the interpretation of that number in the regression equation?

(d) (5 points) For the house in part (b), assume there is a newer house which is exactly the same except that it was built 5 years later, in 2005. Compared to the older house, how much more would we predict the newer house to sell for, if we use model m2 to make our predictions?

(e) (5 points) Assume that we make the usual assumptions to justify parametric inference in regression. Give a parametric 95% confidence interval for the coefficient of the interaction, based on the summary table below (you may use 2 for the t_{α/2} value). Based on the result, can we feel confident that the interaction term is helping us make better predictions?

summary(m2)
##
## Call:
## lm(formula = SalePrice ~ . + Year.Built:Fireplaces.YN, data = ames)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -87632 -11006     16  10112  94784
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -9.009e+05  4.994e+04 -18.039  < 2e-16 ***
## Lot.Area                   8.571e-01  1.428e-01   6.004 2.49e-09 ***
## Total.Bsmt.SF              3.248e+01  2.154e+00  15.081  < 2e-16 ***
## Gr.Liv.Area                5.078e+01  2.643e+00  19.216  < 2e-16 ***
## Garage.Cars                9.579e+03  9.098e+02  10.529  < 2e-16 ***
## Fireplaces.YNY            -2.351e+05  7.959e+04  -2.954  0.00320 **
## Year.Built                 4.727e+02  2.575e+01  18.359  < 2e-16 ***
## Fireplaces.YNY:Year.Built  1.240e+02  4.048e+01   3.064  0.00223 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18300 on 1306 degrees of freedom
## Multiple R-squared:  0.7459, Adjusted R-squared:  0.7445
## F-statistic: 547.6 on 7 and 1306 DF,  p-value: < 2.2e-16

(f) (5 points) My friend gets very excited and wants to put in more interaction terms with Fireplaces.YN. After adding interactions with two more variables, we get the model:

m3 <- lm(SalePrice ~ . + Year.Built:Fireplaces.YN + ...
         + Fireplaces.YN:Gr.Liv.Area, data = ames)

She points out that the R² value has improved, so the new model is a better fit:

summary(m3)$r.squared
## [1] 0.7459505
summary(m2)$r.squared
## [1] 0.7458855

She says this means the new interaction terms must be making the model better, so we should keep them in. Do you agree?
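The contrast in part (f) between in-sample fit and model quality is often examined with a nested-model F test or adjusted R². A minimal sketch, assuming m2 and m3 have been fit (m3's full call is truncated above):

# Sketch: comparing the nested models m2 and m3.
anova(m2, m3)               # F test: do the extra interactions reduce RSS
                            # by more than chance alone would?
summary(m2)$adj.r.squared   # adjusted R^2 penalizes added terms, unlike the
summary(m3)$adj.r.squared   # raw R^2, which can never decrease as terms are added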
(g) (5 points) My friend wants to find more support for her theory that we need some more interactions with Fireplaces.YN, so she looks for the best model with Year.Built:Fireplaces.YN plus two additional interactions. That is, she always keeps the interaction with Year.Built in the model, and tries out every single way to add two more interactions with Fireplaces.YN, trying to make the R² as big as possible. She finds that the best two variables to interact with Fireplaces.YN are Lot.Area and Total.Bsmt.SF, and she wants to show that this model makes better out-of-sample predictions. She uses cross-validation to compare the model she chose (with three interactions) to the model m2 (with only one interaction), and she finds that her model has slightly better prediction error, as measured by cross-validation. Does this show her model is better?

(h) (5 points, extra credit) I now fit a regression equation to the same dataset with the logarithm of SalePrice as the response variable (I left the explanatory variables unchanged). This gave me the following:

m4 <- lm(log(SalePrice) ~ ., data = ames)
m4
##
## Call:
## lm(formula = log(SalePrice) ~ ., data = ames)
##
## Coefficients:
##   (Intercept)       Lot.Area  Total.Bsmt.SF    Gr.Liv.Area
##     3.641e+00      7.349e-06      2.146e-04      3.842e-04
##   Garage.Cars  Fireplaces.YNY     Year.Built
##     7.578e-02      5.699e-02      3.739e-03

Assume that our predicted SalePrice for a certain house with no fireplace, using model m4, is $150,000. What would the predicted SalePrice become if we add a fireplace to the house and leave all of the other variables unchanged?

(i) (5 points) We decide to investigate a bit more how fireplaces relate to some of the other variables in the data set, so we create a new binary outcome variable called Fireplaces.10, which is 1 if Fireplaces.YN is Y and 0 otherwise. We estimate a logistic regression using the glm function in R, and we get the following model:

ames$Fireplaces.10 <- ifelse(ames$Fireplaces.YN == "Y", 1, 0)
m5 <- glm(Fireplaces.10 ~ Year.Built + SalePrice, data = ames, family = binomial)
m5
##
## Call:  glm(formula = Fireplaces.10 ~ Year.Built + SalePrice, family = binomial,
##     data = ames)
##
## Coefficients:
## (Intercept)   Year.Built    SalePrice
##   2.942e+01   -1.784e-02    3.535e-05
##
## Degrees of Freedom: 1313 Total (i.e. Null);  1311 Residual
## Null Deviance:       1729
## Residual Deviance: 1493   AIC: 1499

According to model m5, how likely is it that a house built in 1950, whose current SalePrice is $100,000, would have a fireplace? What about a house built in 2000? Give your answers as probabilities.
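For part (i), recall how a logistic regression's linear predictor maps to a probability. A minimal sketch, assuming m5 has been fit as above (the helper function fireplace_prob is hypothetical, written here for illustration):

# Sketch: converting m5's fitted log-odds into a probability.
# log-odds:    eta = b0 + b1 * Year.Built + b2 * SalePrice
# probability: inverse logit, 1 / (1 + exp(-eta))
fireplace_prob <- function(year, price) {
  eta <- sum(coef(m5) * c(1, year, price))
  1 / (1 + exp(-eta))  # equivalently plogis(eta)
}

# The same computation via predict(); type = "response" returns probabilities:
# predict(m5, newdata = data.frame(Year.Built = year, SalePrice = price),
#         type = "response")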
