- September 16, 2020
BUSINESS DATA MINING (IDS 572)
HOMEWORK 1
DUE DATE: WEDNESDAY, SEPTEMBER 16 AT 3:30 PM

• Please provide succinct answers to the questions below.
• Submit an electronic PDF or Word file on Blackboard.
• Please include the names of all team members in your write-up and in the name of the file.
• One submission is sufficient for the entire group.
• Include all the R functions you use in your PDF file.
• Please make sure your graphs have titles, labels, legends (if necessary), appropriate colors, etc.

Problem 1. This question should be done without using R. We have just put out a special promotion and would like to determine who responded to the mailing. We have a sample of consumers, including both purchasers and non-purchasers (all received the promotion), and would like to predict who is a purchaser. For each consumer, we have their age (bucketed into ranges), their income, which is a numerical variable, and whether or not they responded to last year's mailing.

Purchase?   Age      Income   Last Year?
Yes         Young    60K      No
Yes         Middle   60K      Yes
Yes         Old      100K     Yes
Yes         Young    60K      Yes
Yes         Young    100K     Yes
Yes         Middle   60K      No
No          Old      150K     Yes
No          Middle   100K     No
No          Young    150K     Yes
No          Middle   100K     Yes
No          Old      150K     No

Please show your calculations in the following questions.
(a) Using the 1-rule method discussed in class, find the relevant sets of classification rules for the target variable (Purchase?) by testing each of the input attributes. Which of these sets of rules has the lowest misclassification rate?
(b) Now, we want to construct a decision tree using this data set. What is the entropy measure of the entire data set?
(c) What is the information gain for splitting on Age? On Income? On Last Year? Which should be the initial split variable?
(d) Construct the entire tree using the information gain. What is the decision at each terminal node?
(e) What is the accuracy of your decision tree on this data set?
Explain your answer.
(f) What would your tree predict for a Middle aged person with 90K income who did not purchase last year? Justify your answer. Consider the following instances as test data points.

Purchase?   Age      Income   Last Year?
Yes         Middle   60K      No
No          Young    90K      Yes
Yes         Old      100K     No
Yes         Young    60K      Yes
Yes         Middle   140K     No

What is the accuracy of your decision tree model on the test data? Justify your answer.
(g) What are the support and confidence of the following rules?
– If Last Year = Yes, then Yes.
– If Age = Middle, and Income = 50K, then Yes.
(h) Based on your decision tree, which variable is considered the most important variable? Justify your answer.

The goal of the remaining questions is to review statistical programming in R and get comfortable with coding in R, especially for those of you who did not learn R in IDS 570. The questions ask for simple tasks, but try your best to code them as elegantly as you can. For example, you can use tibbles (from the tibble package) instead of data frames for nicer-looking output. Pay attention to details and experiment with the function arguments to see how they work. Again, since some of you are new to R, please feel free to contact me or the TA if you have any specific questions.

Problem 2. In this question, we use the built-in R data set called attitude, which contains information from a survey of the clerical employees of a large financial organization. To access this data set, use data("attitude"). Learn more about each variable by reading the variable description in ?attitude.
(a) Summarize the main statistics of all the variables in the data set.
(b) How many observations are in the attitude data set? What function in R did you use to display this information?
(c) Produce a scatterplot matrix of the variables in the attitude data set. What seems to be most correlated with the overall rating?
(d) Produce a scatterplot of rating (on the y-axis) vs. learning (on the x-axis). Add a title to the plot.
(e) Produce 2 side-by-side histograms, one for rating and one for learning. You will need to use par(mfrow=…) to get the two plots together.

Problem 3. Use hw1.xls to answer the following questions. Include all the charts with proper labels in your report. For each chart you produce, write a sentence or two explaining what you see in the chart.
(a) Make a frequency distribution table for the gender variable.
(b) Make a bar chart for the gender variable.
(c) Make a histogram to display the distribution of the Height variable.
(d) Make a cluster bar chart (side-by-side bar chart) to examine the correlation between the gender and Ate Fried Food variables.
(e) Make a scatter plot to examine the correlation between the Weight and Height variables, and write a sentence to describe the trend you observe in the scatter plot.
(f) Find the 5-number summary for the Height data and draw a boxplot of the Height data with mild and extreme outliers identified using inner and outer fences.

Problem 4. To do this question you need the following packages in R: MASS, plyr, dplyr, tibble, and ggplot2. We are going to use the built-in data set "birthwt" (Risk Factors Associated with Low Infant Birth Weight) from the MASS library. This data set contains 189 instances and 10 variables. To learn more about this data set, you can use ?birthwt.
(a) All the variables are represented as integers. Write your own function that automatically converts all the integer variables to factors (categorical).
(b) Repeat part (a) using the mutate() and mapvalues() functions.
(c) Use the tapply() function to see what the average birth weight looks like when broken down by race and smoking status. Does smoking status appear to have an effect on birth weight? Does the effect of smoking status appear to be consistent across racial groups? What is the association between race and birth weight?
(d) Use the kable() function from knitr to display the table you get in part (c).
(e) Use the ddply() function to get the average birth weight by mother's race and compare it with the result from tapply().
(f) Use ggplot2 to plot the average birth weight (computed in part (e)) for each race group in a bar plot.
(g) Use the ddply() function to look at the average birth weight and the proportion of babies with low birth weight broken down by smoking status.
(h) Split the data further by adding "mother smokes" to the ddply() formula used in part (g).
(i) Is the mother's age correlated with birth weight? Does the correlation vary with smoking status?

Problem 5. ggplot2 produces far better and more easily customizable graphics than any other visualization functions in R. There are two basic calls in ggplot2:
• qplot(x, y, ..., data): a "quick-plot" routine, which essentially replaces the base plot().
• ggplot(data, aes(x, y, ...), ...): defines a graphics object from which plots can be generated, along with aesthetic mappings that specify how variables are mapped to visual properties.
In this question, we would like to quickly practice drawing different plots using ggplot2. For this purpose, we use the "diamonds" data set in R. You can access this data set by writing data(diamonds).
(a) What type of variable is price? Would you expect its distribution to be symmetric, right-skewed, or left-skewed? Why? Make a histogram of the distribution of diamond prices. Does the shape of the distribution match your expectation? (Use geom_histogram().)
(b) Visualize a few other numerical variables in the data set and discuss any interesting features. When describing distributions of numerical variables, we might also want to view statistics like the mean, median, etc.
(c) What type of variable is color? Which color is most prominently represented in the data set?
(d) Make a bar plot of the distribution of cut, and describe its distribution. (Use geom_bar().)
(e) Make a histogram of the depths of diamonds, with a binwidth of 0.2%, and add another variable (say, cut) to the visualization. You can do this using either an aesthetic or a facet. Typical diamonds of which cut have the highest depth? On average, does depth increase or decrease as cut grade increases?
(f) Compare the distribution of price for the different cuts. Does anything seem unusual? Describe.
(g) Draw a scatterplot showing the price (y-axis) as a function of the carat (size).
(h) Make the points in your scatter plot from part (g) more transparent using the alpha argument in geom_point().
(i) Use the facet_wrap(~ factor1 + factor2 + ... + factorn) command to create scatter plots showing how diamond price varies with carat size for different values of "cut" (use colour = color in aes()).
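
As a starting point for Problem 5, here is a minimal sketch of the ggplot(data, aes(...)) call pattern described above. It is only an illustration, not a model answer: the choice of the carat variable, the binwidth of 0.1, the colors, and the object names p_hist and p_facet are assumptions made for this example. It uses ggplot() rather than qplot(), since recent ggplot2 releases mark qplot() as deprecated.

```r
library(ggplot2)

# Load the diamonds data set that ships with ggplot2.
data(diamonds)

# ggplot() sets up the data and the aesthetic mapping (carat on the x-axis);
# the geom layer decides how the mapped variable is drawn.
# The variable, binwidth, and colors here are illustrative choices only.
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of diamond carat",
       x = "Carat", y = "Count")

# The same plot object can be extended with a facet,
# giving one panel per level of cut.
p_facet <- p_hist + facet_wrap(~ cut)
```

Nothing is drawn until a plot object is printed, for example with print(p_hist) or by typing p_hist at the console. Because the plot is an ordinary R object, layers such as facet_wrap() can be added to it later without rebuilding the whole plot.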