- October 24, 2020

IE 332 – Homework #2 Due: Oct 26th, 11:59pm Read Carefully. Important! As outlined in the course syllabus this homework is worth 6% of your final grade. The maximum attainable mark on this homework is 75. As was also outlined in the syllabus, there is a zero tolerance policy for any form of academic misconduct. The assignment can be done individually or in pairs. By electronically uploading this assignment to Gradescope/Brightspace you acknowledge these statements and accept any repercussions if in any violation of ANY Purdue Academic Misconduct policies. You must upload your homework on time for it to be graded. No late assignments will be accepted. Only the last uploaded version of your assignment will be graded. NOTE: You should aim to submit no later than 30 minutes before the deadline, as there could be last minute network traffic that would cause your assignment to be late, resulting in a grade of zero. You must use the provided LATEXtemplate on Brightspace to submit your assignment. Page i of i IE 332 Homework #2 Due: Oct 26 2020 1. (0 points) These are style points meant to enforce the skill of communicating technical information in a precise, concise and easily interpretable way. You are penalized for (a) using poor grammar/spelling, (b) disorganized presentation of solutions (including organizing your code into functions, as appro- priate), (c) not commenting well your source code, (d) not using meaningful variable names in your code. At the discretion of the TA (who should be grading this “hard”). The presumption is that you do not do any of these things, and so doing them will cost addition points (up to -10). Your goal is to get 0/0 on this question. The assignment should have margins between 0.5 and 0.75 inches wide, with font size no larger than 12pt (11pt for code) and no smaller than 10pt – sub/superscripts, figure and plot captions excluded, but should be clearly legible. Clearly label each question. If a question requires more than 40% of the page to answer, then that is the only answer on that page. If multiple pages are required, this rule applies to the last of those pages. 2. The given jobs.csv file contains thousands of historical job postings, where each row (job posting) has multiple features (columns). The overall goal of this question is to use this historical data to build a probabilistic model that can in turn be used to create a dataframe of 10,000 realistic but synthesized job postings, each having the features Level, Salary.Range.From, Salary.Range.To, where: • Level refers to the seniority of the job • Salary.Range.From refers to the lower bound salary of this job • Salary.Range.To refers to the upper bound salary of this job Each of the questions below must be solved using as few lines of R code as possible. (a) (5 points) Data Cleaning. Read the csv file into R. Notice that the salary is not reported consistently across each job posting. Conduct data-cleaning so that the units of time and salary are standardized to a yearly salary as a full time employee for all postings. There may be more than one way to handle this data-cleaning problem; justify your approach as to which is best considering the overall goal of this question.Hint: Check columns Salary.Frequency and Full.Time.Part.Time.indicator. (b) (6 points) To gain some insight into the relationship between job seniority level and the upper limit salary, use at least one statistical method to interpret/analyze and an associated plot of the relationship. Justify your approach and interpretation of the utility of the results with respect to the overall goal of the question. (c) (10 points) Assume both Salary.Range.To and Salary.Range.From follow a normal distribution conditional on a certain job level. Then, the following code prepares us to simulate job postings considering the distribution over Level, Salary.Range.From and Salary.Range.To. ##Level $4A,4B,M6$ only have a couple of data points which are not enough to fit a distribution ##combine level 4A, 4B into 4 jobs2$Level[which(jobs2$Level %in% c(“4A”,”4B”))] = “4” ##combine level M6 into M7 jobs2$Level[which(jobs2$Level == “M6”)] = “M7” ##calculate marginal distribution on column Level v.nu.level2 = unclass(as.factor(data.matrix(jobs2$Level))) level.tbl2 = as.matrix(table(v.nu.level2)) m.level = as.numeric(level.tbl2/sum(level.tbl2)) ##Calculate the group mean and sd of lower bound salary on different job levels agg.from.mean = aggregate(Salary.Range.From ~ Level,jobs2,mean) agg.from.sd = aggregate(Salary.Range.From ~ Level,jobs2,sd) ##Calculate the group mean and sd of upper bound salary on different job levels agg.to.mean = aggregate(Salary.Range.To ~ Level,jobs2,mean) agg.to.sd = aggregate(Salary.Range.To ~ Level,jobs2,sd) Assuming the above code already exists, write a new function called my.sim that first simulates a numerical vector of job levels based on its marginal distribution m.level. Then, by iterating over each of the simulated job levels, simulate the corresponding salary lower and upper bound using the calculated group mean and standard deviation (MAXIMUM 12 LINES OF R CODE). (d) (8 points) Use your function from above to simulate 10,000 job postings with features of Level, Salary.Range.From and Salary.Range.To. To compare how well it can create realistic job post- ings you will evaluate the result obtained by your simulation function by the following criteria: • Is the relationship between the simulated job seniority and upper limit salary similar to what you obtain in part b? • Do the order between salary lower bound and salary upper bound and the range of each of them make sense? Discuss potential concerns you may have with either criteria, and propose good remedy to alleviate them. Record the number of times the second criteria was violated (before corrections) as the proportion of rejected samples from the total number of records generated. (e) (6 points) Notice that the previous function for synthesizing data did not consider the joint distribution between columns Salary.Range.From and Salary.Range.To. We will now correct this using the method of inverse sampling, which is one way to sample from the joint distribution. Inverse sampling uses the fact that if we apply the cumulative distribution function FX of X on X itself, then the result follows a uniform distribution FX(X) ∼ U(0, 1). Then, if we map the randomly sampled uniform values U to the inverse of the cumulative distribution function F−1X , we will obtain the random variables which follow FX : F −1 X (U) ∼ F . The following code uses inverse sampling to sample from empirical joint distribution of columns Salary.Range.From and Salary.Range.To. Using this code, simulate 10,000 job records with features of Level, Salary.Range.From and Salary.Range.To.(Maximum 10 lines of R code). Hint: You may need to use the marginal distribution on Level: m.level. Compare this approach to your algorithm in part c by: 1. the time needed to execute them.(Maximum 5 lines of R code) Note, the R package stats has a function system.time that can be used to record the amount of time a function takes to execute. 2. the proportion of rejected samples(Maximum 4 lines of R code). v.char.level = agg.from.mean[,1] #get a unique character set of the job seniority level ##empirical CDF function where we count number of instances fall below or on the upper bound my_cdf = function(i,df,ubound.grid){ ##i: row index of ubound.grid ##df: empirical data ##ubound.grid: grid of upper bounds count = length(which(df[,1]<=as.numeric(ubound.grid[i,1]) & df[,2]<=as.numeric(ubound.grid[i,2]))) result.cdf = count/nrow(df) } IE 332 Homework #2 Page 2 of 5 ##function to find the first level of probability which reaches a uniform random variable bi_find_inverse#u: vector of uniform random variables #my.ecdf: corresponding cdf of simulating grids #ubound.grid:simulating grids inds = which(my.ecdf>=u) if(length(inds)>=1){ ind = inds[1] result = ubound.grid[ind,] }else{result = NA} return(result) } ##function to conduct inverse sampling my_inverse_sim #nu.level is the numerical job seniority level #n is the number of simulated records in total #input.df: the empirical dataframe after data cleaning #step.size is the increment between two grid points ###first, construct grid of upper bounds level = v.char.level[nu.level] #get level in character n1 = round(m.level[nu.level]*n)#get number of records to be simulated based on level salary.to = input.df$Salary.Range.To[input.df$Level==level] salary.from = input.df$Salary.Range.From[input.df$Level==level] from.min = min(salary.from) from.max = max(salary.from) from.seq = seq(from.min-step.size,from.max + step.size,by = step.size) to.min = min(salary.to) to.max = max(salary.to) to.seq= seq(to.min-step.size,to.max + step.size,by = step.size) ubound.grid = expand.grid(from.seq,to.seq) n.ubound = nrow(ubound.grid) ###then, calculate the empirical cdf based on upper bounds my.df my.ecdf = sapply(1:n.ubound,my_cdf,my.df,ubound.grid) ##simulate uniform random variables and find inverse of them based on empirical cdfs u1 = runif(round(n1)) bi.sim = lapply(u1,bi_find_inverse,my.ecdf,ubound.grid) ##remove any NA values for caution, shouldn’t be any bi.sim=Filter(Negate(anyNA), bi.sim) df.bi.sim = do.call(’rbind’, bi.sim) return(df.bi.sim) } IE 332 Homework #2 Page 3 of 5 3. (10 points) The ER diagram below describes the relationships between doctors who prescribe drugs to their patients and the production of such drugs by pharmaceutical companies and their commer- cialization by pharmacies. Provide all the CREATE TABLE statements needed to create the database. IE 332 Homework #2 Page 4 of 5 4. Using the database in the Question 3, provide SQL code to answer for the following questions: (a) (5 points) What are the drugs prescribed by doctors with more than 15 years of experience? List the drug’s trade name and the name of the pharmaceutical company that produces it. (b) (5 points) What are the drugs that have the highest price in the market? List both the trade name of the drug and the company that produces it and its price. Hint: Take into account that there might be more than one drug with highest price. You may want to use embedded queries to answer this item. (c) (6 points) What are the drugs prescribed to patients (strictly) over 60 years old in quantities greater or equal than 100mg that have the suffix “in” in their trade name? For each drug, display its trade name, the company that produces it and the number of patients with the characteristics above that were prescribe such medication. Order the list by the latter in descending order. Display only the first 20 results. Note: quantities for all prescriptions are measured in miligrams. (d) (5 points) What is the name and address of pharmacies that signed at least two contracts with pharmaceutical companies that produce any drugs except “atorvastatin”? For each pharmacy, list also the number of pharmaceutical companies it has contracts with the condition above. Sort the results by the pharmacies with more contracts to the ones with less, breaking ties by the pharmacy’s name in alphabetical order. 5. Using again the database in Question 3, for each English statement below indicate whether the associated query is correct. If it is incorrect provide a fix to the query. If it is correct, simplify the query to achieve the same result but with less complicated SQL code. (a) (3 points) For any specialty, except for “cardiology”, with more than 10 doctors, list the specialty name and the total number of prescriptions issued by doctors in such specialty between March 1st, 2020 and July 31st, 2020. Hint: You can use BETWEEN to obtain dates between a certain period. Dates in SQL are displayed in the format ”YYYY-MM-DD”. SELECT specialty, COUNT(date) FROM doctor DOC RIGHT JOIN prescribes PB ON DOC.SSN = PB.SSN WHERE DOC.specialty NOT IN (SELECT * FROM doctor WHERE specialty = “cardiology”) AND DOC.specialty >= 10 GROUP BY specialty (b) (6 points) Display the name of each pharmaceutical company with 50 contracts or more together with the number of those contracts that will expire in 2021. Hint: YEAR() can be used to obtain the year of a date. Ex: YEAR(“2017-06-15”) returns 2017. SELECT C.company_name, C.COUNT(end_date) FROM (SELECT company_name, end_date FROM contracts_with GROUP BY company_name HAVING COUNT(pharmacy_name) >= 50 ) AS company_date GROUP BY C.company_name WHERE YEAR(C.end_date) = 2021 IE 332 Homework #2 Page 5 of 5 欢迎咨询51作业君