CS 663 – Machine Learning – Spring 2020 – Data Challenge

The purpose of this challenge is to test your ability to write software and models to collect, normalise, store, analyse and visualise "real world" data. The challenge is designed to mimic those you may receive when applying for positions as a Data Scientist or Machine Learning Engineer. You may draw on your work in the lab assignments.

The challenge is designed to take about two hours, but it is not timed. Please deliver your results by the due date. You may use any tools or software on your computer, or that are freely available on the Internet, as long as the tool works with a Jupyter notebook. We prefer that you use simpler tools over more complex ones and that you are "lazy" in the sense of using third-party APIs and libraries as much as possible. The use of obscure, undocumented "black box" libraries is discouraged.

Do as much as you can, as well as you can. Prefer efficient, elegant solutions. Prefer scripted analysis to unrepeatable use of GUI tools. For data security and transfer-time reasons, you have been given a relatively small data file. Prefer solutions that do not require the full data set to be stored in memory.

Finally, we are also interested in your ability to work on a team, which means considering how to package and deliver your results in a way that makes it easy for us to review them. This does NOT mean you are allowed to discuss the challenge with others or use their work, including that of people enrolled in this or similar courses. It does mean that undocumented code and data dumps are virtually useless; commented code and a clear write-up with elegant visuals are ideal. Also consider how asking targeted questions of members of our team may allow you to get more done in less time.

Background

Health Inspectors from the Health Department of the City and County of San Francisco routinely conduct inspections of restaurants ("facilities"). After conducting an inspection of a facility, a Health Inspector calculates a score based on the violations observed. Violations fall into:

● High risk category: records specific violations that directly relate to the transmission of food-borne illnesses, the adulteration of food products and the contamination of food-contact surfaces
● Moderate risk category: records specific violations that pose a moderate risk to public health and safety
● Low risk category: records violations that are low risk or pose no immediate risk to public health and safety

These violations may also be graded, i.e. converted to an inspection score, and posted, for example, on the windows of the facilities. By design, some inspections do not contain violations or inspection scores.

Data

With these instructions, we have provided two CSV files:

● facility_scores_known.csv (9 MB): 43,199 facility records, plus 1 header
● facility_scores_unknown.csv (2 MB): 10,774 facility records, plus 1 header

Requirements (Process)

There are two (2) parts to this challenge:

1. Predict inspection scores.
2. Explain inspection scores.

Predict inspection scores

You may use the training data (facility_scores_known.csv) to create a model for predicting the inspection score of a facility. The inspection score for each facility is missing from the test set (facility_scores_unknown.csv). You must use a model to predict the inspection score for each instance in this set. You will submit your predicted inspection scores, which the grader will compare against the actual values using MSE (mean squared error).
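For reference, the MSE metric used by the grader can be reproduced locally with scikit-learn, which may help when comparing candidate models on a hold-out split of the known data. This is a minimal sketch; the array values and names below are placeholders, not part of the challenge data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder arrays standing in for a hold-out split of the known data:
# y_val holds true inspection scores, y_pred holds the model's predictions.
y_val = np.array([88, 92, 76, 100])
y_pred = np.array([85.0, 90.5, 80.0, 96.0])

# MSE is the mean of the squared residuals; lower is better.
mse = mean_squared_error(y_val, y_pred)
print(f"Hold-out MSE: {mse:.2f}")
```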
Your prediction must be named "preds.csv", a file in CSV format with one field: inspection score. This file must have one prediction for each facility appearing in the test set (facility_scores_unknown.csv), in order. For example:

76
17

Explain inspection scores

Once your model has been created, you must provide an explanation of what factors best predict a facility's score.

Submission

Submit the following to GitHub (starter link):

● Your Jupyter notebook for the implementation of the above
● Your model's predicted inspection scores as a CSV file
● A PDF document with an explanation of your process, findings, visualisations, etc.

The submission deadline is 11:59 PM PDT on 13 May 2020. Late submissions will not be accepted.

Grading

Each submission will be graded as follows:

● 50% Performance: the competitive accuracy (as measured by MSE) of your model, as executed on a neutral system [1]
● 20% Explanation of Model: the degree to which your approach to finding explanatory features follows a reasonable process, together with the correct identification of the explanatory features, graded as follows:
  20% = reasonable process and correct features derived
  14% = not following a reasonable process, or incorrect features
  7% = not following a reasonable process and incorrect features
● 15% Code Quality: the degree to which your solution is modular, easy to run, easy to read and contains comments helpful to a peer or other person with skills similar to yours
  15% = completely
  9% = partially
  3% = poorly
● 10% Process: the degree to which your solution follows a reasonable process and documents that process
  10% = completely
  7% = partially: missing process details / module documentation
  3% = poorly: missing several major details / most documentation
● 5% Execution Time: the competitive wall-clock execution time of your model as executed on a neutral, CPU-based system

[1] For competitive grading, the submissions with the top performance receive a full-credit score (e.g. 50/50 on Performance). Submissions that do not yield top performance are ranked and graded accordingly. Your model will be executed once, so be wary of models with varying / random performance.
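As a concrete illustration of the preds.csv deliverable described above, the sketch below writes a single-column CSV. Only the file name and field name come from the spec; the prediction values are placeholders that would normally come from model.predict() on features built from facility_scores_unknown.csv, kept in that file's row order. The example rows in the spec show bare values, so confirm with the teaching team whether a header row is expected.

```python
import pandas as pd

# Placeholder predictions; in practice these come from model.predict() applied to
# features derived from facility_scores_unknown.csv, in that file's row order.
preds = [76, 17]

# One prediction per facility, one "inspection score" field, original row order.
# Pass header=False instead if the grader expects bare values with no header row.
pd.DataFrame({"inspection score": preds}).to_csv("preds.csv", index=False)
```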

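For the "Explain inspection scores" requirement, one common approach (not mandated by the spec) is to fit a tree-based regressor and rank its impurity-based feature importances as a starting point for the write-up. The sketch below uses synthetic placeholder data and hypothetical feature names; in practice X and y would be built from facility_scores_known.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data standing in for features derived from
# facility_scores_known.csv (the column names here are hypothetical).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "high_risk_count": rng.integers(0, 5, size=200),
    "moderate_risk_count": rng.integers(0, 8, size=200),
    "low_risk_count": rng.integers(0, 10, size=200),
})
y = 100 - 7 * X["high_risk_count"] - 3 * X["moderate_risk_count"] - X["low_risk_count"]

# Fit a forest and rank features by impurity-based importance.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```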