CSE142 - Fall 2019 Project
Predicting Star Ratings from User Reviews

Handed out: Nov 6, 2019
Evaluation deadline: Dec 4, 2019
Report and code due date: Dec 6, 2019

• The project has to be done in groups of 3.
• 10% of the points are for the write-up on the group's diversity.
• We recommend implementing all the code in Python3.
• You are allowed to use any external library, including Machine Learning and Natural Language Processing libraries.
• One (and only one) member of the group has to submit the project report using his/her account on Canvas. All group members will get points for that submission.
• How to submit your solutions: your project report must be typed up separately (in at least an 11-point font) and submitted on the Canvas website as a PDF file. The code and related files should be submitted in a .zip file.
• Your report should clearly state your group's number (from the shared project sign-up sheet) and the names, email addresses, and student IDs of all group members.
• You are very strongly encouraged to format your report in LaTeX. You may use other software, but hand-written reports are not acceptable.
• The Computer Science and Engineering Department of UCSC has a zero-tolerance policy for any incident of academic dishonesty. If cheating occurs, consequences within the context of the course may range from receiving zero on a particular assignment to failing the course. In addition, every case of academic dishonesty will be referred to the student's college Provost, who sets in motion an official disciplinary process. Cheating in any part of the course may lead to failing the course and suspension or dismissal from the university.

1 Course Project [100 points]

The rise in E-commerce has brought a significant rise in the importance of customer reviews. There are hundreds of review sites online and massive amounts of reviews for every product.
The ability to successfully decide whether a review will be helpful to other customers, and thus give the product more exposure, is vital to companies that support these reviews. This project is about automatically identifying the appropriate rating for a given review. Specifically, the Machine Learning classification task is as follows: given an input text (review), you have to predict the corresponding rating (from 1 to 5).

1.1 Dataset

The training dataset provided to you is a modified version of the Yelp Open dataset. The dataset consists of reviews and their respective ratings in JSON format. You will be provided:

• data_train.json:
  – This dataset has around 330,000 entries.
  – Each entry consists of a review of multiple sentences, the corresponding rating, and the usefulness of the review.
  – The fields include 'stars', 'useful', 'funny', and 'text'.
  – You need to predict 'stars' (the rating) from 'text' (the review).
  – If you think it helps, you may use the 'useful' and 'funny' attributes of the reviews!
  – You can download this file from this link.
• data_test.json:
  – This contains only the reviews and their attributes. The ratings will not be provided for this dataset. The test set will be provided shortly before the project evaluation deadline.

1.2 Evaluation

Your trained model will be evaluated on a held-out and hidden test set. As mentioned above, the goal at test time is to predict the rating of each review in the test set. The test set will be provided to you on the day of the evaluation. Your code should take as input the test file with no labels and output predictions in a .csv file. We will evaluate your predictions against the ground truth (hidden from you at all times) using the following performance measures: Accuracy, Precision, Recall, and F1-score. Your system should be able to accept such a file as input. Note that the file is in JSON format.
Also, the 'text' entry of each data point represents a review, and it can contain punctuation marks, including commas and quotation marks. Your predictions file (output) should contain only one column (the predicted rating). The file should be comma-separated (it should not contain any other punctuation marks, such as quotation marks). Outputs that do not conform to this format will not be evaluated.

Please see the template file "data_test_wo_label_template.json". This is just a template test file with 5 entries. The actual test file that we will release near the end will have around 50,000 entries. We also provide a template for the output prediction file; see "predictions.csv". There should be one column with the header predictions. The prediction numbers should NOT be quoted. Please make sure your prediction file format matches the one provided as a template.

1.3 Report

Page Limit: 3 pages

You are also expected to write a short report on your findings. The report should describe the details of your approach, such as the data cleaning/pre-processing, feature extraction, model details, and the experiments done to build your model. The first section of the report should be titled 'Tools Used' and should list all the tools/libraries that you used for the project. In the report, indicate whether you wrote code for a particular step or used a library. For example, if you try Logistic Regression, indicate when describing your approach whether you used a library or coded the algorithm yourself.

The first page should also contain a small paragraph on diversity. Diversity of the group can be based on a variety of factors and, as mentioned in class, you don't have to limit yourself to race/gender. Talk to your teammates and find out how you might be different from them. You will be evaluated on your description of diversity.

A typical report would have the following components/sections, but feel free to customize the suggested components according to your project.

Required Components:

1. Title
2. Group details (full names, email addresses, and student IDs of all group members)
3. Tools Used (including a short 1-2 sentence description of what each was used for)
4. Diversity

Suggested Components:

1. Abstract (1-paragraph summary of your approach and key findings)
2. Data Pre-processing
3. Feature Extraction
4. Approach(es)
5. Experimental Set-up
6. Results
7. Conclusion
8. Ideas for future work

1.4 What to Submit

1. Report (.pdf file), to be submitted on Canvas.
2. 〈Names〉_code.zip: This file should contain any code that you write for the project. It should contain a ReadMe, and the code should be properly documented. This file should be submitted on Canvas with your report.
3. 〈Names〉_predictions.csv, containing the predictions of your model on the provided test set. Please note that we will not be able to evaluate your predictions if your predictions file is not in the correct format.

In the above description, 〈Names〉 should be replaced by the last names of all group members in alphabetical order. For example, if there were two members in a group named Joe Smith and Mary Johnson, the zip file would be named JohnsonSmith_code.zip.
