
Tutoring Case: CS602 Assignment 5

May 15, 2020

CS602 – Data-Driven Development with Python, Spring 2020
Programming Assignment 5

Getting started

Review the class handouts: run all of the examples shown in them in Eclipse, and create and run your own examples for functions that are listed but not illustrated with an example. This is essential for understanding the pandas functionality required to complete this assignment. Complete the reading and practice assignments posted on the course schedule.

Programming Project: Recommend (worth: 25 points)

Rating-based movie recommendation.

Data and program overview

In this assignment you will be working with data on movies and people’s ratings of these movies. The task is to create movie recommendations for a person, based on the match between the person’s ratings and critics’ ratings of the movies.

I provide two data sets for this assignment as zip files. Download and unzip the files in your project folder. Unzipping should result in two folders added to your project folder: data and data-tiny. The following data is provided in csv files:

• A table with movie information (IMDB.csv); we will call this the movies data.
• A table with ratings of all movies listed in the movies data, by 100 critics (ratings.csv); let’s call this the critics data. The column names in the critics data correspond to the names of the critics. All ratings are integers in the 1..10 range.
• A table with one person’s ratings of a subset of the movies in the movies data set (pX.csv, where X is a number); we will call this the person data. The name of the person is provided as a column title in the file.

Review the data files to familiarize yourself with their content and structure.

The program that you write must work as follows.

1. Ask the user to specify the subfolder in the current working directory where the files are stored, along with the names of the movies, critics, and person data files.
2. Determine and output the names of the three critics whose ratings of the movies are closest to the person’s ratings, based on the Euclidean distance metric (as described later in the definition of function findClosestCritics()).
3. Use the ratings by the critics identified in item 2 to determine which movies to recommend:
   • The movie recommendations must be based on the average ratings of movies by the three critics identified in step 2 above, and consist of the highest rated movies in each movie genre (see also the definition of function recommendMovies()). Movie genre is determined by the Genre1 column of the movies data.
4. Display information about recommended movies as described below and illustrated by the sample interactions below.
   • Recommendations must be listed in alphabetical order by genre.
   • Missing data (e.g. running time) should not be included.

The sample interactions below demonstrate the running of the program.

Sample interactions

First, let’s use the tiny data set. The interaction below uses personal data file tinyp.csv that contains movie ratings by a person named Kimberwick. Critics Aldbridge, Moon, Benris had the closest recommendations.

    Please enter the name of the folder with files, the name of movies file, the name of critics file, the name of personal ratings file, separated by spaces: data-tiny tinyIMDB.csv tinyratings.csv tinyp.csv
    The following critics had reviews closest to the person's: Aldbridge, Moon, Benris
    Recommendations for Kimberwick
    "127 Hours" (Adventure), rating: 8.0, 2010, runs 94 min
    "50/50" (Comedy), rating: 7.0, 2011, runs 100 min
    "About Time" (Comedy), rating: 7.0, 2013, runs 123 min

The next interaction shows the output given the larger data set.

    Please enter the name of the folder with files, the name of movies file, the name of critics file, the name of personal ratings file, separated by spaces: data IMDB.csv ratings.csv p8.csv
    The following critics had reviews closest to the person's: Quartermaine, Arvon, Merrison
    Recommendations for Catulpa:
    "Star Wars: The Force Awakens" (Action), rating: 9.67, 2015, runs 136 min
    "The Grand Budapest Hotel" (Adventure), rating: 9.0, 2014, runs 99 min
    "The Martian" (Adventure), rating: 9.0, 2015, runs 144 min
    "How to Train Your Dragon" (Animation), rating: 9.67, 2010
    "Kubo and the Two Strings" (Animation), rating: 9.67, 2016
    "Hacksaw Ridge" (Biography), rating: 9.33, 2016, runs 139 min
    "What We Do in the Shadows" (Comedy), rating: 9.0, 2014
    "Prisoners" (Crime), rating: 8.33, 2013, runs 153 min
    "Spotlight" (Crime), rating: 8.33, 2015, runs 128 min
    "The Perks of Being a Wallflower" (Drama), rating: 9.67, 2012, runs 102 min
    "Shutter Island" (Mystery), rating: 8.33, 2010, runs 138 min

Note that in the above interaction there is sometimes more than one movie listed per genre. For instance, the two Adventure movies both had the highest average rating, so both are included in the list.
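The tie rule in the note above can be sketched with a per-genre maximum. This is only a minimal illustration, not the required solution: the table, the movie names, and the avg_rating column are made up for the example; only the Genre1 column name comes from the assignment.

```python
import pandas as pd

# Hypothetical table: one row per movie, with its genre and an average rating.
movies = pd.DataFrame(
    {
        "Genre1": ["Comedy", "Adventure", "Adventure", "Adventure", "Comedy"],
        "avg_rating": [8.0, 9.0, 9.0, 7.5, 6.0],
    },
    index=["Movie D", "Movie A", "Movie B", "Movie C", "Movie E"],
)

# For every row, look up the highest average rating within that row's genre.
best_in_genre = movies.groupby("Genre1")["avg_rating"].transform("max")

# Keep every movie that is not lower than its genre's best, so ties survive.
# A stable sort keeps tied movies in their original relative order.
top = movies[movies["avg_rating"] >= best_in_genre].sort_values(
    "Genre1", kind="mergesort"
)

print(top)
```

Comparing each row against its own genre's maximum keeps all tied movies, whereas picking a single row per group (for example with idxmax()) would keep only the first of them.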

Important Notes and Requirements

In addition to the requirements stated so far, your code must satisfy the following to gain full credit:

• Your program should not use any global variables and should have no code outside of function definitions, except for a single call to main.
• All file-related operations should use device-independent handling of paths (use the os.getcwd() and os.path.join() functions to create paths, instead of hardcoding them).
• You must define and use the functions specified below in the Required Functions section. You may and should define other functions as appropriate.
• You should use the pandas data structures effectively to efficiently achieve the goals of your functions and programs.
• The formatting of the recommendation printout should use the length of the longest movie title in the list (which should be computed in the program) to align the categories in the output.

Required Functions

You must define and use the following functions, plus define and use others as you see fit:

a. Function findClosestCritics(), which will be used to identify the three critics whose recommendations are closest to the person’s recommendations. The function should take two parameters of type DataFrame: the first providing the critics’ ratings data, and the second the personal ratings data. The function must return a list of the three critics whose ratings of movies are most similar to those provided in the personal ratings data, based on Euclidean distance.

Euclidean distance of two vectors $p = (p_1, p_2, \ldots, p_n)$ and $c = (c_1, c_2, \ldots, c_n)$ is computed as $\sqrt{(p_1 - c_1)^2 + (p_2 - c_2)^2 + \cdots + (p_n - c_n)^2}$.

To compute how similar the ratings of a critic are to the ratings of the person, we compute the distance between the vector whose coordinates $(c_1, c_2, \ldots, c_n)$ are the critic’s ratings of each movie, and the vector composed of the person’s ratings $(p_1, p_2, \ldots, p_n)$. The lower the distance, the closer, and thus more similar, the critic’s ratings are to the person’s.
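As a rough sketch of the distance computation described above, using toy data: the critic names, movie titles, and ratings are invented, and both frames are assumed to be indexed by movie title with the person’s ratings in a single column. Treat it as one possible shape of the function rather than the expected solution.

```python
import pandas as pd

# Hypothetical critics data: rows are movie titles, columns are critic names.
critics = pd.DataFrame(
    {
        "CriticA": [4, 5, 6, 9],
        "CriticB": [1, 7, 10, 2],
        "CriticC": [4, 7, 5, 8],
        "CriticD": [10, 1, 1, 5],
    },
    index=["Movie 1", "Movie 2", "Movie 3", "Movie 4"],
)

# Hypothetical person data: the person rated only three of the four movies.
person = pd.DataFrame(
    {"Person": [4, 7, 6]}, index=["Movie 1", "Movie 2", "Movie 3"]
)


def findClosestCritics(critics_df, person_df):
    """Return the three critics with the smallest Euclidean distance."""
    person_ratings = person_df.iloc[:, 0]
    # Restrict the critics' ratings to the movies the person has rated.
    rated = critics_df.loc[person_ratings.index]
    # Squared difference per movie and critic, aligned on the title index.
    squared_diff = rated.sub(person_ratings, axis=0) ** 2
    # Column sums give squared distances; the square root gives distances.
    distances = squared_diff.sum() ** 0.5
    return distances.sort_values().head(3).index.tolist()


print(findClosestCritics(critics, person))
```

With these toy numbers the distances are 1.0 for CriticC, 2.0 for CriticA, and 5.0 for CriticB, so those three are returned in that order.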

For example, if the personal data included three ratings (4, 7, 6), and the critic rated the same movies as (4, 5, 6), the Euclidean distance would equal $\sqrt{(4 - 4)^2 + (7 - 5)^2 + (6 - 6)^2} = 2.0$.

Hint: for this function, create a DataFrame with critics’ names as its columns, movie titles as its index, and the data in each column set to the squared difference between the critic’s and the person’s score for each movie. Then calculate the sum of each column and find the three smallest values using sorting. Return a list of the associated critics’ names.

b. Function recommendMovies(), which will be used to generate movie recommendations based on ratings by the chosen critics. The function must accept four parameters: the critics and personal ratings data frames, the list of the three critics most similar to the person, and the movie data frame. Out of the set of movies that are not rated in the personal data but are rated by the critics, the function should determine which movies have the highest average rating by the most similar critics in each movie genre (specified by the Genre1 column of the movie data). In other words, you need to compute the top-rated unwatched movies in each genre category, based on the average of the three critics’ ratings. You may assume that the critics data will always be complete, i.e. will include ratings of all movies. The function must then (a) put together information about these top-rated movies sufficient to produce the printout, showing the details of each of the recommended movies as illustrated by the interactions, and (b) return it using some data structure of your choice.

Hint: An easy way to generate a list of unwatched movies with all critics’ ratings is to merge the person data with the critics data and select the portion of the result that has missing values in the person’s column. After that, to find the highest rated movies per genre, first join the resulting data with the movies data to have full movie information. In this joined DataFrame, compute the average rating by the three critics (storing the average in a new column) and then select the highest rating in each genre (using groupby). After that, select those movies in each genre whose rating is not lower than the highest in the genre.

c. Function printRecommendations() with two parameters: the first containing information about the recommended movies, and the second the name of the person for whom the recommendation is made (the name is specified in the header of the personal ratings data file). The function must produce a printout of all recommendations passed in via the first parameter, in alphabetical order by genre, as shown in the sample interactions. Make sure to examine the sample interactions and implement the details of the printout. The function should return no value.

d. Function main(), which will be called to start the program and works according to your design.

More Hints

• For some csv data, you may not need all of the columns. You can specify which columns to import into a data frame, or you could drop unnecessary ones to improve performance and simplify testing and debugging.
• Keeping your data frames indexed by the title should help in making joins easy. Note that the title can be both an index and one of the columns, if necessary.
• Make sure to inspect intermediate results by printing them or saving them to files. To see the data frames completely when they are printed, you can include the following function calls in your program to set display parameters for pandas:

    pd.set_option('display.max_columns', 1000)
    pd.set_option('display.max_rows', 1000)
    pd.set_option('display.width', 1000)

• When you open and save the csv files in Excel, Excel may change the encoding used for file characters. To return the encoding to the UTF-8 standard that Python likes, open the file in Notepad, and when saving it, specify the encoding as UTF-8.
• Although I have provided a small sample data set for testing purposes, I encourage you to create your own, for which you should know the result in advance.
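The path requirement and the first two hints above can be combined in a small loader. The sketch below is one possible shape, under stated assumptions: load_table is a hypothetical helper, the Title, Year, and Unused column names and the data-demo folder are invented (only Genre1 is named in the assignment), and a temporary directory stands in for the current working directory so that the example runs anywhere.

```python
import os
import tempfile

import pandas as pd


def load_table(folder, filename, index_col, columns=None, base=None):
    """Read base/folder/filename, keeping only `columns`, indexed by `index_col`."""
    # Default to the current working directory, as the assignment requires.
    base = os.getcwd() if base is None else base
    # os.path.join picks the right separator for the operating system.
    path = os.path.join(base, folder, filename)
    return pd.read_csv(path, usecols=columns, index_col=index_col)


# Demo with a throwaway file; the column names here are assumptions.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "data-demo"))
    csv_path = os.path.join(base, "data-demo", "movies.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("Title,Genre1,Year,Unused\n")
        f.write("Movie A,Comedy,2011,x\n")
        f.write("Movie B,Drama,2012,y\n")
    movies = load_table(
        "data-demo",
        "movies.csv",
        index_col="Title",
        columns=["Title", "Genre1", "Year"],
        base=base,
    )

print(movies)
```

Passing the index column by name rather than by position avoids having to reason about column positions once usecols has dropped some columns.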

Grading

The grading schema for this project is roughly as follows:

Five points will be awarded for the correct implementation of each of the four functions above (which may call other functions that you define), provided it uses the data structures, methods and functions of the pandas/numpy packages appropriately and effectively.

Three points will be awarded for making the code sufficiently general to handle different input files, i.e. not tied to the specific content of the files that you are given (though it might be somewhat dependent on the structure of those files, i.e. what is provided by the columns, rows, etc.).

Two points will be awarded for style, as defined by the guidelines in Handout 1.

Created by Tamara Babaian on November 8, 2019

