P7: COVID-19 Growth Trend Clustering
https://canvas.wisc.edu/courses/176717/assignments/804556 (snapshot from 2020/4/8)

Due Thursday by 9:29am. Points: 100. Submitting: a file upload. File types: py. Available until Apr 9 at 9:29am.

NOTE: A tie-breaking protocol has been added to the HAC description below.

Assignment Goals

Implement hierarchical clustering. Process real-world data of particular contemporary relevance.

Summary

There is a lot of analysis happening with the various datasets for COVID-19 right now. One of the goals of these analyses is to help figure out which countries are "beating" the pandemic.

The Ten Hundred Plot

Using the publicly available Johns Hopkins COVID-19 data, you'll be performing clustering on time series data for different regions in the world. Each region is defined by a row in the data set, which can be a country, a province, etc. The time series data represents the number of (cumulative) confirmed cases on each day. Because different regions have different onsets of the outbreak and differ in magnitude, it is often desirable to convert a raw time series into a shorter feature vector. For this assignment, you will represent a time series by two numbers: the "x" and "y" values in the above ten-hundred plot video. After each region becomes that two-dimensional feature vector, you will cluster all regions with HAC.

Program Specification

Download the data in CSV format: time_series_covid19_confirmed_global.csv

This is a snapshot of the data from the morning of April 2, 2020. If you wish to test your code on current data, the Johns Hopkins University git repo (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv) has constantly-updating data, but we will only be testing your code with the snapshot.
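To make the two-number representation concrete, here is a minimal sketch of the feature extraction under the conventions stated below (the "n/10 day" is the latest day with at most n/10 cases, x counts days from that day to today, and y counts days from the "n/100 day" to the "n/10 day"). The list-of-counts input and helper names are illustrative only; the graded function receives a row dictionary, and the Matlab code linked in the video description remains the authoritative definition of x and y.

```python
# Illustrative sketch only: operates on a plain list of cumulative daily
# counts rather than the CSV row dictionary the assignment specifies.
def calculate_x_y(cases):
    """cases: list of cumulative confirmed counts, one entry per day."""
    n = cases[-1]          # today's cumulative total
    j = len(cases) - 1     # index of "today"
    if n <= 0:
        return (float('nan'), float('nan'))

    def latest_day_leq(threshold):
        # Latest day index with a count <= threshold, or None if absent.
        for day in range(j, -1, -1):
            if cases[day] <= threshold:
                return day
        return None

    i = latest_day_leq(n / 10)    # the "n/10 day"
    if i is None:
        return (float('nan'), float('nan'))
    k = latest_day_leq(n / 100)   # the "n/100 day"
    x = j - i
    y = (i - k) if k is not None else float('nan')
    return (x, y)
```

For example, for the series [1, 5, 20, 40, 100] we have n = 100, the n/10 day is day 1 (5 ≤ 10) and the n/100 day is day 0 (1 ≤ 1), so the sketch returns (3, 1). The exact NaN conditions for low-growth series are an assumption here; follow the linked Matlab code for the graded behavior.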
Write the following Python functions:

1. load_data(filepath) — takes in a string with a path to a CSV file formatted as in the link above, and returns the data (without the lat/long columns but retaining all other columns) in a single structure.

2. calculate_x_y(time_series) — takes in one row from the data loaded by the previous function, calculates the corresponding x, y values for that region as specified in the video, and returns them in a single structure. Notes:
   1. The "n/10 day" is the latest day with LESS THAN OR EQUAL TO n/10 cases; similarly for the "n/100 day".
   2. If the "n/10 day" is day i, and today is day j, then x = j - i, not j - i + 1.
   3. Some x or y values can be NaN if the time series doesn't contain enough growth.
   4. There is a link to Matlab code in the video description on YouTube; please consult that for the precise definition of x and y.

3. hac(dataset) — performs single-linkage hierarchical agglomerative clustering on the regions with the (x, y) feature representation, and returns a data structure representing the clustering.

You may implement other helper functions as necessary, but these are the functions we will be testing.

Load Data

Read in the file specified in the argument (the DictReader from Python's csv module (https://docs.python.org/3/library/csv.html#csv.DictReader) will be of use) and return a list of dictionaries, where each row in the dataset is a dictionary with the column headers as keys and the row elements as values. These dictionaries should not include the lat/long columns, as we will not be using them in this program, but should retain the province (possibly empty) and country columns so that data points can be uniquely identified. You may assume the file exists and is a properly formatted CSV.

Calculate Feature Values

This function takes in the data from a single row of the raw dataset as read by the previous function (i.e. a single dictionary, without the lat/long values but retaining all other columns). As explained in the video above, this function should return the x, y values in a tuple, formatted as (x, y).

Perform HAC

For this function, we would like you to mimic the behavior of SciPy's HAC function, linkage() (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html). You may not use this function in your implementation, but we strongly recommend using it to verify your results!

Input: A collection of m observation vectors in n dimensions, passed as an m by n array. All elements of the condensed distance matrix must be finite, i.e. no NaNs or infs. If you follow the Matlab code from the YouTube video description, you will occasionally have NaN values — such rows should be filtered out within this function and should not count toward your total number of regions. In our case, m is the number of regions and n is 2: the x and y features for each region.

Using single linkage, perform the hierarchical agglomerative clustering algorithm as detailed on slide 19 of this presentation (http://pages.cs.wisc.edu/~jerryzhu/cs540/handouts/introML.pdf). Use a standard Euclidean distance function for calculating the distance between two points.

Output: An (m-1) by 4 matrix Z. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster m + i. A cluster with an index less than m corresponds to one of the m original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value, Z[i, 3], represents the number of original observations in the newly formed cluster. That is:

- Number each of your starting data points from 0 to m-1. These are their original cluster numbers.
- Create an (m-1) x 4 array or list.
- Iterate through the list row by row. For each row, determine which two clusters you will merge and put their numbers into the first and second elements of the row.
- The first point listed should be the smaller of the two cluster indices.
- The single-linkage distance between the two clusters goes into the third element of the row.
- The total number of points in the cluster goes into the fourth element.
- If you merge a cluster containing more than one data point, its number (for the first or second element of the row) is given by m + the row index in which that cluster was created.
- Before returning the data structure, convert it into a NumPy matrix.

If you follow these guidelines for input and output, your result should match the result of scipy.cluster.hierarchy.linkage() and you can use that function to verify your results. Be aware that this function does not contain code to filter NaN values, so this filtering should be performed before calling the function.

Tie Breaking

In the event that there are multiple pairs of points with equal distance for the next cluster:

- Given a set of pairs with equal distance {(x_i, x_j)} where i < j, we prefer the pair with the smallest first cluster index i.
- If there are still ties (x_i, x_j), ..., (x_i, x_k), where i is that smallest first index, we prefer the pair with the smallest second cluster index.

Be aware that this tie-breaking strategy may not produce identical results to scipy.cluster.hierarchy.linkage().

Challenge Options

If you wish to continue exploring the data, we challenge you to implement a complete-linkage option and compare the results with single linkage. You may also wish to use the results of your HAC algorithm to choose an appropriate k for k-means, and implement that clustering as well. Do you get the same clusters every time? If not, how do they differ? Is there anything meaningful there, do you think?

Submission

Please submit your code in a file called ten_hundred.py. All code should be contained in functions or under an if __name__ == "__main__": check so that it will not run if your code is imported into another program.
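To make the output format and the tie-breaking rule concrete, here is a minimal single-linkage sketch. It is a naive O(m^3) illustration, not a graded reference implementation: it assumes NaN rows have already been filtered out, and the dictionary-based cluster bookkeeping is just one possible design. Breaking ties by comparing (distance, first index, second index) tuples implements exactly the preference order described above.

```python
import numpy as np

def hac(dataset):
    """Single-linkage HAC over 2-D points; returns an (m-1) x 4 NumPy
    matrix in the linkage()-style format described in the spec.
    Illustrative sketch: assumes NaN rows were filtered beforehand."""
    pts = [tuple(map(float, p)) for p in dataset]
    m = len(pts)
    # Active clusters: cluster index -> set of original point indices.
    clusters = {i: {i} for i in range(m)}
    Z = []
    for step in range(m - 1):
        best = None  # (distance, first index, second index)
        for a in sorted(clusters):
            for b in sorted(clusters):
                if b <= a:
                    continue
                # Single linkage: minimum pairwise Euclidean distance.
                d = min(
                    ((pts[p][0] - pts[q][0]) ** 2
                     + (pts[p][1] - pts[q][1]) ** 2) ** 0.5
                    for p in clusters[a] for q in clusters[b]
                )
                # Tuple comparison breaks ties by smallest first cluster
                # index, then smallest second cluster index.
                if best is None or (d, a, b) < best:
                    best = (d, a, b)
        d, a, b = best
        new_idx = m + step  # merged cluster gets index m + row number
        clusters[new_idx] = clusters.pop(a) | clusters.pop(b)
        Z.append([a, b, d, len(clusters[new_idx])])
    return np.matrix(Z)
```

For example, with the three points (0, 0), (0, 1), (5, 0), the first row merges clusters 0 and 1 at distance 1.0, and the second merges cluster 2 with the new cluster 3 at single-linkage distance 5.0. On NaN-free inputs this sketch should agree with scipy.cluster.hierarchy.linkage() except possibly where its tie-breaking differs, as noted above.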
