
Tutoring Case: CSCC11

May 15, 2020

CSCC11: HW2
Due by 11:59pm Sunday, November 10, 2019
University of Toronto Scarborough
October 27, 2019

Please submit separate files for a) the write-up (named write_up_hw2.pdf) in PDF (you can convert a Word doc to PDF, or preferably use LaTeX, https://www.overleaf.com/), b) the Python files, and c) figures (if you choose to include them separately from the write-up). All files are to be submitted on MarkUs. Do not turn in a hard copy of the write-up.

1. k-means clustering

Given the data in the file "customer.csv", implement the k-means clustering algorithm (your own implementation of k-means). The data has features (Gender, Age, Annual Income) and a continuous label (Spending Score). Remember that clustering is an unsupervised learning algorithm!

(a) Implement k-means (for k = 2, 3, 4, 5).
(b) Implement k-means++ (for k = 2, 3, 4, 5).
(c) Plot the data clusters (use a different color for each cluster).

• Input: numpy.ndarray
• Output: cluster plots
• Libraries: use any libraries and functions except a built-in function for k-means.
• Functions:
  – my_kmeans(features, k) – returns the clusters
  – my_kmeans_plot(clusters) – plots the clusters
• Python file name: kmeans_hw2.py

2. ROC, AUC and confusion matrix (no programming required for this question)

(a) What is an ROC curve and an AUC? Define and explain with examples. List Python function(s) that can plot/calculate these.
(b) What is a confusion matrix? Define and explain with examples. List Python function(s) for it.
(c) Balanced and imbalanced datasets. Define and explain with examples. List Python function(s) that can be used to address the imbalanced-dataset issue.

3. Random forest

Implement a random forest using any Python built-in functions for the Titanic survivor data (train and test sets are provided in CSV files).

• Input: train and test sets in numpy.ndarray
• Output: confusion matrix
• Libraries: any library, any function
• Functions: no restriction
• Python file name: random_forest_hw2.py

4. PCA

Implement PCA (principal component analysis) on the Iris dataset and reduce it to 3 features and to 2 features. Load the Iris dataset (all instances) in Python using the commands:

from sklearn import datasets
iris = datasets.load_iris()
X = iris.data

• Input: numpy.ndarray (150 by 4)
• Output: two reduced-dimension matrices (150 by 3 and 150 by 2) and a data plot
• Libraries: use any libraries and functions except a built-in function for PCA.
• Functions:
  – my_pca(data_matrix, k) – returns and prints the low-dimensional matrix
  – my_pca_plot(low_dim_matrix) – plots the low-dimensional data
• Python file name: pca_hw2.py
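For question 1, a minimal sketch of how my_kmeans and the k-means++ seeding might look. This is only one reasonable reading of the spec, not the official solution: the function returns the per-point label array together with the cluster centres as its notion of "clusters", the keyword arguments (plusplus, n_iter, seed) are additions of this sketch, and any encoding of the Gender column in customer.csv into numbers is assumed to happen before the call.

```python
import numpy as np

def init_plusplus(X, k, rng):
    # k-means++ seeding: pick the first centre uniformly at random, then pick
    # each new centre with probability proportional to the squared distance
    # to the nearest centre chosen so far.
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.array(centres)[None, :, :]) ** 2).sum(-1), axis=1)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)

def my_kmeans(features, k, plusplus=False, n_iter=100, seed=0):
    # Lloyd's algorithm: alternate between assigning each point to its nearest
    # centre and recomputing each centre as the mean of its assigned points.
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    if plusplus:
        centres = init_plusplus(X, k, rng)
    else:
        centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```

The plotting helper my_kmeans_plot can then simply scatter-plot two of the numeric features (for example Age against Annual Income) with the returned labels passed as the colour argument.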
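For question 2, the Python functions the write-up is expected to list are usually the scikit-learn metrics together with matplotlib for the plot. The tiny arrays below are invented placeholders purely to show the calls; they are not part of the assignment data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

# Placeholder ground-truth labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

# A confusion matrix needs hard predictions, e.g. thresholding the scores at 0.5.
y_pred = (y_score >= 0.5).astype(int)
print(confusion_matrix(y_true, y_pred))
```

For part (c), common answers include resampling with sklearn.utils.resample, class weighting via the class_weight parameter of many scikit-learn classifiers, or the imbalanced-learn package (e.g. RandomOverSampler or SMOTE).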
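For question 3, since any library is allowed, scikit-learn's RandomForestClassifier is the natural choice. The sketch below assumes the Titanic CSVs have already been loaded and encoded into numeric arrays X_train, y_train, X_test, y_test; the loading and the handling of categorical columns are left out because the provided files are not shown here, and the helper name run_random_forest is this sketch's own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

def run_random_forest(X_train, y_train, X_test, y_test, n_trees=100, seed=0):
    # Fit a random forest on the training split and report the confusion
    # matrix on the held-out test split, as the question asks.
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return confusion_matrix(y_test, y_pred)
```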
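For question 4, PCA without the built-in decomposition can be done directly with numpy: centre the data, take the eigenvectors of the covariance matrix, and project onto the top k. A minimal sketch matching the required function names (the choice of eigendecomposition rather than SVD, and plotting only the first two reduced dimensions, are assumptions of this sketch):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

def my_pca(data_matrix, k):
    # Centre the data, then project it onto the k eigenvectors of the
    # covariance matrix with the largest eigenvalues.
    X = np.asarray(data_matrix, dtype=float)
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k principal directions
    low_dim_matrix = X_centred @ top
    print(low_dim_matrix)
    return low_dim_matrix

def my_pca_plot(low_dim_matrix):
    # Scatter plot of the first two reduced dimensions.
    plt.scatter(low_dim_matrix[:, 0], low_dim_matrix[:, 1])
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()

iris = datasets.load_iris()
X = iris.data            # 150 by 4
X3 = my_pca(X, 3)        # 150 by 3
X2 = my_pca(X, 2)        # 150 by 2
my_pca_plot(X2)
```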
