CS 475 Machine Learning: Homework 5
Dimensionality Reduction, Structured Prediction, and FATE
Due: Thursday December 5, 2019, 11:59pm
100 Points Total
Version 1.0

Make sure to read from start to finish before beginning the assignment.

Introduction

This assignment consists of three parts:

• (45 points) A coding portion, where you will implement two dimensionality reduction techniques: PCA and LLE. You will also implement a K-Nearest Neighbors classifier to operate over your dimensionality-reduced data.

• (15 points) A Jupyter notebook where you will explore bias in word embeddings and implement a debiasing technique.

• (40 points) Written questions about HMMs, CRFs, RNNs, and FATE.

1 Programming (45 points)

1.1 K-Nearest Neighbor Classifier

You will implement a K-Nearest Neighbors (KNN) classifier. This is a very simple supervised model, which takes training data x_i ∈ X, a finite set of d-dimensional vectors, where each x_i is paired with a label y_i. "Training" a KNN is trivial: simply store the data. All of the "work" is done at prediction time. To predict a label for an example x:

1. Find the k nearest examples K(x) = {x^(1), ..., x^(k)} ⊆ X, where distance is measured by Euclidean distance to x.

2. Infer the label ŷ as the most common label among y^(1), ..., y^(k), where y^(i) is the label for x^(i) ∈ K(x). In case of a tie, select the label with the lowest integer index.

We have implemented the fit function for you (it just saves the data in the model object). Please implement the predict function for the KNN model in models.py. You should be able to run this without specifying a dr-algorithm (dimensionality reduction algorithm), which will just run a KNN model over your raw data. For example, the command:

python3 main.py --knn-k 5 --test-split dev --predictions-file simple-knn-preds.txt

should output line-by-line predictions for the dev split into simple-knn-preds.txt, similar to past homeworks. You could then evaluate your predictions by running:

python3 accuracy.py --labeled data/labels-mnist-dev.npy --predicted simple-knn-preds.txt

Since KNN uses distance for prediction, higher-dimensional data will be slower and will also suffer from the curse of dimensionality. We ask you to implement KNN because it will benefit from your dimensionality reduction. For example, you could run your KNN model on the regular 784-dimensional MNIST data, but this might take a while! We're going to attempt to fix that in the next two sections.
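To make the prediction rule above concrete, here is a minimal sketch of the KNN prediction step, assuming the stored training data and labels are available as numpy arrays; the function name and arguments are illustrative and do not match the required predict signature in models.py.

import numpy as np

def knn_predict(train_X, train_y, test_X, k):
    """Predict a label for each row of test_X by majority vote over the k
    nearest training rows (Euclidean distance, ties broken by lowest label)."""
    predictions = []
    for x in test_X:
        # Squared Euclidean distance from x to every training example.
        dists = np.sum((train_X - x) ** 2, axis=1)
        # Indices of the k closest training examples.
        nearest = np.argsort(dists)[:k]
        # np.bincount tallies votes per label; np.argmax returns the first
        # (i.e., lowest) label index among ties, matching the tie-break rule.
        votes = np.bincount(train_y[nearest].astype(int))
        predictions.append(int(np.argmax(votes)))
    return np.array(predictions)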
1.2 Principal Component Analysis

We discussed PCA as a dimensionality reduction technique in class. PCA is one of many different techniques, and the choice of method depends on the data and the intended use of the reduced-dimension dataset.

PCA relies on eigendecomposition, which factorizes a matrix into a canonical form consisting of two parts: eigenvalues and eigenvectors (see https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix). The eigenvectors determine the directions of the new feature space, and the eigenvalues determine their magnitude.

First, preprocess the data matrix: standardize each feature in the dataset to unit scale, defined as having a mean of zero and a variance of 1. Note that this is done for each feature index, not per example or over the entire dataset. You do this by subtracting the mean from all points and then dividing all points by the standard deviation. Since PCA yields a feature space that maximizes variance along the axes, it's important to ensure all features are measured on the same scale. (If the standard deviation is 0, which can happen in MNIST, you should not divide at all. This leaves every value of that feature at 0, which it probably was already.)

Second, calculate the covariance matrix Σ, which is a D × D matrix (where D is the number of features). Each element of Σ represents the covariance between two features. For a dataset of size N, each element of the matrix is defined as

\Sigma_{st} = \frac{1}{N - 1} \sum_{i=1}^{N} (x_{is} - \bar{x}_s)(x_{it} - \bar{x}_t)    (1)

Third, we are now ready for PCA, which uses an eigensolver to perform eigendecomposition on the covariance matrix (see Section 1.2.1 for implementation details). A diagonalizable square matrix A can be factorized as

A = Q \Lambda Q^{-1}    (2)

where the columns of Q are the eigenvectors and Λ is a diagonal matrix of the eigenvalues.

Finally, select which eigenvectors to use in the lower-dimensional space by taking the d eigenvectors with the top corresponding eigenvalues (sorted by absolute value). The eigenvalues indicate how much of the variance in the overall data each eigenvector accounts for. With these eigenvectors we can build a weight matrix W of size D × d, and then calculate our reduced feature space Y:

Y = XW    (3)

The result is a reduced-dimension data matrix Y of size N × d, where N is the number of data points.

Figure 1: An explanation of the linearity of PCA (from https://link.springer.com/article/10.1007%2Fs11692-018-9464-9).

1.2.1 Implementation Details

Most of PCA can be implemented using straightforward numpy operations. The most challenging step is the eigendecomposition. We recommend using scipy's eigensolver, https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eig.html. Additionally, you may use any function in either numpy or scipy.

You should implement PCA as a subclass of Model. Its fit function will take in both your training and test data and jointly reduce all of your data points together. Fit should then return the dimensionality-reduced data in the same order that it was passed in. This is important because we will split the data back into training and test splits to perform KNN. You do not need to implement the predict function for this model.

Here is an example of how your model should run after you've implemented PCA:

python3 main.py --knn-k 5 --dr-algorithm pca --target-dim 300 --test-split dev \
    --predictions-file pca-knn-preds.txt

You should find this runs a lot faster than without dimensionality reduction!
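Putting the four steps together, here is a minimal sketch of the PCA pipeline using scipy.linalg.eig, assuming the training and test points have already been stacked into a single N × D array; the function name and arguments are illustrative, and your actual implementation should live in your PCA subclass of Model.

import numpy as np
from scipy.linalg import eig

def pca_reduce(X, target_dim):
    """Reduce an N x D data matrix X to N x target_dim using PCA."""
    # Step 1: standardize each feature to zero mean and unit variance.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # do not divide where the std is 0
    X = (X - mean) / std

    # Step 2: D x D covariance matrix (equation 1); np.cov uses 1/(N-1).
    cov = np.cov(X, rowvar=False)

    # Step 3: eigendecomposition (equation 2). The covariance matrix is
    # symmetric, so its eigenvalues are real; drop the zero imaginary parts.
    eigenvalues, eigenvectors = eig(cov)
    eigenvalues = np.real(eigenvalues)
    eigenvectors = np.real(eigenvectors)

    # Step 4: keep the eigenvectors with the d largest |eigenvalues| as the
    # columns of the D x d weight matrix W, then project (equation 3).
    order = np.argsort(np.abs(eigenvalues))[::-1][:target_dim]
    W = eigenvectors[:, order]
    return X @ W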
1.3 Locally Linear Embedding (LLE)

While PCA strictly models linear variabilities in the data (see Figure 1), this strategy fails when the data lie on a low-dimensional "manifold" (https://en.wikipedia.org/wiki/Manifold), which can be a non-linear embedding of the data. Non-linear dimensionality reduction methods attempt to find this manifold and produce a low-dimensional space that encodes distances and relative position on the manifold.

Locally Linear Embedding (LLE) was introduced by Roweis and Saul in 2000. The original paper was published in Science, but you can find a more introductory tutorial here: https://cs.nyu.edu/~roweis/lle/papers/lleintro.pdf.

The key insight of LLE is that if you have enough data points on a manifold, you can assume that a point and its nearest neighbors lie in a patch of the manifold that is approximately linear. Thus, LLE aims to reduce dimensionality by finding linear weights that best reconstruct each point from its neighbors, and then finding low-dimensional points for which these same weights apply. LLE is illustrated in Figure 2.

Figure 2: An illustration of LLE (from http://science.sciencemag.org/content/290/5500/2323).

You should begin by preprocessing your data so that features are normalized with mean 0 and variance 1, as in PCA. The LLE algorithm can then be broken down into 3 steps:

1. For each data point x_i, determine the K nearest neighbors (where K is a hyperparameter fixed ahead of time).

2. For each data point x_i, compute the weights w_ij that best reconstruct x_i from its neighbors (each neighbor is x_j, and the sum over j runs over the neighbors of x_i), using the reconstruction error function

\varepsilon(W) = \sum_i \Big| x_i - \sum_j w_{ij} x_j \Big|^2    (4)

3. Compute the lower-dimensional vectors y_i that are best reconstructed by the w_ij computed in step 2, using an eigensolver to find the bottom nonzero eigenvectors that minimize the embedding cost function (same form as in step 2, but now with the w_ij fixed instead of free)

\Phi(Y) = \sum_i \Big| y_i - \sum_j w_{ij} y_j \Big|^2    (5)

We use X to refer to the input data matrix, which consists of N columns of data points of dimensionality D. We will be returning a low-dimensional embedding matrix Y, which is d × N where d ≪ D. x_i and y_i refer to the ith columns of X and Y respectively, which correspond to the ith data point. K is the number of neighbors we will consider. W is the reconstruction weight matrix, and w_ij is the entry in the ith row and jth column.

We now present the pseudocode, in the same three high-level steps outlined above (adapted from the authors' web page, https://cs.nyu.edu/~roweis/lle/algorithm.html). We use for loops for simplicity of pseudocode, but we encourage you to consider numpy speed-ups.

1. Find nearest neighbors
for i ∈ 1 ... N do
    Keep track of the indices of the K nearest neighbors as an array n_i^(1), ..., n_i^(K)
    for j ∈ 1 ... N do
        Calculate the Euclidean distance between x_i and x_j
        if x_j is among the K nearest then
            Assign j as a neighbor of i
        end if
    end for
end for

2. Solve for reconstruction weights
Initialize the reconstruction weight matrix W as an N × N matrix of zeros.
for i ∈ 1 ... N do
    Initialize Z as a D × K matrix.
    for k ∈ 1 ... K do
        Set the kth column of Z to be x_{n_i^(k)} − x_i
    end for
    Compute the local covariance C = Z^T Z
    Regularization term ε = 1e-3 × trace(C)
    C = C + ε I
    Solve the linear system C ŵ = 1 for ŵ (where 1 is a vector of ones)
    for k ∈ 1 ... K do
        w_{i, n_i^(k)} = ŵ_k / sum(ŵ), where ŵ_k is the kth element of ŵ
    end for
end for

3. Compute embedding coordinates
Create the sparse matrix M = (I − W)^T (I − W)
Find the bottom d + 1 eigenvectors of M (corresponding to the smallest eigenvalues)
Remove the eigenvector with eigenvalue 0
Set the rows of Y to be the remaining d eigenvectors

When we refer to "bottom eigenvectors," we mean the eigenvectors with the smallest eigenvalues; "top eigenvectors" have the largest. We can get these easily from any standard eigensolver, which we detail in the next section. Because M is symmetric, it has only real eigenvalues and eigenvectors, so you don't have to worry about complex numbers here.

1.3.1 Implementation Details

i. Just as a small tip, LLE can take a while to train on all 3,000 train examples + 1,000 test examples. It might be a good idea to reduce the data down to 1,000 examples total while you're developing the algorithm, to help you debug.

ii. Because we're working with a symmetric matrix, you'll want to use the eigensolver provided with scipy here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.eigsh.html#scipy.sparse.linalg.eigsh. Because we're looking for the eigenvectors corresponding to the smallest eigenvalues, we'll want to use the flag sigma=0.0 (this seems to work better than which='SM').

iii. Because the eigensolver is a numerical approximation, it is not guaranteed to recover exact zeros in the bottom eigenvalues. You may use np.finfo(float).eps as the lower bound.

iv. Just as you implemented a subclass of Model for PCA, you'll do the same for LLE, with the same behavior.

v. It is not always trivial to decide k (the number of nearest neighbors) in LLE. For this assignment, you can use k = 10. You should feel free to play with this, but this setting should work.
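The pseudocode translates fairly directly into numpy and scipy. Below is a minimal sketch, under the assumption that the standardized data is passed as an N × D array (rows are data points, the transpose of the D × N convention used above), so the embedding comes back as N × target_dim; the function name and arguments are illustrative rather than a required interface.

import numpy as np
from scipy.linalg import solve
from scipy.sparse.linalg import eigsh

def lle_reduce(X, n_neighbors, target_dim, reg=1e-3):
    """Reduce a standardized N x D data matrix X to N x target_dim with LLE."""
    N = X.shape[0]

    # Step 1: K nearest neighbors of every point, via squared Euclidean
    # distances computed with the ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b trick.
    sq = np.sum(X ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :n_neighbors]

    # Step 2: reconstruction weights minimizing equation (4).
    W = np.zeros((N, N))
    for i in range(N):
        Z = (X[neighbors[i]] - X[i]).T                     # D x K centered neighbors
        C = Z.T @ Z                                        # local K x K covariance
        C = C + reg * np.trace(C) * np.eye(n_neighbors)    # regularize C
        w_hat = solve(C, np.ones(n_neighbors))             # solve C w = 1
        W[i, neighbors[i]] = w_hat / np.sum(w_hat)         # weights sum to 1

    # Step 3: embedding coordinates minimizing equation (5).
    I = np.eye(N)
    M = (I - W).T @ (I - W)
    # Bottom target_dim + 1 eigenvectors; sigma=0.0 targets eigenvalues near zero.
    vals, vecs = eigsh(M, k=target_dim + 1, sigma=0.0)
    order = np.argsort(vals)
    vecs = vecs[:, order]
    # Drop the (numerically) zero eigenvalue's constant eigenvector.
    return vecs[:, 1:]                                     # N x target_dim embedding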
As before, here is an example of how we might run your code:

python3 main.py --knn-k 5 --dr-algorithm lle --lle-k 10 --target-dim 300 --test-split dev \
    --predictions-file pca-knn-preds.txt

1.4 Programming Details

As before, your code will be evaluated on its predictions on held-out test data. We have provided you with the test data, so you can ensure that your code does not fail when running over it, but not the labels. You may compare the scores of your classifier using different dimensionality reduction techniques on development data with your peers on Piazza. To make sure that your predictions output is in the correct format, you should test that accuracy.py can run on your dev predictions, as described in the KNN instructions.

We will evaluate your code in 2 settings: (1) classification accuracy on MNIST test data when using PCA dimensionality reduction, and (2) classification accuracy on MNIST test data when using LLE dimensionality reduction.

1.4.1 Data

You will be working with a subset of the popular MNIST digit recognition dataset (https://en.wikipedia.org/wiki/MNIST_database). This is a multiclass classification dataset composed of 10 classes (0-9), where each example is a 28 by 28 greyscale image of a handwritten digit. Our version of the dataset comes in 3 splits, of which you have 5 files: mnist-train.npy, mnist-dev.npy, mnist-test.npy, labels-mnist-train.npy, labels-mnist-dev.npy.

The training set consists of 3,000 labeled examples of handwritten digits. The dev and test splits each contain 1,000 examples. While this is not a lot of training data (the actual MNIST dataset has 60,000 examples), we've cut things down considerably to help you run your code faster. You might be surprised at the accuracy of a simple KNN model in this setting. Your implementations should not take longer than 7-8 minutes to run with knn-k=5 and lle-k=10 (LLE will have a much longer runtime than PCA).

Additionally, your data directory should contain glove.6B.100d.txt, which you will use for the Notebook portion of this assignment.

1.4.2 Command Line Arguments

Your code should contain the following command line arguments, which you should not change:

• --data: Points to the directory containing the MNIST data splits.

• --predictions-file: The file that your predictions will be stored in.

• --test-split: The split (test or dev) that your KNN model will predict over.

• --knn-k: A hyperparameter for your KNN model. How many nearest neighbors to consider during prediction.

• --dr-algorithm: The type of dimensionality reduction algorithm to use (pca or lle).

• --target-dim: The number of dimensions to keep with your dimensionality reduction technique.

• --lle-k: A hyperparameter for your LLE algorithm, as described above.
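For reference, here is a minimal sketch of how these flags might be declared with argparse; the defaults, help strings, and function name are illustrative assumptions, and the starter code you were given presumably already defines the real parser.

import argparse

def build_parser():
    """Illustrative parser for the command line flags listed above."""
    parser = argparse.ArgumentParser(description="HW5: dimensionality reduction + KNN")
    parser.add_argument("--data", type=str, default="data",
                        help="Directory containing the MNIST data splits.")
    parser.add_argument("--predictions-file", type=str, required=True,
                        help="File that line-by-line predictions are written to.")
    parser.add_argument("--test-split", type=str, choices=["dev", "test"], default="dev",
                        help="Split that the KNN model predicts over.")
    parser.add_argument("--knn-k", type=int, default=5,
                        help="Number of nearest neighbors used at prediction time.")
    parser.add_argument("--dr-algorithm", type=str, choices=["pca", "lle"], default=None,
                        help="Dimensionality reduction algorithm; omit to use raw pixels.")
    parser.add_argument("--target-dim", type=int, default=300,
                        help="Number of dimensions kept by the reduction.")
    parser.add_argument("--lle-k", type=int, default=10,
                        help="Number of neighbors used by LLE.")
    return parser

if __name__ == "__main__":
    # argparse exposes --predictions-file as args.predictions_file, and so on.
    args = build_parser().parse_args()
    print(args)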
2 Jupyter Notebook (15 points)

Lastly, you will explore bias in word embeddings and implement a debiasing method in the Jupyter notebook we've provided to you, called FATE.ipynb. Please complete the notebook by filling in all the TODOs in the coding cells and answering the questions marked with red FILL IN text. You will submit your notebook to Gradescope.

3 Analytical (40 points)

Question 1) (15 points) Expressive power of sequence classifiers

Consider the following sequence classification task. We are given an arbitrarily long sequence of tokens from the alphabet X = {(, ), a, b, c, ..., z}. In other words, each sequence x ∈ X*, where (·)* denotes the Kleene closure operator. Our sequence classification task is to predict whether or not a given x is well formed, meaning that the parentheses are balanced. Let Y = {Yes, No} denote the prediction.

Examples:
x_1, ..., x_5 = ( ( ( ( (        ⇒ y = No
x_1, ..., x_3 = ) ( a            ⇒ y = No
x_1, ..., x_4 = ( ) ( )          ⇒ y = Yes
x_1, ..., x_8 = ( a ( b ) ( ) )  ⇒ y = Yes

Notice that we are not predicting a y label for each position; we are classifying the entire sequence with a single y label. However, the model for predicting the final classification will make use of hidden states at each position, z_1 ... z_T, where T = |x|. Each latent state z_t comes from a set Z, and the specific set depends on the model family. For HMMs and CRFs, Z is a discrete set of symbols of some fixed size. For RNNs, Z is a real-valued vector of some fixed dimensionality. The latent state may be used to track unobserved properties of the input sequence (e.g., some sort of latent representation of how the parentheses in the expression are matched). When we are given a new sequence x to classify, we infer the hidden z states according to our model parameters.
We use the final state z_T to predict whether or not the sequence was well formed. You can make this prediction either by examining the final state itself, or by using the final state as input to a classifier that predicts Yes or No.

1.1: For each of the following families of sequence models, state whether there exist model parameters that can accurately capture the desired behavior. Explain.

(a) Hidden Markov Model

(b) Conditional Random Field

(c) Recurrent Neural Network

1.2: How would your answers to question 1.1 change if the sequences were upper bounded in length, |x| ≤ T_max?

Question 2) CRF vs. HMM (15 points)

Consider the following sequence models:

• An HMM with K = |Z| latent states and M = |X| possible observations.

• A linear-chain CRF over the same state space as the HMM, but with feature function f(x, y_t, y_{t-1}) ∈ R^D.

For the following questions, give a big-O expression that is a function of K, D, M, and the input length T.

(a) Computing the posterior distribution over a specific label, p_HMM(Y_t = y | x)

(b) Computing the posterior distribution over a specific label, p_CRF(Y_t = y | x)

(c) How many parameters are in the HMM?

(d) How many parameters are in the CRF?

(e) List one advantage and one disadvantage of CRFs over HMMs.

Question 3) FATE (10 points)

Suppose you work for a bank and have been tasked with building a machine learning system that automatically approves or rejects credit card applications. You are given a dataset that contains a sample of previously processed applications, for which a human decision maker decided to approve or reject. Each example contains information from the application, such as education level, zip code, and income level, as well as a label indicating the decision (accept/reject). The demographics of the applicant (gender, age, race, and ethnicity) are not included in the dataset or as features in the data.

(a) Despite the fact that demographic information is not included in the dataset, you are concerned that the resulting classifier may be biased against a demographic group. Name two potential sources of such bias.

(b) Take the same dataset, but add the observation of whether or not the applicant remained in good credit standing for 5 years after having their application approved (with rejected candidates labeled as "unknown, due to rejection"). Suppose we trained a model to predict good credit standing rather than the human decision maker's judgment. How would that change the potential for bias? Explain.

(c) Would using an interpretable machine learning model necessarily reduce bias and increase fairness?

4 What to Submit

In this assignment you will submit three things.

1. Submit your code (.py files) to cs475.org as a zip file. Your code must be uploaded as code.zip with your code in the root directory. By 'in the root directory,' we mean that the zip should contain *.py at the root (./*.py) and not in any sort of substructure (for example hw1/*.py). One simple way to achieve this is to zip using the command line, where you include files directly (e.g., *.py) rather than specifying a folder (e.g., hw1):

zip code.zip *.py

A common mistake is to use a program that automatically places your code in a subfolder. It is your job to make sure you have zipped your code correctly.
We will run your code using the exact command lines described earlier, so make sure it works ahead of time, and make sure that it doesn't crash when you run it on the test data. A common mistake is to change the command line flags; if you do this, your code will not run. Remember to submit all of the source code, including what we have provided to you. We will include requirements.txt and provide the data, but nothing else.

2. Submit your writeup to gradescope.com. Your writeup must be compiled from LaTeX and uploaded as a PDF. The writeup should contain all of the answers to the analytical questions asked in the assignment. Make sure to include your name in the writeup PDF and to use the provided LaTeX template for your answers. You will submit this to the assignment called "Homework 5: Dimensionality Reduction, Structured Prediction, and FATE: Written".

3. Submit your Jupyter notebook to gradescope.com. Create a PDF version of your notebook (File → Download As → PDF via LaTeX (.pdf)). Be sure you have run the entire notebook so we can see all of your output. You will submit this to the assignment called "Homework 5: FATE: Notebook".

5 Questions?

Remember to submit questions about the assignment to the appropriate group on Piazza: https://piazza.com/class/jkqbzabvyr15up.
