- June 13, 2020

Assignment 2 – Foundations of Machine Learning
CSCI3151 – Dalhousie University

Q1 (30%) Gradient descent – Logistic regression

In this question we are going to experiment with logistic regression. This exercise focuses on the inner workings of gradient descent using a cross-entropy cost function, as covered in class.

a) Using the Pima Indians data set, first set aside a random 20% of your data instances for validation. Then apply a feature selection algorithm that evaluates feature importance using Pearson correlation (scipy documentation). Extract the two most important features according to this measure.

b) We want to train a logistic regression model to predict the target feature Outcome. It’s important that no other external package is used for this question part (pandas and numpy are OK). We want to find the weights for the logistic regression using a hand-made gradient descent algorithm. We will use cross-entropy as the cost function, and the logistic cross-entropy to compute the weight update during gradient descent. It is OK to reuse as much as you need from the code you developed for Assignment 1. Unlike in Assignment 1, we are now using a random 20% of your data instances for validation. Your function should be able to return the updated weights and bias after every iteration of the gradient descent algorithm.
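The update described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the required solution: it uses plain Python lists so that it runs without any package (numpy would be the natural choice on real data), it assumes binary 0/1 targets and full-batch updates, and the helper names sigmoid, predict_proba, cross_entropy and gradient_step are hypothetical and not part of the interface requested in this assignment.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(rows, weights, bias):
    """Predicted probability of class 1 for each row of features."""
    return [sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            for row in rows]


def cross_entropy(rows, targets, weights, bias, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    total = 0.0
    for p, y in zip(predict_proba(rows, weights, bias), targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)


def gradient_step(rows, targets, weights, bias, learning_rate):
    """One full-batch gradient descent update; returns (new_weights, new_bias)."""
    n = len(targets)
    # For the logistic model, d(cost)/dz for each instance is (p - y).
    errors = [p - y for p, y in zip(predict_proba(rows, weights, bias), targets)]
    grad_w = [sum(e * row[j] for e, row in zip(errors, rows)) / n
              for j in range(len(weights))]
    grad_b = sum(errors) / n
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - learning_rate * grad_b
```

Calling gradient_step in a loop and evaluating cross_entropy after each call gives one cost value per iteration; a cost that rises or oscillates between iterations is usually a sign that learning_rate is too large.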
Your function should be defined as follows:

    def LRGradDesc(data, target, weight_init, bias_init, learning_rate, max_iter):

It should print lines as indicated below (note the last line with the weights):

    Iteration 0: [initial_train cost], [train accuracy], [validation accuracy]
    Iteration 1: [train cost after first iteration], [train accuracy after first iteration], [validation accuracy after first iteration]
    Iteration 2: [weights after second iteration], [train cost after second iteration], [train accuracy after second iteration], [validation accuracy after second iteration]
    …
    Iteration max_iter: [weights after max_iter iteration], [train cost after max_iter iteration], [train accuracy after max_iter iteration], [validation accuracy after max_iter iterations]
    Final weights: [bias], [w_0], [w_1]

Note that you may want to print only every 100 or every 1000 iterations if max_iter is fairly large (but you should not run more iterations than indicated by max_iter).

c) Discuss how the choice of learning_rate affects the fitting of the model.

d) Compare your model with one that uses a machine learning library to compute logistic regression.

e) Retrain your model using three features of your choice. Compare both models using an ROC curve (you can use code from here to draw the ROC curve).

Q2 (30%) Multi-class classification using neural networks

In this question you will experiment with a neural network in the context of text classification, where a document can belong to one out of several possible categories. Your main goal is to try different hyperparameters in a systematic manner so that you can propose a network configuration that is properly justified. You will experiment with the Reuters dataset, which can be loaded directly from Keras:

    from keras.datasets import reuters
    (train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

a) Experiment with different hyperparameters and report the best accuracy you find.
The most important hyperparameters that you need to experiment with in this question part are: number of layers, nodes per hidden layer, learning rate, and number of epochs.

b) Describe how convergence changes when you vary the size of your mini-batch. A plot showing cost as a function of the number of epochs would be enough. Discuss the reasons for this behaviour.

c) Experiment with different regularization options (e.g. L2 and dropout). You may need to make your network larger if you don’t find much benefit from applying regularization.

Note: we recommend controlling your initialization parameters by means of a seed (https://keras.io/api/layers/initializers/).

Q3 (10%) Computational graph (no code involved)

This question aims at checking your understanding of defining arbitrary network architectures and computing any derivatives involved in optimization. Consider a neural network with N input units, N output units, and K hidden units. The activations are computed as follows:

    [equations missing from the source]

where σ denotes the logistic function, applied elementwise. The cost involves a squared difference with the target s (with a 0.5 factor) and a regularization term that accounts for the dot product with respect to an external vector r. More concretely:

    [equations missing from the source]

a) Draw the computation graph relating x, z, h, y, the regularization term, the squared-error term, and the total cost.

b) Derive the backpropagation equations for computing the derivative of the cost with respect to W^(1). To make things simpler, you may use σ’ to denote the derivative of the logistic function.

Q4 (30%) Tuning generalization

In this question you will construct a neural network to classify a large set of low-resolution images. Unlike in Q2, here we suggest a neural network to start experimenting with, but we would like you to describe the behaviour of the network as you modify certain parameters. You will be reproducing some concepts mentioned during the lectures, such as the one shown on slide 8 of the lecture on “Ensembles, regularization and feature selection” from Week 4.
Use the CIFAR-100 dataset (available from Keras)

    from keras.datasets import cifar100
    (x_train_original, y_train_original), (x_test_original, y_test_original) = cifar100.load_data(label_mode='fine')

to train a neural network with two hidden layers using the ReLU activation function, with 500 and 200 hidden nodes, respectively. The output layer should be defined according to the nature of the targets.

a) Generate a plot that shows average precision for the training and test sets as a function of the number of epochs. Indicate what a reasonable number of epochs would be.

b) Generate a plot that shows average precision for the training and test sets as a function of the number of weights/parameters (# hidden nodes). For this question part, you will be modifying the architecture that was given to you as a starting point.

c) Generate a plot that shows average precision for the training and test sets as a function of the number of instances in the training set. For this question part, you will be modifying your training set. For instance, you can run 10 experiments: the first uses a random 10% of the training data, the second a random 20%, and so on until you use the entire training set. Keep the network hyperparameters constant during these experiments.

d) Based on all your experiments above, define a network architecture and report accuracy and average precision for all classes.

e) Can you improve test prediction performance by using an ensemble of neural networks?

Submitting the assignment (REVISED)

Note that you will have four separate Assignments 2 on Brightspace, i.e. one for each question (A2-Q1, A2-Q2, A2-Q3 and A2-Q4).

1. Your assignment, as a single .ipynb file including your answers, should be submitted for each question before the deadline on Brightspace. Use markdown syntax to format your answers.

2. You can submit multiple editions of your assignment. Only the last one will be marked.
It is recommended to upload a complete submission, even if you are still improving it, so that you have something in the system if your computer fails for whatever reason.

3. IMPORTANT: PLEASE NAME YOUR PYTHON NOTEBOOK FILE AS: LastName-FirstName-Assignment-N-Q.ipynb, for example Soto-Axel-Assignment-2-1.ipynb (for the first question of the second assignment). A penalty applies if the format is not correct.

4. The markers will enter your marks and their overall feedback on Brightspace. If there is any important feedback, it will be given to you; otherwise, refer to the model solutions.

Marking the assignment

Criteria and weights. Each criterion is marked by a letter grade. The overall mark is the weighted average of the grades of the criteria.

For the experimental questions:

0.2 Clarity: All steps are clearly described. The origin of all code used is clearly stated. Markdown is used effectively to format the answer, making it easier to read and grasp the main points. Links have been added to all online resources used (markdown syntax is: [AnchorText](URL) ).

0.2 Justification: Parameter choices or processes are well justified.

0.2 Results: The results are complete and presented in a manner that is easy to understand. The answer is selective in the amount and diversity of the experimental results presented: only key results that support the insights are shown. There is no need to present every single experiment you carried out, only the interesting results where the behaviour of the ML model varies.

0.4 Insights: The insights obtained from the experimental results are clearly explained and connected with the concepts discussed in the lectures. The insights can also include statistical considerations (separate training-test data, cross-validation, variance). A preliminary investigation of the statistical properties of the attributes (e.g. histogram, mean, standard deviation) is included.
For the theoretical questions (Q3):

0.6 Correctness: The answer is correct. The explanation is clear and precise.

0.4 Neatness of explanation: The explanation is well written, well structured and easy to read. It uses well-defined and consistent notation.