- August 1, 2020

PS4 Coding

This assignment will have us build a deep convolutional neural network similar in architecture to LeNet-5. The network will use the softmax function to perform 10-class image classification on the MNIST data set (the original MNIST, not the fashion_mnist we've been working with thus far).

Before you get started, please make sure you have the following packages installed.

Packages to install:
1. numpy
2. keras
3. tensorflow
4. matplotlib

For keras and tensorflow, please refer to this link (https://docs.floydhub.com/guides/environments/) to make sure you install versions that are compatible with each other. I would highly recommend getting tensorflow==1.14.1 and the compatible keras version. The exact Python version, as long as it's Python 3+, should not impact your ability to use these two packages.

Structure of Assignment

What's new compared to PS3:
1. Convolutional kernels
2. Ensemble training

Terminology

Please look over the PowerPoint under Piazza > Resources > ConvolutionalNetwork.ppt to make sure you understand exactly what I mean when I type the following terms:
1. Applying a kernel/filter
2. Kernel/Filter
3. Max Pool
4. Feature map
5. Convolution

Network Architecture

[Figure: network architecture]

You will be implementing variations of LeNet-5 by hand, with flexible kernel shapes. In addition, you will be implementing ensemble learning using LeNet-5. The static architecture can be seen above:
1. 2 convolutional layers, each followed by a max-pooling layer
2. 3 fully connected layers with an output shape of 10

You will notice that each of the hidden convolutional outputs and max-pooling outputs is ??x?? or ?x? in terms of its dimensions. That is intentional, as your first task is to figure out exactly what those are.
1. Kernel 1: 4 x 4, padding = 0, stride = 1
2. Kernel 2: 4 x 4, padding = 0, stride = 1
3. MaxPool1: 2 x 2, padding = 0, stride = 2
4. MaxPool2: 2 x 2, padding = 0, stride = 2

These kernel and maxpool sizes are values to start with. After you get the network working with these shapes, you will need to adjust it so it can take any valid kernel and any valid max pool shape. Here, we define valid as: out_shape is an integer greater than 1, where

$$\text{out\_shape} = \frac{\text{in\_shape} + 2 \cdot \text{padding} - \text{kernel\_shape}}{\text{stride}} + 1$$

Static variables:
1. Number of layers
2. Output shape

Flexible variables:
1. Kernel shapes
2. MaxPool filter shapes
3. Number of kernels per conv layer
4. Number of nodes per FC layer

Assignment Grading and Procedure Recommendation

This assignment overall has ??? points for all the methods you have to implement. Imposed on this total are the following percentages:
1. If you correctly implement all methods, and you can correctly apply one kernel per conv layer, you will earn 80% of the points
2. Correct implementation of multiple kernels applied at each layer will earn you an additional 10%
3. Correct implementation of ensemble learning will earn you an additional 5%
4. Experimentation on parameters will earn you the last 5% of the points. See the bottom of this document for details

For instance, if you correctly implement all methods but have neither multiple kernels nor ensemble training, you will receive (87 x 0.80) out of the possible 87. If you accomplish the situation above but also correctly add multiple kernels and ensemble training, you will then receive (87 x 0.95) out of the possible 87.

Here is how I recommend going about this assignment:
1. Implement the network with batch training, weight decay, bias terms, and one kernel applied to each conv layer
2. Add multiple kernels per conv layer functionality
3. Add different kernel shape functionality
4. Implement ensemble learning
5. Experimentation

A note on "different kernel shape": For a convolutional network, all kernels applied at the same layer have the same shape. When I say that your network should be able to handle different kernel shapes, I mean that if you change the kernel shape at a given layer, all kernels applied at that layer adopt the new shape. For instance, Kernel1 begins as a 4×4 kernel. This means that if I wanted to apply multiple Kernel1's to the input layer, I would apply multiple 4×4 kernels (they are all the same shape). If I change my network such that Kernel1 is now a 6×6 kernel, then ALL applications of Kernel1 will now be 6×6.

Data Format

You will notice that this assignment has very few headers and comments. I am leaving it up to you to decide exactly what info each function needs as parameters, and what each function does and returns. Feel free to use the previous problem sets as models for how to structure your code. I recommend you continue to format your data as N x M:
1. N = number of features
2. M = number of data points

Loops during multi-kernel convolution

So that you don't get bogged down dealing with 3D and 4D matrix multiplication, I will say the following: the application of a single kernel should not require any loops (straight matrix multiplication). However, when you reach the stage of applying multiple kernels to a single layer, I recommend you simply loop through all kernel matrices for that layer and apply them one at a time. This means that if your data begins as an N x M (2D) matrix, then each kernel application will produce an (N1 x M) 2D matrix, where N1 = the size of the flattened feature map of the kernel application. These (N1 x M) matrices can be kept separately, rather than combined into one 3D matrix.

Data management

We will be using the following data sets for this problem set:
1. MNIST (the most popular computer vision data set)
2. Dummy data (for testing purposes)

Graded Exercise (15 points total, 3 points each) - implement the following functions for data parsing:

In [2]:
from keras.datasets import mnist
import tensorflow as tf
from tensorflow import keras
from matplotlib import pyplot
import numpy as np

####BEGIN CODE HERE####
def gen_dummy():
    '''
    Dummy data is exceptionally useful for testing whether or not your network
    behaves as expected. For dummy data, you should generate a few (<= 5)
    input/output pairs that you can use to test your forward and backward
    propagation algorithms.

    output:
    dummy_x = an NxM np matrix of very simple data, both dimensions of your choosing
    dummy_y = a (M, ) np array with the corresponding labels
    '''
    dummy_x = []
    dummy_y = []
    ####BEGIN CODE HERE####

    ####END CODE HERE####
    return dummy_x, dummy_y

def load_mnist():
    '''
    look up how to load the mnist data set via keras
    '''
    return 0

def flatten_normalize():
    '''
    convert the images from an N1xN1xM to an NxM format, where N1 = square_root(N),
    and normalize
    '''
    return 0

def subset_mnist_training():
    '''
    Return 100 training samples from each of the 10 classes, 1000 samples all together
    '''
    return 0

def subset_mnist_testing():
    '''
    Return 20 testing samples from each of the 10 classes, 200 samples all together
    '''
    return 0
####END CODE HERE####

Graded exercise (28 points total, 4 points each): Complete the following helper functions

You will notice that there are no parameters; it is up to you to determine what each function needs. You will also notice there isn't much explanation as to what each function does. That is because you should determine what each function takes as parameters and what it returns.
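When working out which shapes your helpers need to handle, it can help to check candidate kernel and maxpool sizes against the definition of "valid" given under Network Architecture. The following is a minimal sketch, not one of the graded functions. It assumes the usual size formula out_shape = (in_shape + 2 * padding - kernel_shape) / stride + 1, and the names out_shape and is_valid, as well as the example sizes, are made up for illustration and are not the assignment's architecture.

```python
from fractions import Fraction


def out_shape(in_shape, kernel_shape, padding=0, stride=1):
    """Exact value of (in_shape + 2*padding - kernel_shape) / stride + 1."""
    # Fraction keeps the result exact, so non-integer sizes are easy to detect.
    return Fraction(in_shape + 2 * padding - kernel_shape, stride) + 1


def is_valid(in_shape, kernel_shape, padding=0, stride=1):
    """A shape is valid when out_shape is an integer greater than 1."""
    size = out_shape(in_shape, kernel_shape, padding, stride)
    return size.denominator == 1 and size > 1


# Illustrative sizes only (hypothetical, not taken from the assignment).
print(out_shape(10, 3))            # 10 wide input, 3 wide kernel, stride 1
print(out_shape(8, 2, stride=2))   # 8 wide input, 2 wide pool, stride 2
print(is_valid(7, 2, stride=2))    # (7 - 2) / 2 + 1 is not an integer
```

Because the same formula applies to convolution and max-pooling, one helper like this can be chained layer by layer to check a whole stack of kernel and pool shapes before training.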
The only thing I ask you not to change in the helper functions below is the function name.

In [4]:
####BEGIN CODE HERE####
def log_cost():
    '''
    computes the log cost of the current predictions using the labels (same as PS3)
    '''
    return 0

def softmax():
    '''
    computes the softmax of the input (same as PS3)
    '''
    return 0

def ReLU():
    '''
    computes the ReLU of the input (same as PS3)
    '''
    return 0

def ReLU_prime():
    '''
    computes the ReLU' of the input (same as PS3)
    '''
    return 0

def kernel_to_matrix():
    '''
    converts a kernel to its matrix form
    '''
    return 0

def max_pool():
    '''
    applies max-pooling to an input image
    '''
    return 0

def max_pool_backwards():
    '''
    takes the output of a maxpool and projects it back to the original shape.
    See the PPT slides on convolutional backprop if you have no idea what I'm
    talking about.
    '''
    return 0
####END CODE HERE####

Graded exercise (6 points total, 2 points per function) - Complete the initialization functions

In [5]:
####BEGIN CODE HERE####
def kernel_initialization():
    '''
    returns a kernel with the specified height and width.
    The values of the kernel should be initialized using the same formula
    as the He_initialize_weight() function
    '''
    return 0

def He_initialize_weight():
    '''
    (same as PS3) returns a weight matrix with the passed in dimensions
    '''
    return 0

def bias_initialization():
    '''
    (same as PS3) returns a bias matrix of the passed in dimensions
    '''
    return 0
####END CODE HERE####

Forward and BackProp functions

[Figure: forward and backprop reference image]

Graded Exercise (28 points total, 4 points each) - complete the following functions:
1. predict()
2. delta_Last()
3. delta_el()
4. dW()
5. db()
6. weight_update()
7. bias_update()

For full credit for predict, you will need to incorporate the bias terms correctly, and for weight_update, you will need to correctly utilize the weight_decay parameter. You may find the image above helpful when implementing the backpropagation methods.

In [6]:
####BEGIN CODE HERE####
def predict():
    '''
    minimum output: the predictions made by the network
    You are free to return more things from this function if you see fit.
    hint: you will need to return all intermediate computations, not just the output.
    To figure out what you need to return, look at what intermediate results
    you need to compute backpropagation.
    You can return more than one variable with the following syntax:
    return var1, var2, ..., varN
    '''
    return 0

def delta_Last():
    '''
    task: compute the error term for ONLY the output layer
    '''
    return 0

def delta_el():
    '''
    task: compute the error term for any hidden layer
    '''
    ####BEGIN CODE HERE####
    return 0
    ####END CODE HERE####

def dW():
    '''
    task: compute the gradient for any weight matrix
    '''
    return 0

def db():
    '''
    task: compute the gradient for any bias term
    '''
    return 0

def weight_update():
    '''
    task: update each of the weight matrices, and return them in a variable,
    name of your choosing
    '''
    return 0

def bias_update():
    '''
    task: update each of the bias terms, and return them in a variable,
    name of your choosing
    '''
    return 0
####END CODE HERE####

TRAINING

Graded Exercise (10 points): Complete the training function below

In [4]:
def train():
    '''
    IN ADDITION: please have the train function output a graph of the cost
    using matplotlib.pyplot.
    If everything works correctly the cost should be monotonically decreasing
    '''
    ####BEGIN CODE HERE####

    ####END CODE HERE####
    return 0

In [7]:
'''
The section below is where you should set up all your weights, biases,
parameters, and inputs/labels.
Then pass all the relevant parameters to train, run it, and debug it.
Remember: no aspect of the network size should be hard coded except the
output layer (which should always have ten nodes). Everything else should be
adjustable via variables like the ones provided. The size of your input layer
will be dictated by which data set you are using. Once again, that parameter
should be in terms of the variables, and not a hard coded size
'''
output_nodes = 10
####BEGIN CODE HERE####

####END CODE HERE####

Testing

There are two steps to the testing process: first, we need to take the output of our network and make a decision (class 0, class 1, ..., or class 9), and second, we need to measure how well our network performs on the testing set. Remember that our network outputs 10 probability values, all between 0 and 1, and we need to turn this vector of ten elements into a single output: the class the network predicts for the data point. Simply pick the index of the largest probability value as the decision.

Graded Exercise (5 points total) - implement decision (1 point) and test (4 points)

In [6]:
####BEGIN CODE HERE####
def decision(prediction):
    '''
    input: a (10, M) matrix where column i is the softmax output for
    data point i, 0 <= i < M
    output: a (M, ) np array where each element is the highest probability
    class from prediction
    '''
    return 0

def test():
    '''
    output: the accuracy of your model on the unseen x and y pairs
    '''
    accuracy = 0  # placeholder; replace with your computed accuracy
    return accuracy
####END CODE HERE####

In [ ]:
'''
Here is some space for you to call and test your test/decision functions
'''
####BEGIN CODE HERE####

####END CODE HERE####

Ensemble learning: This is a fairly common technique to try to promote regularization in a system.
Simply put, you train multiple networks (either of the same or different sizes/shapes) on the same data. Then, you pass each of the networks a testing set and initiate a "vote". The voting procedure is: for each testing point, the class with the majority vote wins. If none of your models agree on a class, then you just randomly pick from their decisions. Tie breakers are also determined randomly. To receive full credit for ensemble learning, simply train a minimum of 5 unique networks and correctly code a voting process.

In [ ]:
####BEGIN CODE HERE####
'''
ensemble training and testing space
'''
####END CODE HERE####

Experimentation explanation: What are you experimenting on? Any aspect of your network that is not learned by SGD, and is not the number of layers of the network, is free for experimentation. Learning rate, number of training samples, epochs, and nodes per layer are a few examples. All I ask in this section is that you try to maximize your performance on the fashion_mnist data set and the mnist data set on at least 60 testing samples, and keep track of what you found below. The write-up can be as simple as: "The best I got with mnist was (accuracy) with these parameters: (list of parameters)". But ideally you would talk a bit more about the trends, such as "I noticed that if I kept decreasing my learning rate, the performance on X dataset would improve until a certain point, then would become worse". There are a lot of ways to get full credit on this experimentation portion, as there is no dedicated format I am requesting. There is no benchmark. There is no "you must get 100% accuracy". I simply ask you to see how well your models can do with the restrictions in place.

Best of luck,
Ryan

Experimentation notes: This cell has been left in text format for you to freely edit and keep track of your experimentations.
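Returning to the ensemble voting procedure described earlier, the rule "most common class wins, ties and total disagreement are settled randomly" can be sketched as follows. This is only one possible reading of that rule, not the required implementation. The function name majority_vote, the (K, M) layout of the decisions array (one row per model, one column per testing point), and the example votes are all assumptions made for illustration.

```python
import numpy as np


def majority_vote(decisions, rng=None):
    """Combine per-model class decisions into one decision per testing point.

    decisions: (K, M) integer array; row k holds model k's predicted class
               for each of the M testing points (assumed layout).
    Returns an (M, ) array holding the winning class for each point.
    """
    # RandomState is used so the sketch also runs on older numpy versions.
    rng = np.random.RandomState() if rng is None else rng
    decisions = np.asarray(decisions)
    winners = np.empty(decisions.shape[1], dtype=int)
    for j in range(decisions.shape[1]):
        classes, counts = np.unique(decisions[:, j], return_counts=True)
        # Every class sharing the top count is a candidate. When no two
        # models agree, all decisions are tied and one is picked at random.
        tied = classes[counts == counts.max()]
        winners[j] = rng.choice(tied)
    return winners


# Hypothetical decisions from 5 models on 3 testing points.
votes = np.array([[3, 1, 7],
                  [3, 2, 7],
                  [3, 1, 8],
                  [5, 1, 9],
                  [3, 4, 7]])
print(majority_vote(votes))  # [3 1 7]
```

Each row of votes would typically come from applying decision() to one trained network's predictions. Passing a seeded np.random.RandomState as rng makes the random tie breaks reproducible while debugging.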