School of Electronics and Computer Science, University of Southampton
COMP6245 (2019/20): Foundations of Machine Learning

Lab 4 (of 6)
Issue: 30 October 2019
Deadline: 10 November 2019 (10:00)
Feedback: 16 November 2019

In this lab, you are expected to work independently; i.e. you may only discuss with, or ask questions of, a demonstrator or the lecturer. You are, of course, free to refer to the cited texts or access information from Web-based resources (indeed, this is recommended).

Objective
• Implementing linear regression
• Regularization using quadratic and sparsity-inducing penalties
• Implementing sparse regression on a realistic problem in chemoinformatics

1 Linear Least Squares Regression

We will work with the Diabetes dataset from the UCI Machine Learning Repository [1], taken from the package sklearn. Load the data and inspect the features and targets. It is usually a good idea to plot a few histograms of the targets and pair-wise scatters of the features in any new problem you are tasked to solve.

• Implement a linear predictor solved by the pseudo-inverse method, a = (Y^T Y)^{-1} Y^T f, where Y is the N × p input matrix and f is the N × 1 vector of responses.
• Solve the same problem using the linear model from sklearn and compare the results.

2 Regularization

Tikhonov regularization minimizes the mean squared error with a quadratic penalty on the weights [2] (available online at https://web.stanford.edu/~hastie/Papers/ESLII.pdf):

    min_a ||f − Y a||_2^2 + γ ||a||_2^2

Derive and implement a regularized regression (one way of carrying out the derivation is sketched below). Show, using two bar graphs of the weights drawn side by side on the same scale, how the two solutions differ.

[Figure 1: Solutions of Linear and Regularized Regressions]
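As a guide, the derivation requested above follows the standard least-squares argument; a minimal sketch, consistent with the closed form used in Appendix snippet 3, is:

    J(a) = ||f − Y a||_2^2 + γ ||a||_2^2
         = (f − Y a)^T (f − Y a) + γ a^T a
    ∇_a J = −2 Y^T (f − Y a) + 2 γ a = 0
    ⟹ (Y^T Y + γ I) a = Y^T f
    ⟹ a = (Y^T Y + γ I)^{-1} Y^T f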
3 Sparse Regression

L1 regularization is a method for achieving sparse solutions [3]. It minimizes:

    min_a ||f − Y a||_2^2 + γ ||a||_1

We will use the sklearn package for implementing its solution. For the Diabetes problem considered above, solve the lasso problem and plot the resulting weights as a bar graph. Observe how the number of non-zero weights changes with the regularization parameter γ (a minimal sketch of such a sweep is given at the end of this section). Your comparisons should look similar to Fig. 1. In each of these cases, compare the prediction errors. In the case of the sparse regression, would you say the features with non-zero weights are more meaningful? (To answer, you have to find the source of the data and look at the variables.)

Regularization Path

In implementing the lasso it is convenient to study the regularization path (Fig. 2; image taken from https://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_lars.html). Implement and study the regularization path for the six-variable illustrative example considered in [3].

[Figure 2: Regularization Path: how regression coefficients change with the hyperparameter.]
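The following is a minimal, self-contained sketch of such a sweep on the diabetes data: it records, for an illustrative (untuned) grid of regularization strengths, the number of non-zero lasso coefficients and the in-sample mean squared error, alongside the unregularized pseudo-inverse fit as a baseline. Here sklearn's alpha parameter plays the role of γ.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso

# Load the diabetes data (as in Appendix snippet 1)
diabetes = datasets.load_diabetes()
Y, f = diabetes.data, diabetes.target

# Baseline: in-sample MSE of the unregularized pseudo-inverse fit
a = np.linalg.inv(Y.T @ Y) @ Y.T @ f
print(f"Pseudo-inverse baseline MSE = {np.mean((f - Y @ a) ** 2):.1f}")

# Sweep the lasso regularization strength; the grid below is illustrative only
for alpha in [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]:
    ll = Lasso(alpha=alpha).fit(Y, f)
    n_nonzero = int(np.sum(ll.coef_ != 0))
    mse = np.mean((f - ll.predict(Y)) ** 2)
    print(f"alpha = {alpha:5.2f}  non-zero weights = {n_nonzero:2d}  MSE = {mse:8.1f}")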
4 Solubility Prediction

We will now look at a large problem of predicting the solubility of chemical compounds from features derived from their molecular structure. Predicting function from structural variables is an important problem because it is easy to define and synthesize small chemical compounds, but very expensive to test them experimentally. Hence the step known as in silico screening is increasingly popular. The dataset we will use is from Huuskonen et al. [4], and the problem has also been considered recently in Pirashvili et al. [5] using more sophisticated machinery. Have a skim-read through the introductory and results sections of these papers.

Data used in [4], with several additional features and more compounds, is available in the Excel spreadsheet Husskonen Solubility Features.xlsx.

• Load the data, split it into training and test sets, implement a linear regression and plot the predicted solubilities against the true solubilities on the training and test sets. To facilitate comparison, draw the two scatter plots side by side, with the same scale on both axes.
• Implement a lasso regularized solution and plot graphs of how the prediction error (on the test data) and the corresponding number of non-zero coefficients change with increasing regularization.
• Are you able to make any comment comparing your results to those claimed in [4] or [5]?

Report

Write a short report of no more than four pages, summarising your work.

Appendix: Snippets of Code

1. Linear regression on the diabetes dataset

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the data, inspect it and do exploratory plots
diabetes = datasets.load_diabetes()
Y = diabetes.data
f = diabetes.target
NumData, NumFeatures = Y.shape
print(NumData, NumFeatures)
print(f.shape)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
ax[0].hist(f, bins=40)
ax[0].set_title("Distribution of Target", fontsize=14)
ax[1].scatter(Y[:, 6], Y[:, 7], c='m', s=3)
ax[1].set_title("Scatter of Two Inputs", fontsize=14)

2. Comparing the pseudo-inverse solution to sklearn output

# Linear regression using sklearn
# (note: sklearn fits an intercept by default, the pseudo-inverse solution below does not)
lin = LinearRegression()
lin.fit(Y, f)
fh1 = lin.predict(Y)

# Pseudo-inverse solution to linear regression
a = np.linalg.inv(Y.T @ Y) @ Y.T @ f
fh2 = Y @ a

# Plot predictions to check if they look the same!
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
ax[0].scatter(f, fh1, c='c', s=3)
ax[0].grid(True)
ax[0].set_title("Sklearn", fontsize=14)
ax[1].scatter(f, fh2, c='m', s=3)
ax[1].grid(True)
ax[1].set_title("Pseudoinverse", fontsize=14)

3. Tikhonov regularizer

gamma = 0.5
aR = np.linalg.inv(Y.T @ Y + gamma*np.identity(NumFeatures)) @ Y.T @ f

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
ax[0].bar(np.arange(len(a)), a)
ax[0].set_title('Pseudo-inverse solution', fontsize=14)
ax[0].grid(True)
ax[0].set_ylim(np.min(a), np.max(a))
ax[1].bar(np.arange(len(aR)), aR)
ax[1].set_title('Regularized solution', fontsize=14)
ax[1].grid(True)
ax[1].set_ylim(np.min(a), np.max(a))

4. Sparsity-inducing (lasso) regularizer

from sklearn.linear_model import Lasso

ll = Lasso(alpha=0.2)
ll.fit(Y, f)
yh_lasso = ll.predict(Y)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax[0].bar(np.arange(len(a)), a)
ax[0].set_title('Pseudo-inverse solution', fontsize=14)
ax[0].grid(True)
ax[0].set_ylim(np.min(a), np.max(a))
ax[1].bar(np.arange(len(ll.coef_)), ll.coef_)
ax[1].set_title('Lasso solution', fontsize=14)
ax[1].grid(True)
ax[1].set_ylim(np.min(a), np.max(a))

5. Lasso regularization path on a synthetic example

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
from sklearn import datasets

# Synthetic data (problem taken from Hastie et al., Statistical Learning with Sparsity):
#   Z1, Z2 ~ N(0,1)
#   y = 3*Z1 - 1.5*Z2 + 2*N(0,1)   (noisy response)
#   Noisy inputs (the six are in two groups of three each):
#   Xj = Z1 + 0.2*N(0,1) for j = 1,2,3, and
#   Xj = Z2 + 0.2*N(0,1) for j = 4,5,6.
N = 100
y = np.empty(0)
X = np.empty([0, 6])
for i in range(N):
    Z1 = np.random.randn()
    Z2 = np.random.randn()
    y = np.append(y, 3*Z1 - 1.5*Z2 + 2*np.random.randn())
    Xarr = np.array([Z1, Z1, Z1, Z2, Z2, Z2]) + np.random.randn(6)/5
    X = np.vstack((X, Xarr.tolist()))

# Compute regressions with the lasso and return the paths
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, fit_intercept=False)

# Plot each coefficient against the regularization strength
fig, ax = plt.subplots(figsize=(8, 4))
for i in range(6):
    ax.plot(alphas_lasso, coefs_lasso[i, :])
ax.grid(True)
ax.set_xlabel("Regularization")
ax.set_ylabel("Regression Coefficients")

6. Predicting the solubility of chemical compounds

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

sol = pd.read_excel("Husskonen_Solubility_Features.xlsx")
print(sol.shape)
colnames = sol.columns
print(colnames)

f = sol["LogS.M."].values
fig, ax = plt.subplots(figsize=(4, 4))
ax.hist(f, bins=40, facecolor='m')
ax.set_title("Histogram of Log Solubility", fontsize=14)
ax.grid(True)

Y = sol[colnames[5:len(colnames)]]
N, p = Y.shape
print(Y.shape)
print(f.shape)

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
Y_train, Y_test, f_train, f_test = train_test_split(Y, f, test_size=0.3)

# Regularized regression
gamma = 2.3
a = np.linalg.inv(Y_train.T @ Y_train + gamma*np.identity(p)) @ Y_train.T @ f_train
fh_train = Y_train @ a.values
fh_test = Y_test @ a.values

# Plot training and test predictions
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax[0].scatter(f_train, fh_train, c='m', s=3)
ax[0].grid(True)
ax[0].set_title("Training Data", fontsize=14)
ax[1].scatter(f_test, fh_test, c='m', s=3)
ax[1].grid(True)
ax[1].set_title("Test Data", fontsize=14)

# Over to you for implementing Lasso
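7. Lasso on the solubility data (a possible sketch)

Snippet 6 leaves the lasso to you; the following is only one minimal sketch of how such a sweep might look, not the required solution. It assumes the variables Y_train, Y_test, f_train and f_test from snippet 6 are already in scope, and the grid of alpha values is an illustrative, untuned choice.

from sklearn.linear_model import Lasso

# Assumes Y_train, Y_test, f_train, f_test (and np, plt) from snippet 6 are already defined.
# The grid of alpha values below is illustrative only and may need adjusting.
alphas = np.logspace(-3, 1, 20)
test_mse = []
n_nonzero = []
for alpha in alphas:
    ll = Lasso(alpha=alpha, max_iter=10000)
    ll.fit(Y_train, f_train)
    test_mse.append(np.mean((f_test - ll.predict(Y_test)) ** 2))
    n_nonzero.append(np.sum(ll.coef_ != 0))

# Plot test error and sparsity against the regularization strength
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
ax[0].plot(alphas, test_mse, marker='o', ms=3)
ax[0].set_xscale('log')
ax[0].set_xlabel("Regularization (alpha)")
ax[0].set_ylabel("Test MSE")
ax[0].grid(True)
ax[1].plot(alphas, n_nonzero, marker='o', ms=3, c='m')
ax[1].set_xscale('log')
ax[1].set_xlabel("Regularization (alpha)")
ax[1].set_ylabel("Number of non-zero coefficients")
ax[1].grid(True)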
References

[1] K. Bache and M. Lichman, "UCI Machine Learning Repository." http://archive.ics.uci.edu/ml, 2013.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2008.
[3] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC, 2015.
[4] J. Huuskonen, M. Salo, and J. Taskinen, "Aqueous solubility prediction of drugs based on molecular topology and neural network modeling," Journal of Chemical Information and Computer Sciences, vol. 38, no. 3, pp. 450–456, 1998.
[5] M. Pirashvili, L. Steinberg, F. Belchi Guillamon, M. Niranjan, J. G. Frey, and J. Brodzki, "Improved understanding of aqueous solubility modeling through topological data analysis," Journal of Cheminformatics, vol. 10, no. 1, p. 54, 2018.

Mahesan Niranjan, November 2014
