Machine Learning and Big Data
Beyond the Lasso
EC340 – Topics in Applied Economics
Dr Pedro CL Souza
University of Warwick
Economics Department
AT 2019/20
Beyond the Lasso 1 / 69
Contents
1. Regularization Methods: ℓ0, ℓ1 and ℓ2
2. Forest and Trees
3. Forecasting Macroeconomic Variables
4. Lasso Theory (optional material)
Beyond the Lasso 2 / 69
Regularization Methods: ℓ0, ℓ1 and ℓ2
• In the previous lecture we defined the Lasso estimator
$$\hat\beta_{lasso} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} |\beta_j| \right]$$
and the Adaptive Lasso version
$$\hat\beta_{adalasso} = \arg\min_\beta \left[ \sum_{i=1}^{N} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K} \omega_j |\beta_j| \right]$$
where $\omega_j = |\hat\beta_{j,lasso}|^{-1}$.
Beyond the Lasso 3 / 69
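As a hedged illustration (simulated data, not the lecture's beyondLasso.Rmd code), both estimators can be computed with glmnet, using its penalty.factor argument to carry the Adaptive Lasso weights $\omega_j$:

```r
# Sketch: Lasso and a two-step Adaptive Lasso via glmnet on simulated data.
library(glmnet)

set.seed(1)
N <- 100; K <- 20
X <- matrix(rnorm(N * K), N, K)
beta <- c(rep(1, 5), rep(0, K - 5))          # only the first 5 covariates are relevant
y <- as.numeric(X %*% beta + rnorm(N))

# First-stage Lasso (lambda chosen by cross-validation)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
b_lasso  <- as.numeric(coef(cv_lasso, s = "lambda.min"))[-1]   # drop the intercept

# Adaptive Lasso: weights omega_j = 1 / |beta_hat_lasso,j|
omega  <- 1 / pmax(abs(b_lasso), 1e-6)       # guard against division by zero
cv_ada <- cv.glmnet(X, y, alpha = 1, penalty.factor = omega)
b_ada  <- as.numeric(coef(cv_ada, s = "lambda.min"))[-1]

cbind(true = beta, lasso = round(b_lasso, 2), adalasso = round(b_ada, 2))
```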
Regularization Methods: ℓ0, ℓ1 and ℓ2
• The Lasso and Adaptive Lasso fit into a broader category of penalized
estimation methods
• Indeed, the Akaike and Schwarz criteria can also be seen as penalized methods
• In the previous lecture, we defined
$$AIC = 2\ln\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i\beta)^2\right) + \frac{2(k+1)}{n}$$
$$BIC = 2\ln\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - x_i\beta)^2\right) + \frac{(k+1)\ln n}{n}$$
where k is the number of variables included in the model. Since, in this case, no estimated parameter is zero, $k = \|\beta\|_0$
So this can also be seen as ℓ0 penalization but, as we showed, it is computationally infeasible
Beyond the Lasso 4 / 69
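A small sketch of these criteria exactly as defined above, written as an R function (the example call assumes y and X objects like those in the simulation sketch earlier):

```r
# Sketch: AIC/BIC as defined on the slide, for an OLS fit with k regressors.
ic <- function(y, X) {
  n   <- length(y)
  k   <- ncol(X)
  fit <- lm(y ~ X)
  msr <- mean(residuals(fit)^2)            # (1/n) * sum of squared residuals
  c(AIC = 2 * log(msr) + 2 * (k + 1) / n,
    BIC = 2 * log(msr) + (k + 1) * log(n) / n)
}

# e.g. compare a small and a large model:
# rbind(small = ic(y, X[, 1:5]), large = ic(y, X))
```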
Regularization Methods: ℓ0, ℓ1 and ℓ2
• Ridge regression: ℓ2 penalization
$$\hat\beta_{ridge} = \arg\min_\beta \left[ \sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K}\beta_j^2 \right]$$
The apparently minor difference (ℓ1 penalization becomes ℓ2) has a profound impact on the properties of the estimator
• Ridge regression has been shown to handle correlated regressors, even very close to multicollinearity, well
• Define the objective function
$$f(\beta) = \sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{K}\beta_j^2$$
With K = 1, the first-order condition is
$$\frac{\partial f(\beta)}{\partial \beta} = -2\sum_{i=1}^{N}(y_i - x_i\hat\beta_{ridge})x_i + 2\lambda\hat\beta_{ridge} = 0$$
Beyond the Lasso 5 / 69
Regularization Methods: ℓ0, ℓ1 and ℓ2
• Working through the first-order condition,
$$\frac{\partial f(\beta)}{\partial \beta} = -2\sum_{i=1}^{N}(y_i - x_i\hat\beta_{ridge})x_i + 2\lambda\hat\beta_{ridge} = 0$$
and therefore
$$\sum_{i=1}^{N} y_i x_i - \hat\beta_{ridge}\sum_{i=1}^{N} x_i^2 - \lambda\hat\beta_{ridge} = 0 \;\Rightarrow\; \hat\beta_{ridge} = \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2 + \lambda}$$
• Since λ ≥ 0,
$$\hat\beta_{ridge} = \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2 + \lambda} \le \frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2} = \hat\beta_{ols}$$
and therefore $\hat\beta_{ridge}$ is also biased towards zero
• $\hat\beta_{ridge}$ is non-zero for every j ⇒ no element of model selection
Beyond the Lasso 6 / 69
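A quick sketch verifying the K = 1 closed form derived above on simulated data (the function ridge_k1 is ours, not a library routine):

```r
# Sketch: the K = 1 ridge closed form, beta_ridge = sum(x*y) / (sum(x^2) + lambda).
set.seed(2)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200)

ridge_k1 <- function(y, x, lambda) sum(y * x) / (sum(x^2) + lambda)

b_ols <- sum(y * x) / sum(x^2)                               # OLS: lambda = 0
b_ols
sapply(c(0, 10, 100, 1000), function(l) ridge_k1(y, x, l))   # shrinks towards zero as lambda grows
```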
Regularization Methods: ℓ0, ℓ1 and ℓ2
• Elastic Net: combining ℓ1 and ℓ2 penalties
$$\hat\beta_{enet} = \arg\min_\beta \left[ \sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda_1\sum_{j=1}^{K}|\beta_j| + \lambda_2\sum_{j=1}^{K}\beta_j^2 \right]$$
and the adaptive version
$$\hat\beta_{adaenet} = \arg\min_\beta \left[ \sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda_1\sum_{j=1}^{K}\omega_j|\beta_j| + \lambda_2\sum_{j=1}^{K}\beta_j^2 \right]$$
where $\omega_j = 1/|\hat\beta_{j,enet}|$
Beyond the Lasso 7 / 69
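In R, glmnet implements the Elastic Net, although with a slightly different parameterization: a single λ and a mixing weight α between the ℓ1 and ℓ2 parts, rather than separate (λ1, λ2). A hedged sketch on simulated data:

```r
# Sketch: Elastic Net and a simple adaptive version via glmnet.
library(glmnet)

set.seed(3)
N <- 100; K <- 40
X <- matrix(rnorm(N * K), N, K)
y <- as.numeric(X[, 1] + X[, 2] + rnorm(N))

# alpha in (0, 1) mixes the penalties; alpha = 1 is the Lasso, alpha = 0 is Ridge
cv_enet <- cv.glmnet(X, y, alpha = 0.5)
b_enet  <- as.numeric(coef(cv_enet, s = "lambda.min"))[-1]

# adaptive version: reuse the first-stage estimates as penalty weights
omega      <- 1 / pmax(abs(b_enet), 1e-6)
cv_adaenet <- cv.glmnet(X, y, alpha = 0.5, penalty.factor = omega)
```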
Regularization Methods: ℓ0, ℓ1 and ℓ2
• Finally, penalized likelihood methods are very useful and find many
applications, including textual analysis
• Lasso estimator, again
$$\hat\beta_{lasso} = \arg\min_\beta \underbrace{\sum_{i=1}^{N}(y_i - x_i\beta)^2}_{\text{sum of squared residuals}} + \lambda\|\beta\|_1$$
suitable for continuous outcomes $y_i$
• Penalized likelihood
$$\hat\beta_{penlik} = \arg\min_\beta \underbrace{-\ln L(\beta)}_{\text{negative log-likelihood}} + \lambda\|\beta\|_1$$
suitable for, for example, binary outcomes $y_i$
Beyond the Lasso 8 / 69
Regularization Methods: ℓ0, ℓ1 and ℓ2
• Why likelihood?
– Example: logistic regression and limited dependent variable models
– Suppose that we were looking for causes that determine employment
(yi = 1) or unemployment (yi = 0)
– Many candidate explanatory variables xi that can include age, gender,
education, parents’ education, …
– Logit model
$$P(y_i = 1|x_i) = G(\beta_0 + x_{i1}\beta_1 + \cdots + x_{iK}\beta_K) = G(x_i\beta)$$
with
$$G(z) = \frac{\exp(z)}{1 + \exp(z)}$$
Beyond the Lasso 9 / 69
Regularization Methods: ℓ0, ℓ1 and ℓ2
• Why likelihood?
– The likelihood of the logit model is
$$L(\beta) = \prod_{i=1}^{N}[G(x_i\beta)]^{y_i}[1 - G(x_i\beta)]^{1-y_i}$$
and the log-likelihood is
$$\ln L(\beta) = \sum_{i=1}^{N} y_i\ln[G(x_i\beta)] + (1 - y_i)\ln[1 - G(x_i\beta)] = \sum_{i:\,y_i=1}\ln[G(x_i\beta)] + \sum_{i:\,y_i=0}\ln[1 - G(x_i\beta)]$$
Beyond the Lasso 10 / 69
Regularization Methods: ℓ0, ℓ1 and ℓ2
• Why likelihood?
– Fitting the standard logit model is feasible if K is small
– If K is large, estimates are very volatile and lead to poor predictive
performance ⇒ Penalized likelihood solves this issue
– Application: analysis of text such as Facebook posts, tweets,
speeches, Central Banks’ reports
Beyond the Lasso 11 / 69
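A minimal sketch of a penalized logit with glmnet (family = "binomial") on simulated data standing in for the employment example:

```r
# Sketch: L1-penalized logistic regression with many candidate covariates.
library(glmnet)

set.seed(4)
N <- 500; K <- 50
X <- matrix(rnorm(N * K), N, K)     # stand-ins for age, education, etc.
p <- plogis(X[, 1] - X[, 2])        # G(x*beta) with only two relevant covariates
y <- rbinom(N, 1, p)                # employed (1) / unemployed (0)

cv_logit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv_logit, s = "lambda.min")    # most coefficients are set exactly to zero
```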
See beyondLasso.Rmd, sections A–C
Beyond the Lasso 12 / 69
Contents
1. Regularization Methods: ℓ0, ℓ1 and ℓ2
2. Forest and Trees
3. Forecasting Macroeconomic Variables
4. Lasso Theory (optional material)
Beyond the Lasso 13 / 69
Regression Trees
• Regression trees fall outside the scope of regularized methods, but are very
useful for prediction and classification problems
• Very flexible and inherently incorporate non-linearities, while the previous methods were linear
• The starting point for the Lasso and its varieties was the linear model
$$y_i = x_i\beta + \varepsilon_i$$
and for the logistic regression the index inside $G(\cdot)$ is still a linear function of $x_i$,
$$P(y_i = 1|x_i) = G(x_i\beta)$$
for a K × 1 vector β, where K is large
Beyond the Lasso 14 / 69
Regression Trees
• Non-linearities can take many forms:
– Approximately linear
$$y_i = \alpha + x_{i1}\beta_{11} + x_{i1}^2\beta_{12} + \cdots + x_{iK}\beta_{K1} + x_{iK}^2\beta_{K2} + \varepsilon_i$$
– Piecewise linear
$$y_i = f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_K(x_{iK}) + \varepsilon_i$$
– Strongly non-linear
$$y_i = f(x_{i1}, \ldots, x_{iK}) + \varepsilon_i$$
… and the econometrician never knows the true model
Beyond the Lasso 15 / 69
Regression Trees
• Regression Trees is a very flexible, inherently non-linear model, while still
parsimonious
• To simplify, consider K = 2, so there are two predictors, $x_{i1}$ and $x_{i2}$, both of which belong to the unit interval [0, 1]
[Figure: the unit square in the $(x_{i1}, x_{i2})$ plane]
Beyond the Lasso 16 / 69
Regression Trees
• Start the process by letting the prediction anywhere in the square be $\hat{c}$
[Figure: the unit square with a single prediction $\hat{c}$]
• If we are minimizing the mean squared prediction error, the prediction is $\hat{c} = \bar{y}$
Beyond the Lasso 17 / 69
Regression Trees
• The prediction can possibly be made better if we split the square into two regions $R_1$ and $R_2$, say
$$R_1 = \{(x_{i1}, x_{i2}) : x_{i1} \le t_1\} \qquad R_2 = \{(x_{i1}, x_{i2}) : x_{i1} > t_1\}$$
[Figure: the unit square split into $R_1$ (left of $t_1$) and $R_2$ (right of $t_1$)]
Beyond the Lasso 18 / 69
Regression Trees
• The best prediction (in the sense of minimizing the mean squared
prediction error) within R1 is the average of y conditional on values of xi1
and xi2 that satisfy R1
– We define this prediction as cˆ1
– Similarly for R2 with prediction cˆ2
[Figure: the unit square split into $R_1$ and $R_2$, with predictions $\hat{c}_1$ and $\hat{c}_2$]
Beyond the Lasso 19 / 69
Regression Trees
• We can keep on partitioning the areas and conditioning the predictions
• For example, with five partitions
[Figure: the unit square partitioned into five regions with predictions $\hat{c}_1, \ldots, \hat{c}_5$]
Beyond the Lasso 20 / 69
[Figure: recursive binary partition of the predictor space and the corresponding regression tree. Graph from Chapter 9 of Hastie, Tibshirani and Friedman, "The Elements of Statistical Learning", Springer 2009]
Beyond the Lasso 21 / 69
Regression Trees
• Overall, given a partition $\{R_1, \ldots, R_M\}$, the prediction is
$$\hat{y} = \sum_{m=1}^{M} \hat{c}_m \cdot I\{(x_{i1}, x_{i2}) \in R_m\}$$
where I is the indicator function
• More generally, for K > 2 predictors,
$$\hat{y} = \sum_{m=1}^{M} \hat{c}_m \cdot I(x_i \in R_m)$$
• How do we select the partitions $R_m$, and how many of them (M) should there be?
Beyond the Lasso 22 / 69
Regression Trees
• Step-wise algorithm:
1. Start with the full data and a single region $R_1$, so $\hat{y} = \bar{y}$
2. Find the variable j and the split point s such that the prediction error is minimized,
$$\min_{j,s} \sum_{x_i \in R_1(j,s)} (y_i - \hat{c}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat{c}_2)^2$$
3. Keep splitting until M partitions are obtained
• The algorithm is automated and uses objective criteria, but is computationally heavy, although not infeasible (see the sketch below)
Beyond the Lasso 23 / 69
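A sketch of the splitting algorithm using the rpart package (CART), on hypothetical data with a simple non-linear structure:

```r
# Sketch: a regression tree grown by recursive binary splitting with rpart.
library(rpart)

set.seed(5)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- ifelse(d$x1 < 0.5, 1, 3) + ifelse(d$x2 < 0.3, 0, 2) + rnorm(n, sd = 0.3)

tree <- rpart(y ~ x1 + x2, data = d, method = "anova")    # recursive binary splitting
tree                                                       # prints the splits (j, s) and leaf means c_m
predict(tree, newdata = data.frame(x1 = 0.2, x2 = 0.9))
```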
Regression Trees
• How do we select M?
– The prediction error for each partition m is
$$Q_m = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$$
where $N_m$ is the number of observations $x_i$ that fall within $R_m$,
$$N_m = \#\{x_i \in R_m\}$$
and $\hat{c}_m$ is the prediction in partition $R_m$,
$$\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$$
Beyond the Lasso 24 / 69
Regression Trees
• How do we select M?
– The greater M, the larger the tree
– Risk of overfitting the data, with consequences similar to the linear regression we saw in lecture 1
– But a very small tree might not capture important features of the data
– Define a criterion that balances the two objectives,
$$C_\alpha = \sum_{m=1}^{M} N_m Q_m + \alpha M$$
where α governs the trade-off between large and small trees
– There are automatic methods that pick α to maximize the performance of the algorithm (see the sketch below)
Beyond the Lasso 25 / 69
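In rpart, the complexity parameter cp plays the role of α above, and printcp()/prune() implement the automatic choice. A sketch, again on simulated data:

```r
# Sketch: choosing the tree size via cost-complexity pruning with rpart.
library(rpart)

set.seed(6)
d <- data.frame(x1 = runif(500), x2 = runif(500))
d$y <- sin(4 * d$x1) + (d$x2 > 0.5) + rnorm(500, sd = 0.3)

big_tree <- rpart(y ~ x1 + x2, data = d, method = "anova",
                  control = rpart.control(cp = 0.001))      # deliberately over-grown
printcp(big_tree)                                           # cross-validated error by tree size
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)                    # the pruned (smaller) tree
```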
Random Trees
• Random Trees: small modification of the regression trees
• Step-wise algorithm, random-tree version:
1. Start with the full data and a single region $R_1$, so $\hat{y} = \bar{y}$
2a. Randomly select a subset of p ≤ K covariates as candidates for splitting (a usual value is p = √K)
2b. Find the variable j among those selected in step 2a and the split point s such that the prediction error is minimized,
$$\min_{j,s} \sum_{x_i \in R_1(j,s)} (y_i - \hat{c}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \hat{c}_2)^2$$
3. Keep splitting until M partitions are obtained
Beyond the Lasso 26 / 69
Random Trees
• Random Tree may seem counter-intuitive. Why add the random
pre-selection of the covariates in step 2a?
– On its own, the random step 2a might deteriorate performance, because the search for variables (and split points) in step 2b is conducted on a smaller subset of the covariates
– The idea is to generate many random trees and combine their
predictions
– Many random trees = random forest
Beyond the Lasso 27 / 69
Random Forest
• Random Forest algorithm, version 1:
1. Follow steps 1–3 of the random tree algorithm B times. For each b, let $T_b(x_i)$ be the prediction of the b-th tree, giving
$$\{T_1(x_i), \ldots, T_B(x_i)\}$$
2. Average the predictions of the B individual random trees for the final output,
$$\hat{y}_i = \frac{1}{B}\sum_{b=1}^{B} T_b(x_i)$$
2 (alt). If $y_i$ is a categorical outcome, choose the most frequent output among $\{T_1(x_i), \ldots, T_B(x_i)\}$, i.e., the majority output
Beyond the Lasso 28 / 69
Random Forest
• The Random Forest actually goes a step further
• Bootstrap: procedure whereby a random subsample is drawn from the
original data
– In the cross-sectional case, for a dataset of size N, the bootstrapped sample is constructed by sampling N observations from the original data with replacement, each with probability 1/N
Original data: (y1, x11, x12), (y2, x21, x22), (y3, x31, x32), …, (yN, xN1, xN2)
Bootstrap data: (y7, x71, x72), (y9, x91, x92), (y1, x11, x12), …, (y9, x91, x92)
Note that, since the sampling is done with replacement, the same observation may occur more than once in the bootstrapped sample (see the sketch below)
Beyond the Lasso 29 / 69
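A one-line sketch of the cross-sectional bootstrap draw described above, on a hypothetical data frame dat:

```r
# Sketch: a single nonparametric bootstrap sample (rows drawn with replacement).
set.seed(7)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))

idx <- sample(nrow(dat), replace = TRUE)   # N draws with replacement, each with prob. 1/N
boot_dat <- dat[idx, ]
sort(idx)                                  # some rows appear more than once, others not at all
```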
Random Forest
• Bootstrap (cont'd):
– In the time-series case, sampling the observations in this way would destroy the dependence over time
– Instead, observations are sampled in blocks of consecutive observations and the drawn blocks are concatenated. For example, starting from the series
y1 y2 y3 y4 y5 y6 y7 y8 y9
a bootstrapped series built from four drawn blocks (Blocks 1–4) could be
y5 y6 y7 y2 y3 y7 y8 y9 y9
Beyond the Lasso 30 / 69
Random Forest
• Random Forest algorithm, version 2:
1. Draw a bootstrap sample from the original data and follow steps 1–3 of the random tree algorithm. Repeat the process B times. For each b, let $T_b(x_i)$ be the prediction of the b-th tree, giving
$$\{T_1(x_i), \ldots, T_B(x_i)\}$$
2. Average the predictions of the B individual random trees for the final output,
$$\hat{y}_i = \frac{1}{B}\sum_{b=1}^{B} T_b(x_i)$$
2 (alt). If $y_i$ is a categorical outcome, choose the most frequent output among $\{T_1(x_i), \ldots, T_B(x_i)\}$, i.e., the majority output (a code sketch follows below)
Beyond the Lasso 31 / 69
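A sketch with the randomForest package, where ntree is B and mtry is the number p of covariates tried at each split (data simulated for illustration):

```r
# Sketch: Random Forest = bootstrap samples + random subset of predictors per split.
library(randomForest)

set.seed(8)
N <- 500; K <- 20
X <- matrix(rnorm(N * K), N, K)
y <- sin(2 * X[, 1]) + X[, 2] * X[, 3] + rnorm(N, sd = 0.3)

rf <- randomForest(x = X, y = y, ntree = 500, mtry = floor(sqrt(K)))
predict(rf, newdata = X)[1:5]    # each prediction is the average over the B = 500 trees
```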
Random Forest
• Why add the bootstrap and random tree steps?
– The reason is that, perhaps counter-intuitively, it makes the averaging step more efficient
– Suppose that we had B independent and identically distributed predictions $\hat{y}_i^{(1)}, \ldots, \hat{y}_i^{(B)}$, each with mean zero and variance $\sigma^2$
– The variance of the average prediction is then
$$\mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{y}_i^{(b)}\right] = \frac{1}{B^2}\sum_{b=1}^{B}\mathrm{Var}[\hat{y}_i^{(b)}] = \frac{\sigma^2}{B}$$
so the more predictions B, the lower the variance of the average prediction (which decreases the mean squared error)
Beyond the Lasso 32 / 69
Random Forest
• Why add the bootstrap and random tree steps?
– The various predictions from the random trees are likely to be highly correlated with each other, since they were generated from the same original data – so they are hardly independent
– If there is a positive pairwise correlation ρ between the predictions $\hat{y}_i^{(b)}$, the variance of the averaged prediction can be worked out to be
$$\mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{y}_i^{(b)}\right] = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
The second term disappears as B → ∞, but the first does not go away
– Bootstrapping is also a way to reduce the dependence among the different samples, and therefore the dependence in the predictions ⇒ reduction in the variance (see the simulation sketch below)
– Very successful method, as we will see in the following application
Beyond the Lasso 33 / 69
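A small simulation sketch of the variance formula above, assuming equicorrelated predictions built from a shared plus an idiosyncratic component:

```r
# Sketch: Var[(1/B) * sum of B equicorrelated predictions] = rho*sigma^2 + (1-rho)/B * sigma^2.
set.seed(9)
B <- 100; sigma2 <- 1; rho <- 0.6; R <- 5000

avg <- replicate(R, {
  common <- rnorm(1, sd = sqrt(rho * sigma2))             # component shared by all B predictions
  mean(common + rnorm(B, sd = sqrt((1 - rho) * sigma2)))  # average of the B predictions
})
var(avg)                               # simulated variance of the average
rho * sigma2 + (1 - rho) * sigma2 / B  # theoretical value from the formula above
```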
Contents
1. Regularization Methods: ℓ0, ℓ1 and ℓ2
2. Forest and Trees
3. Forecasting Macroeconomic Variables
4. Lasso Theory (optional material)
Beyond the Lasso 34 / 69
Forecasting Macroeconomic Variables
• Paper by Medeiros et al, “Forecasting Inflation in a Data-Rich
Environment”, Working Paper, 2018
It is difficult to overemphasize the importance of forecasting inflation
in rational economic decision-making. Many contracts concerning em-
ployment, sales, tenancy, and debt are set in nominal terms. Therefore,
inflation forecasting is of great value to households, businesses and pol-
icymakers. In addition, central banks rely on inflation forecasts not only
to inform monetary policy but also to anchor inflation expectations and
thus enhance policy efficacy. (p. 1)
Beyond the Lasso 35 / 69
Forecasting Macroeconomic Variables
• In the literature, improving forecast accuracy beyond simple benchmark models such as the AR has proven to be a challenge
• The paper shows that machine learning and big data do just that – a dramatic improvement in the capacity to forecast future inflation
• Data: FRED-MD database
– Full sample: January 1960 to December 2015, 672 observations
– Out-of-sample window: January 1990 to December 2015
– 122 variables with observations in all sample periods
– Consider four lags, as well as four autoregressive terms: 508 potential
predictors
Beyond the Lasso 36 / 69
Forecasting Macroeconomic Variables
• Benchmark measure: inflation, the change in log prices, $\pi_t = \log(P_t) - \log(P_{t-1})$ for a price level $P_t$
[Figure: the inflation series over the sample period]
Forecasting Macroeconomic Variables
• Consider the following (general) model
$$\pi_{t+h} = G_h(x_t) + u_{t+h}$$
where h is the forecast horizon, $x_t$ is a vector of covariates (possibly including lags of $\pi_t$), and $G_h(\cdot)$ is the mapping between the covariates and inflation
• Models: a vast array
1. Random walk: $\hat\pi_{t+h} = \pi_t$
2. Autoregressive of p-th order: $\hat\pi_{t+h} = \hat\phi_0 + \hat\phi_1\pi_t + \cdots + \hat\phi_p\pi_{t-p+1}$
3. Shrinkage models: Lasso, Adaptive Lasso, Ridge, Elastic Net
4. Random Forest (RF)
(among many others)
Beyond the Lasso 38 / 69
Forecasting Macroeconomic Variables
• Evaluate predictive power by the out-of-sample criteria
$$MSE_{m,h} = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T}\hat{e}^2_{t,m,h} \qquad MAE_{m,h} = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T}|\hat{e}_{t,m,h}|$$
for model m and forecast horizon h, with $RMSE = \sqrt{MSE}$ (a small computation sketch follows below)
• Next slide: the Random Forest easily outperforms all other methods, across all forecast horizons, by 20%–25%
• Many tests, subsamples, etc. in the paper
Beyond the Lasso 39 / 69
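A sketch of the out-of-sample criteria on hypothetical forecast series (the "alternative" forecast is a stand-in, not the paper's Random Forest output):

```r
# Sketch: out-of-sample RMSE and MAE for two competing forecasts.
set.seed(10)
actual <- rnorm(120)                        # hypothetical out-of-sample target series
fc_rw  <- c(0, actual[-120])                # random-walk-style forecast: last observed value
fc_alt <- actual + rnorm(120, sd = 0.5)     # stand-in for a competing (e.g. RF) forecast

eval_fc <- function(e) c(RMSE = sqrt(mean(e^2)), MAE = mean(abs(e)))
rbind(random_walk = eval_fc(actual - fc_rw),
      alternative = eval_fc(actual - fc_alt))
```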
Beyond the Lasso 40 / 69
Contents
1. Regularization Methods: ℓ0, ℓ1 and ℓ2
2. Forest and Trees
3. Forecasting Macroeconomic Variables
4. Lasso Theory (optional material)
Beyond the Lasso 41 / 69
Lasso Theory
The model selection problem
• We framed our machine learning problem as one of model selection
• Our problem is to select among K variables
$$y_i = \alpha + x_{i1}\beta_1 + \cdots + x_{iK}\beta_K + \varepsilon_i = \alpha + x_i\beta + \varepsilon_i$$
where $x_i$ is K × 1, K is very large, and possibly K > N
• In stacked notation,
$$y = X\beta + \varepsilon$$
where y is N × 1, X is N × K and ε is N × 1
Beyond the Lasso 42 / 69
Lasso Theory
• For example,
$$y = X\beta + \varepsilon$$
where K = 75 and N = 100, $\beta_1 = \cdots = \beta_5 = 1$, and $\beta_6 = \cdots = \beta_{75} = 0$
The econometrician has K = 75 covariates, but only 5 are relevant ($\beta_j \neq 0$) and explain the outcomes
Unfortunately, ex ante the econometrician does not know which covariates are relevant ($\beta_j \neq 0$) and which are irrelevant ($\beta_j = 0$)
As we saw in the last lecture, including all K = 75 will lead to poor predictive performance, so this is not really an option
That is, this is a problem of model selection
Beyond the Lasso 43 / 69
Lasso Theory
The support
• Define the support S of the model as the set of relevant covariates or the
set of i for which the true βi is not zero
• In the example above,
S = {1, 2, 3, 4, 5}
because the true β1, β2, β3, β4 and β5 are non-zero, and all other true β’s
are zero
• If S was known, the econometrician would have regressed x1, x2, x3, x4
and x5 on y . Denote that regression as
y = XSβ +
where XS selects the columns of X in the set S. This is called the Oracle
regression
• The OLS estimator for the Oracle regression is then
βˆoracle = (X
′
SXS)
−1X ′Sy
Beyond the Lasso 44 / 69
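A sketch of the example above: the sparse design with K = 75, N = 100 and five relevant covariates, together with the infeasible Oracle regression:

```r
# Sketch: sparse simulated design and the (infeasible) Oracle OLS on the true support.
set.seed(11)
N <- 100; K <- 75; s <- 5
X <- matrix(rnorm(N * K), N, K)
beta <- c(rep(1, s), rep(0, K - s))
y <- as.numeric(X %*% beta + rnorm(N))

S <- 1:s                         # the true support (unknown in practice)
oracle <- lm(y ~ X[, S] - 1)     # Oracle OLS: y regressed on the relevant columns only
coef(oracle)
```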
Lasso Theory
Norms
• Define the ℓp norm of the v × 1 vector ν as
$$\|\nu\|_p = (|\nu_1|^p + \cdots + |\nu_v|^p)^{1/p}$$
• In particular, we care about the ℓ1 and ℓ2 norms
$$\|\nu\|_1 = |\nu_1| + \cdots + |\nu_v| = \sum_{i=1}^{v}|\nu_i|$$
$$\|\nu\|_2 = \left(|\nu_1|^2 + \cdots + |\nu_v|^2\right)^{1/2} = \sqrt{\sum_{i=1}^{v}\nu_i^2} = \sqrt{\nu'\nu}$$
… and the ℓ0 "norm", which is defined as the number of non-zero elements in ν
$$\|\nu\|_0 = \text{number of non-zero elements in } \nu$$
Beyond the Lasso 45 / 69
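For concreteness, the three norms computed in R for a small vector:

```r
# Sketch: l1, l2 and l0 norms of a numeric vector.
nu <- c(3.8, 0, -1.1)
sum(abs(nu))       # l1 norm
sqrt(sum(nu^2))    # l2 norm
sum(nu != 0)       # l0 "norm": number of non-zero elements
```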
Lasso Theory
• The notation above is useful because it allows us to write the Lasso problem more concisely:
$$\hat\beta_{lasso} = \arg\min_\beta \left[\sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda\sum_{j=1}^{K}|\beta_j|\right]$$
where
$$\sum_{i=1}^{N}(y_i - x_i\beta)^2 = \|y - X\beta\|_2^2 \quad\text{and}\quad \sum_{j=1}^{K}|\beta_j| = \|\beta\|_1$$
so the Lasso problem may also be written in the concise notation
$$\hat\beta_{lasso} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
Beyond the Lasso 46 / 69
Lasso Theory
• There is a solid and deep theory behind the Lasso and Adaptive Lasso
• The derivations use high-level mathematics and are nothing like standard asymptotic theory
… and full derivations are beyond the scope of this module
• Yet, we will derive the properties of the Lasso under the following special conditions:
a. The average of $y_i$ is zero: $\frac{1}{N}\sum_{i=1}^{N} y_i = 0$
b. The average of $x_i$ is zero: $\frac{1}{N}\sum_{i=1}^{N} x_i = 0$
c. The variance of each $x_i$ is 1 and the $x_i$'s are uncorrelated, so $X'X = I_K$, where $I_K$ is the identity matrix, and hence $(X'X)^{-1} = I_K$
Clearly, our interest is in deriving properties of the Lasso estimator under more general assumptions, but working out this "simple" case clarifies properties that hold more generally
Lasso Theory
• Working out the theory allows us to derive and clarify the properties of the estimator – what it achieves and what it does not
Three types of objectives
1. Prediction loss: the true model is
$$y = X\beta + \varepsilon$$
Since ε is random noise and cannot be predicted, we focus on Xβ
If $\hat\beta$ is the Lasso estimator, the Lasso prediction is $X\hat\beta$
So the prediction error is the vector $X\beta - X\hat\beta = X(\beta - \hat\beta)$
Take the squared ℓ2 norm $\|X(\beta - \hat\beta)\|_2^2$ as the metric of prediction loss
The objective is to show that the prediction loss is small
Beyond the Lasso 48 / 69
Lasso Theory
Three types of objectives
2. Parameter estimation: the estimated $\hat\beta$ is close to the true parameter β
So the objective is to show that
$$\|\beta - \hat\beta\|_2^2 = \sum_{j=1}^{K}(\beta_j - \hat\beta_j)^2$$
is small
Beyond the Lasso 49 / 69
Lasso Theory
Three types of objectives
3. Model selection: the true model is correctly selected
Also referred to as recovering the "support" of β
The support operator maps positive entries into 1, negative entries into −1 and 0 into 0. For example,
$$\mathrm{supp}\begin{pmatrix} 3.8 \\ 0 \\ -1.1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}$$
The objective is to show that
$$P\left\{\mathrm{supp}(\beta) = \mathrm{supp}(\hat\beta)\right\}$$
is as close as possible to 1
Beyond the Lasso 50 / 69
Lasso Theory
• We will use some additional tools to derive the properties of the Lasso estimator: a few inequalities and the subgradient
Inequalities
• Triangle inequality: if u and v are vectors,
$$\|u + v\| \le \|u\| + \|v\|$$
which also applies to the ℓ2 norm,
$$\|u + v\|_2 \le \|u\|_2 + \|v\|_2$$
• Reverse triangle inequality:
$$\|u - v\| \ge \|u\| - \|v\|$$
with the variation
$$\|u + v\| = \|u - (-v)\| \ge \|u\| - \|-v\| = \|u\| - \|v\|$$
Beyond the Lasso 51 / 69
Lasso Theory
Subgradient
• The Lasso objective function is
$$\hat\beta = \arg\min_\beta \left[\sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda\sum_{j=1}^{K}|\beta_j|\right] = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
which is non-differentiable at 0 because of the $|\beta_j|$ component
[Figure: the function f(x) = |x|]
Beyond the Lasso 52 / 69
Lasso Theory
• The lack of differentiability prevents us from finding analytical solutions for the Lasso estimator in the usual way
• Subgradient: a subgradient of a convex function f at a point $x_0$ is a scalar g such that
$$f(x) - f(x_0) \ge g(x - x_0) \quad\text{for all } x$$
• We define the set of subderivatives as the interval [a, b], where a and b are the one-sided limits
$$a = \lim_{x \to x_0^-} \frac{f(x) - f(x_0)}{x - x_0} \qquad b = \lim_{x \to x_0^+} \frac{f(x) - f(x_0)}{x - x_0}$$
Beyond the Lasso 53 / 69
Lasso Theory
• For the absolute-value function f(x) = |x|, the subderivative is given by
$$\frac{\partial f(x)}{\partial x} = \begin{cases} -1 & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}$$
[Figure: the function f(x) = |x|]
Beyond the Lasso 54 / 69
Lasso Theory
• For simplicity, let K = 1
• Work out the first-order conditions for the Lasso objective function
$$f(\beta) = \sum_{i=1}^{N}(y_i - x_i\beta)^2 + \lambda|\beta_1|$$
• If $\hat\beta_{lasso} > 0$,
$$\frac{\partial f(\beta)}{\partial \beta} = -2\sum_{i=1}^{N}(y_i - x_i\hat\beta_{lasso})x_i + \lambda = 0 \;\Rightarrow\; \sum_{i=1}^{N} y_i x_i - \hat\beta_{lasso}\sum_{i=1}^{N} x_i^2 - \frac{\lambda}{2} = 0$$
Beyond the Lasso 55 / 69
Lasso Theory
• The expression above implies that
$$\hat\beta_{lasso} = \underbrace{\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}}_{\text{OLS estimator}} - \underbrace{\frac{\lambda}{2}\,\frac{1}{\sum_{i=1}^{N} x_i^2}}_{\text{downward bias}} = \hat\beta_{ols} - \frac{\lambda}{2}$$
which only applies if $\hat\beta_{lasso} > 0$ (the last equality uses $\sum_{i=1}^{N} x_i^2 = 1$, which follows from condition c)
• If $\hat\beta_{lasso} < 0$, then
$$\hat\beta_{lasso} = \underbrace{\frac{\sum_{i=1}^{N} y_i x_i}{\sum_{i=1}^{N} x_i^2}}_{\text{OLS estimator}} + \underbrace{\frac{\lambda}{2}\,\frac{1}{\sum_{i=1}^{N} x_i^2}}_{\text{upward bias}} = \hat\beta_{ols} + \frac{\lambda}{2}$$
which only applies if $\hat\beta_{lasso} < 0$
• Shrinkage estimator: in either case, the non-zero estimates are biased towards zero
Beyond the Lasso 56 / 69
Lasso Theory
• What happens if $\hat\beta_{ols}$ is small?
• Example: say, for instance, that $\hat\beta_{ols} = 0.1$ and λ = 1
– If $\hat\beta_{lasso} > 0$, then $\hat\beta_{lasso} = \hat\beta_{ols} - \frac{\lambda}{2} = 0.1 - \frac{1}{2} = -0.4$ (contradiction)
– If $\hat\beta_{lasso} < 0$, then $\hat\beta_{lasso} = \hat\beta_{ols} + \frac{\lambda}{2} = 0.1 + \frac{1}{2} = 0.6$ (contradiction)
– Then $\hat\beta_{lasso}$ must be exactly equal to zero
• Overall,
$$\hat\beta_{lasso} = \begin{cases} \hat\beta_{ols} - \frac{\lambda}{2} & \text{if } \hat\beta_{ols} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{ols} \le \frac{\lambda}{2} \\ \hat\beta_{ols} + \frac{\lambda}{2} & \text{if } \hat\beta_{ols} < -\frac{\lambda}{2} \end{cases}$$
which also means that the Lasso estimator selects coefficients as exactly zero, depending on the penalization parameter (a numerical sketch of this rule follows below)
Beyond the Lasso 57 / 69
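The resulting soft-thresholding rule is easy to code directly; a sketch assuming the orthonormal design above (so that $\hat\beta_{ols}$ is simply $x'y$):

```r
# Sketch: the soft-thresholding rule derived on the slide.
soft_threshold <- function(b_ols, lambda) {
  ifelse(abs(b_ols) <= lambda / 2, 0, b_ols - sign(b_ols) * lambda / 2)
}

soft_threshold(c(-2, -0.3, 0.1, 0.8, 3), lambda = 1)
# small OLS estimates are set exactly to zero; the others are shrunk by lambda/2
```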
Lasso Theory
$$\hat\beta_{lasso} = \begin{cases} \hat\beta_{ols} - \frac{\lambda}{2} & \text{if } \hat\beta_{ols} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{ols} \le \frac{\lambda}{2} \\ \hat\beta_{ols} + \frac{\lambda}{2} & \text{if } \hat\beta_{ols} < -\frac{\lambda}{2} \end{cases}$$
• This also means that:
– If λ → ∞, $\hat\beta_{lasso} = 0$
– If λ = 0, $\hat\beta_{lasso} = \hat\beta_{ols}$
– For λ > 0, the estimate of $\beta_1$ might or might not be zero
Beyond the Lasso 58 / 69
Lasso Theory
Beyond the Lasso 59 / 69
Lasso Theory
• In the K = 1 case, we had
$$\hat\beta_{lasso} = \begin{cases} \hat\beta_{ols} - \frac{\lambda}{2} & \text{if } \hat\beta_{ols} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{ols} \le \frac{\lambda}{2} \\ \hat\beta_{ols} + \frac{\lambda}{2} & \text{if } \hat\beta_{ols} < -\frac{\lambda}{2} \end{cases}$$
where $\hat\beta_{ols} = (X'X)^{-1}X'y = X'y$
• For any value of K, it can be shown that
$$\hat\beta_{j,lasso} = \begin{cases} \hat\beta_{j,ols} - \frac{\lambda}{2} & \text{if } \hat\beta_{j,ols} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{j,ols} \le \frac{\lambda}{2} \\ \hat\beta_{j,ols} + \frac{\lambda}{2} & \text{if } \hat\beta_{j,ols} < -\frac{\lambda}{2} \end{cases}$$
where $\hat\beta_{ols} = (X'X)^{-1}X'y = X'y$ and $\hat\beta_{j,ols} = x_j'y$
Beyond the Lasso 60 / 69
Lasso Theory
• The Lasso solution for any K,
$$\hat\beta_{j,lasso} = \begin{cases} \hat\beta_{j,ols} - \frac{\lambda}{2} & \text{if } \hat\beta_{j,ols} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le \hat\beta_{j,ols} \le \frac{\lambda}{2} \\ \hat\beta_{j,ols} + \frac{\lambda}{2} & \text{if } \hat\beta_{j,ols} < -\frac{\lambda}{2} \end{cases}$$
where $\hat\beta_{j,ols} = x_j'y$, can be summarized as
$$\hat\beta_{j,lasso} = \begin{cases} x_j'y - \frac{\lambda}{2} & \text{if } x_j'y > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \le x_j'y \le \frac{\lambda}{2} \\ x_j'y + \frac{\lambda}{2} & \text{if } x_j'y < -\frac{\lambda}{2} \end{cases} = \begin{cases} x_j'y - \mathrm{sign}(x_j'y)\cdot\frac{\lambda}{2} & \text{if } |x_j'y| > \frac{\lambda}{2} \\ 0 & \text{if } |x_j'y| \le \frac{\lambda}{2} \end{cases}$$
Beyond the Lasso 61 / 69
Lasso Theory
• The Lasso solution
$$\hat\beta_{j,lasso} = \begin{cases} x_j'y - \mathrm{sign}(x_j'y)\cdot\frac{\lambda}{2} & \text{if } |x_j'y| > \frac{\lambda}{2} \\ 0 & \text{if } |x_j'y| \le \frac{\lambda}{2} \end{cases}$$
clarifies that $\hat\beta_{j,lasso}$ will be zero if $|x_j'y| \le \frac{\lambda}{2}$ and non-zero otherwise
• Model selection refers to the capacity to correctly estimate the support. That is,
– If the true parameter is zero ($\beta_j = 0$), a zero is estimated ($\hat\beta_{j,lasso} = 0$), which happens if $|x_j'y| \le \frac{\lambda}{2}$
– If the true parameter is non-zero ($\beta_j \neq 0$), a non-zero value is estimated ($\hat\beta_{j,lasso} \neq 0$), which happens if $|x_j'y| > \frac{\lambda}{2}$
Beyond the Lasso 62 / 69
Lasso Theory
• If $\beta_j = 0$, when will $|x_j'y| \le \frac{\lambda}{2}$?
– Substituting $y = X\beta + \varepsilon$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon|$$
The latter equality follows because the covariance between columns $j_1$ and $j_2$ of X is zero if $j_1 \neq j_2$ and one if $j_1 = j_2$
– Define the set C that limits the correlation of X with the disturbance vector ε. Within the set C,
$$\max(\varepsilon'X) \le \frac{\lambda}{2}$$
To clarify, $\max(\varepsilon'X)$ is the maximum covariance between ε and each column of X,
$$\max\{\mathrm{cov}(\varepsilon, x_1), \ldots, \mathrm{cov}(\varepsilon, x_K)\}$$
and the condition is less likely to hold if λ is small
Beyond the Lasso 63 / 69
Lasso Theory
• If $\beta_j = 0$, when will $|x_j'y| \le \frac{\lambda}{2}$?
– If $\beta_j = 0$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon| = |x_j'\varepsilon| \le \frac{\lambda}{2}$$
on the set C
– The last equation means that, on the set C, if $\beta_j = 0$ then $|x_j'y| \le \frac{\lambda}{2}$, which implies that $\hat\beta_{j,lasso} = 0$
In other words, true zeros are always selected as zeros! (on the set C)
Beyond the Lasso 64 / 69
Lasso Theory
• If $\beta_j \neq 0$, when will $|x_j'y| > \frac{\lambda}{2}$?
– If $\beta_j \neq 0$,
$$|x_j'y| = |x_j'(X\beta + \varepsilon)| = |\beta_j + x_j'\varepsilon| \ge |\beta_j| - |x_j'\varepsilon|$$
using the reverse triangle inequality. On the set C,
$$|x_j'y| \ge |\beta_j| - |x_j'\varepsilon| \ge |\beta_j| - \frac{\lambda}{2}$$
– On the other hand, $\hat\beta_{j,lasso}$ is non-zero if $|x_j'y| > \frac{\lambda}{2}$
– Given that $|x_j'y| \ge |\beta_j| - \frac{\lambda}{2}$, the condition $|x_j'y| > \frac{\lambda}{2}$ will surely be satisfied if
$$|\beta_j| - \frac{\lambda}{2} > \frac{\lambda}{2}$$
because, in this case,
$$|x_j'y| \ge |\beta_j| - \frac{\lambda}{2} > \frac{\lambda}{2}$$
Beyond the Lasso 65 / 69
Lasso Theory
• If $\beta_j \neq 0$, when will $|x_j'y| > \frac{\lambda}{2}$?
– The condition
$$|\beta_j| - \frac{\lambda}{2} > \frac{\lambda}{2}$$
is equivalent to
$$|\beta_j| > \lambda$$
– So $\hat\beta_{j,lasso}$ will be non-zero as long as $|\beta_j| > \lambda$
– This also means that all non-zero $\beta_j$ will be estimated as non-zero as long as the smallest non-zero $|\beta_j|$ is greater than the penalization parameter λ,
$$\min_{j:\,\beta_j \neq 0} |\beta_j| > \lambda$$
(this is also known as the beta-min condition)
– Small but non-zero $\beta_j$ will be shrunk to zero
Beyond the Lasso 66 / 69
Lasso Theory: takeaway from model selection
• λ too low
– Intuition: many true $\beta_j = 0$ are not estimated as zeros because the penalization is too small, so the Lasso is not able to achieve model selection
– Theory: the set C, defined by $\max(\varepsilon'X) \le \frac{\lambda}{2}$, is unlikely to be satisfied, and therefore the model selection results do not apply
• λ too high
– Intuition: true $\beta_j \neq 0$ are shrunk to zero because the penalization is too high, so the Lasso is not able to achieve model selection
– Theory: the set C is likely to be satisfied, but the beta-min condition $\min_{j:\,\beta_j \neq 0}|\beta_j| > \lambda$ is not, so model selection is not achieved
• λ somewhere in the middle
– Intuition: some $\hat\beta_{j,lasso}$ are zero and some are non-zero
– Theory: the set C is likely to be satisfied, and so is the beta-min condition $\min_{j:\,\beta_j \neq 0}|\beta_j| > \lambda$, so model selection is achieved
Beyond the Lasso 67 / 69
Lasso Theory
• We have discussed model selection, which is not the same as the distance between β and $\hat\beta_{lasso}$
• On the set C, the Lasso estimator is zero on $S^C$ (the complement of the support). Therefore
$$\|\beta - \hat\beta_{lasso}\|_2 = \|\beta_S - \hat\beta_{S,lasso}\|_2 = \|\beta_S - \hat\beta_{oracle} + \hat\beta_{oracle} - \hat\beta_{S,lasso}\|_2$$
• Using the triangle inequality,
$$\|\beta - \hat\beta_{lasso}\|_2 \le \|\beta_S - \hat\beta_{oracle}\|_2 + \|\hat\beta_{oracle} - \hat\beta_{S,lasso}\|_2 = \|\beta_S - \hat\beta_{oracle}\|_2 + \sqrt{\sum_{j\in S}\left(x_j'y - \mathrm{sign}(x_j'y)\lambda - x_j'y\right)^2} = \|\beta_S - \hat\beta_{oracle}\|_2 + \underbrace{\sqrt{\sum_{j\in S}\left(\mathrm{sign}(x_j'y)\lambda\right)^2}}_{\lambda\sqrt{s}}$$
where s is the number of elements in S
• So the cost of not knowing S is $\lambda\sqrt{s}$
Beyond the Lasso 68 / 69
Lasso Theory
$$\|\beta - \hat\beta_{lasso}\|_2 \le \|\beta_S - \hat\beta_{oracle}\|_2 + \lambda\sqrt{s}$$
• The result above is also called an Oracle Inequality, because it bounds the error in the parameters by the error that would have been obtained if the true support S were known, plus an additional term
– It also highlights that the smaller the support size s, the smaller the error
– Concept of sparsity: most $\beta_j$ are equal to zero, so s is small
• Prediction loss
$$\|X(\beta - \hat\beta_{lasso})\|_2 = \sqrt{(\beta - \hat\beta_{lasso})'X'X(\beta - \hat\beta_{lasso})} = \sqrt{(\beta - \hat\beta_{lasso})'(\beta - \hat\beta_{lasso})} = \|\beta - \hat\beta_{lasso}\|_2$$
using $X'X = I_K$ from condition c, so in this special case the prediction loss coincides with the parameter estimation error
Beyond the Lasso 69 / 69