The Effect of 401k Eligibility on Financial Wealth
In this brief illustration, we use data from the 1991 wave of the Survey of Income and Program Participation (SIPP), originally analyzed by Poterba, Venti & Wise (1995), to study how the 401(k) retirement plan affects individuals’ net financial assets.
Data preparations
The main variables of interest are:
Variable | Description |
---|---|
net_tfa | Net total financial assets (outcome) |
e401 | 401(k) eligibility |
p401 | 401(k) participation |
age | Age of the head of the household |
tw | Total wealth |
inc | Household income |
fsize | Household size |
db | Defined benefit pension status indicator |
marr | Married indicator |
twoearn | Two-earner status indicator |
pira | IRA participation indicator |
hown | House ownership indicator |
We load the data into R/Stata and prepare the data.
R code
To illustrate the implementation of DML in R, we use the ddml
package. Another popular alternative is DoubleML
, see here.
library(haven)
library(ddml)
dat <- read_dta("http://dmlguide.github.io/assets/dta/PVW_data.dta")
set.seed(20241111)
# Define control variables
control_names <- c("age", "tw", "inc", "fsize", "db", "marr",
"twoearn", "pira", "hown")
# Extract variables
y <- dat$net_tfa
D <- dat$e401
X <- as.matrix(dat[, control_names])
Stata code
To illustrate the implementation of DML in Stata, we use the ddml
package. Note that you will also need to install pystacked
, which implements a wide range of machine learners for Stata and uses Python’s sklearn
internally. For pystacked
to run, you will need to set up Stata’s Python integration, see instructions here.
set seed 123
use "https://dmlguide.github.io/assets/dta/PVW_data.dta", clear
global Y net_tfa
global X age tw inc fsize db marr twoearn pira hown
global D p401
global Z e401
Model specification
Before we can run the DML estimation, we need to select a nuisance function estimator. This could be linear regression or almost any supervised machine learner (satisfying a mild congerence rate requirement).1 While the choice of the machine learner is important and can affect the estimation results, we focus here for illustration on random forests with 1000 trees. For the outcome equation, we use a maximum tree depth of 8; for other nuisance functions we opt for a a maximum tree depth of 4.
R code
Learner settings
ddml::mdl_ranger()
is a wrapper ofranger::ranger()
. To learn which tuning parameter you can set, check the help file with?ddml::mdl_ranger
. A number of other wrapper commands are included in theddml
package and you can also write your own wrapper commands, see this guide.
# Define model configuration
learners = list(
list(fun = mdl_ranger,
args = list(num.trees = 1000, # random forest
max.depth = 8)))
learners_DX = list(
list(fun = mdl_ranger,
args = list(num.trees = 1000, # random forest
max.depth = 4)))
Stata code
We will save the hyper-parameters in a global macro which we use below.
global rfoutcome n_estimators(1000) max_depth(8)
global rfother n_estimators(1000) max_depth(4)
The effect of 401k elligibility
We replicate the DML estimate of the average treatment effects of 401k elligibility reported in Table 1, last column. Using the above random forest learners, we now estimate two different target paramters with DML: the partially linear regression coefficient and the average treatment effect.
R code
We use ensemble_type = "average"
to disable stacking.
# DML estimation of the PLR coefficient
plm_fit <- ddml_plm(y, D, X,
learners = learners,
learners_DX = learners_DX,
sample_folds = 10,
ensemble_type = "average")
# DML estimation of the ATE
ate_fit <- ddml_ate(y, D, X,
learners = learners,
learners_DX = learners_DX,
sample_folds = 10,
ensemble_type = "average",
trim = 0.001)
summary(ate_fit)
# create texreg-compatible object from `late_lasso`
library(texreg)
ate_texreg <- createTexreg(coef.names="elligibility",
coef=summary(ate_fit)[1],
se=summary(ate_fit)[2],
pvalues=summary(ate_fit)[4]
)
screenreg(list(plm_fit$ols_fit,ate_texreg),
include.ci=FALSE,
custom.coef.map=list("D_r"="elligibility",
"elligibility"="elligibility")
)
R output
======================================
Model 1 Model 2
--------------------------------------
elligibility 7825.37 *** 6877.41 ***
(824.28) (775.73)
--------------------------------------
R^2 0.01
Adj. R^2 0.01
Num. obs. 9915
======================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Stata code
*** partial regression coefficient
ddml init partial, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m1: ///
ddml estimate, robust
*** ATE coefficient
ddml init partial, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m2: ///
ddml estimate, robust
Stata output
. esttab m1 m2
--------------------------------------------
(1) (2)
net_tfa net_tfa
--------------------------------------------
e401 7139.8*** 7147.0***
(8.04) (8.23)
_cons -58.85 -3.273
(-0.17) (-0.01)
--------------------------------------------
N 9915 9915
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
The effect of 401k participation
The data also includes information on whether individuals signed up to the 401k scheme (p401
). This allows us to estimate two other parameters: the local average treatment effect and the partially linear IV regression coefficient. In both cases, we leverage e401
as an instrument and assume that e401
is exogenous conditional on the set of covariates.
R code
# DML estimation of the PLR coefficient
pliv_fit <- ddml_pliv(y, D, Z, X,
learners = learners,
sample_folds = 10,
ensemble_type = "average"
)
# DML estimation of the ATE
late_fit <- ddml_late(y, D, Z, X,
learners = learners,
sample_folds = 10,
ensemble_type = "average",
trim = 0.001
)
# create screenreg-compatible object from `late_fit`
late_texreg <- createTexreg(coef.names="participation",
coef=summary(late_fit)[1],
se=summary(late_fit)[2],
pvalues=summary(late_fit)[4]
)
screenreg(list(pliv_fit$iv_fit,late_texreg),
include.ci=FALSE,
custom.coef.map=list("D_r"="participation",
"participation"="participation")
)
R output
=========================================
Model 1 Model 2
-----------------------------------------
participation 10968.64 *** 9766.53 ***
(1212.56) (1210.69)
-----------------------------------------
R^2 0.02
Adj. R^2 0.02
Num. obs. 9915
=========================================
*** p < 0.001; ** p < 0.01; * p < 0.05
Stata code
*** partial IV regression coefficient
ddml init iv, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $D $X , type(reg) methods(rf) cmdopt1($rfother)
ddml E[Z|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m3: ///
ddml estimate, robust
*** LATE coefficient
ddml init interactiveiv, kfolds(10)
ddml E[Y|X,Z]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X,Z]: pystacked $D $X , type(reg) methods(rf) cmdopt1($rfother)
ddml E[Z|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m4: ///
ddml estimate, robust
Stata output
. esttab m3 m4
--------------------------------------------
(1) (2)
net_tfa net_tfa
--------------------------------------------
p401 10034.0*** 9244.3***
(8.05) (8.09)
_cons -28.05
(-0.08)
--------------------------------------------
N 9915 9915
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
We have demonstrated that we can target a wide range of target parameters using DML estimators. However, we shouldn’t take the results too seriously at this stage as we should validate our results by considering alternative nuisance function estimators, change the number of cross-fitting folds and repeat the cross-fitting procedure. We will do this in the next application, see Understanding Cultural Persistence and Change.
-
See Section 3 in the paper. ↩