The Effect of 401k Eligibility on Financial Wealth

In this brief illustration, we use data from the 1991 wave of the Survey of Income and Program Participation (SIPP), originally analyzed by Poterba, Venti & Wise (1995), to study how the 401(k) retirement plan affects individuals’ net financial assets.

Data preparations

The main variables of interest are:

Variable	Description
`net_tfa`	Net total financial assets (outcome)
`e401`	401(k) eligibility
`p401`	401(k) participation
`age`	Age of the head of the household
`tw`	Total wealth
`inc`	Household income
`fsize`	Household size
`db`	Defined benefit pension status indicator
`marr`	Married indicator
`twoearn`	Two-earner status indicator
`pira`	IRA participation indicator
`hown`	House ownership indicator

We load the data into R/Stata and prepare the data.

R code

To illustrate the implementation of DML in R, we use the ddml package. Another popular alternative is DoubleML, see here.

library(haven)
library(ddml)
dat <- read_dta("http://dmlguide.github.io/assets/dta/PVW_data.dta")

set.seed(20241111)

# Define control variables
control_names <- c("age", "tw", "inc", "fsize", "db", "marr", 
                  "twoearn", "pira", "hown")

# Extract variables
y <- dat$net_tfa
D <- dat$e401
X <- as.matrix(dat[, control_names])

Stata code

To illustrate the implementation of DML in Stata, we use the ddml package. Note that you will also need to install pystacked, which implements a wide range of machine learners for Stata and uses Python’s sklearn internally. For pystacked to run, you will need to set up Stata’s Python integration, see instructions here.

set seed 123

use "https://dmlguide.github.io/assets/dta/PVW_data.dta", clear 

global Y net_tfa
global X age tw inc fsize db marr twoearn pira hown
global D p401
global Z e401

Model specification

Before we can run the DML estimation, we need to select a nuisance function estimator. This could be linear regression or almost any supervised machine learner (satisfying a mild congerence rate requirement).¹ While the choice of the machine learner is important and can affect the estimation results, we focus here for illustration on random forests with 1000 trees. For the outcome equation, we use a maximum tree depth of 8; for other nuisance functions we opt for a a maximum tree depth of 4.

R code

Learner settings

ddml::mdl_ranger() is a wrapper of ranger::ranger(). To learn which tuning parameter you can set, check the help file with ?ddml::mdl_ranger. A number of other wrapper commands are included in the ddml package and you can also write your own wrapper commands, see this guide.

# Define model configuration
learners = list(
  list(fun = mdl_ranger,
       args = list(num.trees = 1000, # random forest
                   max.depth = 8)))

learners_DX = list(
  list(fun = mdl_ranger,
       args = list(num.trees = 1000, # random forest
                   max.depth = 4)))

Stata code

We will save the hyper-parameters in a global macro which we use below.

global rfoutcome n_estimators(1000) max_depth(8)
global rfother n_estimators(1000) max_depth(4)

The effect of 401k elligibility

We replicate the DML estimate of the average treatment effects of 401k elligibility reported in Table 1, last column. Using the above random forest learners, we now estimate two different target paramters with DML: the partially linear regression coefficient and the average treatment effect.

R code

We use ensemble_type = "average" to disable stacking.

# DML estimation of the PLR coefficient
plm_fit <- ddml_plm(y, D, X,
                     learners = learners,
                     learners_DX = learners_DX,
                     sample_folds = 10,
                     ensemble_type = "average")

# DML estimation of the ATE 
ate_fit <- ddml_ate(y, D, X,
                     learners = learners,
                     learners_DX = learners_DX,
                     sample_folds = 10,
                     ensemble_type = "average",
                     trim = 0.001)
summary(ate_fit)

# create texreg-compatible object from `late_lasso`
library(texreg)
ate_texreg <- createTexreg(coef.names="elligibility",
                              coef=summary(ate_fit)[1],
                              se=summary(ate_fit)[2],
                              pvalues=summary(ate_fit)[4]
                              )

screenreg(list(plm_fit$ols_fit,ate_texreg),
          include.ci=FALSE,
          custom.coef.map=list("D_r"="elligibility",
                               "elligibility"="elligibility") 
          )

R output

======================================
              Model 1      Model 2    
--------------------------------------
elligibility  7825.37 ***  6877.41 ***
              (824.28)     (775.73)   
--------------------------------------
R^2              0.01                 
Adj. R^2         0.01                 
Num. obs.     9915                    
======================================
*** p < 0.001; ** p < 0.01; * p < 0.05

Stata code

*** partial regression coefficient 
ddml init partial, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m1: ///
ddml estimate, robust

*** ATE coefficient
ddml init partial, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m2: ///
ddml estimate, robust

Stata output

. esttab m1 m2

--------------------------------------------
                      (1)             (2)   
                  net_tfa         net_tfa   
--------------------------------------------
e401               7139.8***       7147.0***
                   (8.04)          (8.23)   

_cons              -58.85          -3.273   
                  (-0.17)         (-0.01)   
--------------------------------------------
N                    9915            9915   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

The effect of 401k participation

The data also includes information on whether individuals signed up to the 401k scheme (p401). This allows us to estimate two other parameters: the local average treatment effect and the partially linear IV regression coefficient. In both cases, we leverage e401 as an instrument and assume that e401 is exogenous conditional on the set of covariates.

R code

# DML estimation of the PLR coefficient
pliv_fit <- ddml_pliv(y, D, Z, X,
                    learners = learners,
                    sample_folds = 10,
                    ensemble_type = "average"
)

# DML estimation of the ATE 
late_fit <- ddml_late(y, D, Z, X,
                    learners = learners, 
                    sample_folds = 10,
                    ensemble_type = "average",
                    trim = 0.001
                    )
# create screenreg-compatible object from `late_fit`
late_texreg <- createTexreg(coef.names="participation",
                           coef=summary(late_fit)[1],
                           se=summary(late_fit)[2],
                           pvalues=summary(late_fit)[4]
                           )
 
screenreg(list(pliv_fit$iv_fit,late_texreg),
          include.ci=FALSE,
          custom.coef.map=list("D_r"="participation",
                               "participation"="participation") 
          )

R output

=========================================
               Model 1       Model 2     
-----------------------------------------
participation  10968.64 ***   9766.53 ***
               (1212.56)     (1210.69)   
-----------------------------------------
R^2                0.02                  
Adj. R^2           0.02                  
Num. obs.       9915                     
=========================================
*** p < 0.001; ** p < 0.01; * p < 0.05

Stata code

*** partial IV regression coefficient 
ddml init iv, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $D $X , type(reg) methods(rf) cmdopt1($rfother)
ddml E[Z|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m3: ///
ddml estimate, robust

*** LATE coefficient
ddml init interactiveiv, kfolds(10)
ddml E[Y|X,Z]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X,Z]: pystacked $D $X , type(reg) methods(rf) cmdopt1($rfother)
ddml E[Z|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m4: ///
ddml estimate, robust

Stata output

. esttab m3 m4

--------------------------------------------
                      (1)             (2)   
                  net_tfa         net_tfa   
--------------------------------------------
p401              10034.0***       9244.3***
                   (8.05)          (8.09)   

_cons              -28.05                   
                  (-0.08)                   
--------------------------------------------
N                    9915            9915   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

We have demonstrated that we can target a wide range of target parameters using DML estimators. However, we shouldn’t take the results too seriously at this stage as we should validate our results by considering alternative nuisance function estimators, change the number of cross-fitting folds and repeat the cross-fitting procedure. We will do this in the next application, see Understanding Cultural Persistence and Change.

See Section 3 in the paper. ↩