The Effect of 401k Eligibility on Financial Wealth
In this brief illustration, we use data from the 1991 wave of the Survey of Income and Program Participation (SIPP), originally analyzed by Poterba, Venti & Wise (1995), to study how the 401(k) retirement plan affects individuals’ net financial assets.
Data preparations
The main variables of interest are:
| Variable | Description |
|---|---|
net_tfa | Net total financial assets (outcome) |
e401 | 401(k) eligibility |
p401 | 401(k) participation |
age | Age of the head of the household |
tw | Total wealth |
inc | Household income |
fsize | Household size |
db | Defined benefit pension status indicator |
marr | Married indicator |
twoearn | Two-earner status indicator |
pira | IRA participation indicator |
hown | House ownership indicator |
We load the data into R/Stata and prepare the data.
Code
To illustrate the implementation of DML in Stata, we use the ddml package. Note that you will also need to install pystacked, which implements a wide range of machine learners for Stata and uses Python’s sklearn internally. For pystacked to run, you will need to set up Stata’s Python integration, see instructions here.
clear all
set more off
set seed 20241111
use "https://dmlguide.github.io/assets/dta/PVW_data.dta", clear
global Y net_tfa
global X age tw inc fsize db marr twoearn pira hown
global D p401
global Z e401
To illustrate the implementation of DML in R, we use the ddml package. Another popular alternative is DoubleML, see here.
library(haven)
library(ddml)
library(ranger)
library(tidymodels)
dat <- read_dta("https://dmlguide.github.io/assets/dta/PVW_data.dta")
set.seed(20241111)
# Define control variables
control_names <- c("age", "tw", "inc", "fsize", "db", "marr",
"twoearn", "pira", "hown")
# Extract variables
y <- dat$net_tfa
D <- dat$e401 # eligibility
D_part <- dat$p401 # participation
Z <- dat$e401 # instrument
X <- as.matrix(dat[, control_names])
To illustrate the implementation of DML in Python, we use the DoubleML package, which integrates directly with scikit-learn learners.
import numpy as np
import pandas as pd
from doubleml import (DoubleMLData, DoubleMLPLR, DoubleMLIRM,
DoubleMLPLIV, DoubleMLIIVM)
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.base import clone
dat = pd.read_stata("https://dmlguide.github.io/assets/dta/PVW_data.dta")
np.random.seed(20241111)
# Define control variables
control_names = ["age", "tw", "inc", "fsize", "db", "marr",
"twoearn", "pira", "hown"]
Model specification
Before we can run the DML estimation, we need to select a nuisance function estimator. This could be linear regression or almost any supervised machine learner (satisfying a mild congerence rate requirement).1 While the choice of the machine learner is important and can affect the estimation results, we focus here for illustration on random forests with 1000 trees. For the outcome equation, we use a maximum tree depth of 8; for other nuisance functions we opt for a a maximum tree depth of 4.
Code
We will save the hyper-parameters in a global macro which we use below.
// Random forest with 1000 trees: depth 8 for the outcome equation,
// depth 4 for the remaining nuisance functions.
global rfoutcome n_estimators(1000) max_depth(8)
global rfother n_estimators(1000) max_depth(4)
Learner settings
ddml::mdl_ranger()is a wrapper ofranger::ranger(). To learn which tuning parameter you can set, check the help file with?ddml::mdl_ranger. A number of other wrapper commands are included in theddmlpackage and you can also write your own wrapper commands, see this guide.
# Random forest with 1000 trees: depth 8 for the outcome equation,
# depth 4 for the remaining nuisance functions.
learners <- list(
list(fun = mdl_ranger,
args = list(num.trees = 1000,
max.depth = 8))
)
learners_DX <- list(
list(fun = mdl_ranger,
args = list(num.trees = 1000,
max.depth = 4))
)
Learner settings
DoubleMLaccepts anyscikit-learnestimator as a nuisance learner. We passRandomForestRegressorfor continuous targets andRandomForestClassifierfor the binary treatment / instrument.
# Random forest with 1000 trees: depth 8 for the outcome equation,
# depth 4 for the remaining nuisance functions. We use regressors for
# continuous targets and classifiers for the binary treatment / instrument.
ml_outcome_reg = RandomForestRegressor(n_estimators=1000, max_depth=8,
random_state=20241111, n_jobs=-1)
ml_other_reg = RandomForestRegressor(n_estimators=1000, max_depth=4,
random_state=20241111, n_jobs=-1)
ml_other_cls = RandomForestClassifier(n_estimators=1000, max_depth=4,
random_state=20241111, n_jobs=-1)
The effect of 401k elligibility
We replicate the DML estimate of the average treatment effects of 401k elligibility reported in Table 1, last column. Using the above random forest learners, we now estimate two different target paramters with DML: the partially linear regression coefficient and the average treatment effect.
Code
// PLR: partially linear regression coefficient
ddml init partial, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m_plr: ddml estimate, robust
// ATE: average treatment effect (interactive model)
ddml init interactive, kfolds(10)
ddml E[Y|X,D]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $Z $X , type(class) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m_ate: ddml estimate, robust trim(0.001)
esttab m_plr m_ate, mtitles("PLR" "ATE") ///
b(%9.2f) se(%9.2f) star(* 0.05 ** 0.01 *** 0.001) ///
stats(N, fmt(%9.0f) labels("Num. obs."))
We use ensemble_type = "average" to disable stacking.
# DML estimation of the partially linear regression (PLR) coefficient
plm_fit <- ddml_plm(y, D, X,
learners = learners,
learners_DX = learners_DX,
sample_folds = 10,
ensemble_type = "average")
# DML estimation of the average treatment effect (ATE)
ate_fit <- ddml_ate(y, D, X,
learners = learners,
learners_DX = learners_DX,
sample_folds = 10,
ensemble_type = "average",
trim = 0.001)
# Tidy comparison of the two estimands using ddml::tidy()
elig_tbl <- bind_rows(
PLR = tidy(plm_fit) |> filter(term == "D1"),
ATE = tidy(ate_fit),
.id = "model"
) |> as_tibble()
print(elig_tbl)
# Build the DoubleMLData object (treatment = e401)
dml_data_e = DoubleMLData(dat, y_col="net_tfa", d_cols="e401",
x_cols=control_names)
# PLR: partially linear regression coefficient
plr_fit = DoubleMLPLR(dml_data_e,
ml_l=clone(ml_outcome_reg),
ml_m=clone(ml_other_reg),
n_folds=10)
plr_fit.fit()
# ATE: average treatment effect (interactive regression model)
irm_fit = DoubleMLIRM(dml_data_e,
ml_g=clone(ml_outcome_reg),
ml_m=clone(ml_other_cls),
n_folds=10,
trimming_threshold=0.001)
irm_fit.fit()
# Side-by-side comparison of the two estimands
def compare(fits, labels, row_label, n_obs):
fmt = lambda x: f"{x:>12.2f}"
rows = [
f"{row_label:<14}" + "".join(f"{l:>12}" for l in labels),
f" estimate " + "".join(fmt(f.coef[0]) for f in fits),
f" (SE) " + "".join(fmt(f.se[0]) for f in fits),
f" p-value " + "".join(fmt(f.pval[0]) for f in fits),
f" Num. obs. " + "".join(f"{n_obs:>12d}" for _ in fits),
]
return "\n".join(rows)
print(compare([plr_fit, irm_fit], ["PLR", "ATE"],
"eligibility", dml_data_e.n_obs))
Output
--------------------------------------------
(1) (2)
PLR ATE
--------------------------------------------
e401 7061.52*** 6099.18***
(857.95) (780.94)
_cons 1.11
(338.38)
--------------------------------------------
Num. obs. 9915 9915
--------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
# A tibble: 2 × 7
model term estimate std.error statistic p.value ensemble_type
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 PLR D1 7826. 907. 8.63 6.10e-18 average
2 ATE ATE 6901. 771. 8.95 3.63e-19 average
eligibility PLR ATE
estimate 6737.64 6344.02
(SE) 909.32 719.96
p-value 0.00 0.00
Num. obs. 9915 9915
The effect of 401k participation
The data also includes information on whether individuals signed up to the 401k scheme (p401). This allows us to estimate two other parameters: the local average treatment effect and the partially linear IV regression coefficient. In both cases, we leverage e401 as an instrument and assume that e401 is exogenous conditional on the set of covariates.
Code
// PLIV: partially linear IV regression coefficient
ddml init iv, kfolds(10)
ddml E[Y|X]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X]: pystacked $D $X , type(reg) methods(rf) cmdopt1($rfother)
ddml E[Z|X]: pystacked $Z $X , type(reg) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m_pliv: ddml estimate, robust
// LATE: local average treatment effect (interactive IV model)
ddml init interactiveiv, kfolds(10)
ddml E[Y|X,Z]: pystacked $Y $X , type(reg) methods(rf) cmdopt1($rfoutcome)
ddml E[D|X,Z]: pystacked $D $X , type(class) methods(rf) cmdopt1($rfother)
ddml E[Z|X]: pystacked $Z $X , type(class) methods(rf) cmdopt1($rfother)
ddml crossfit
eststo m_late: ddml estimate, robust trim(0.001)
esttab m_pliv m_late, mtitles("PLIV" "LATE") ///
b(%9.2f) se(%9.2f) star(* 0.05 ** 0.01 *** 0.001) ///
stats(N, fmt(%9.0f) labels("Num. obs."))
# DML estimation of the partially linear IV regression (PLIV) coefficient
pliv_fit <- ddml_pliv(y, D_part, Z, X,
learners = learners,
sample_folds = 10,
ensemble_type = "average")
# DML estimation of the local average treatment effect (LATE)
late_fit <- ddml_late(y, D_part, Z, X,
learners = learners,
sample_folds = 10,
ensemble_type = "average",
trim = 0.001)
# Tidy comparison of the two estimands using ddml::tidy()
part_tbl <- bind_rows(
PLIV = tidy(pliv_fit) |> filter(term == "D1"),
LATE = tidy(late_fit),
.id = "model"
) |> as_tibble()
print(part_tbl)
# Build the DoubleMLData object with e401 as instrument for p401
dml_data_p = DoubleMLData(dat, y_col="net_tfa", d_cols="p401",
z_cols="e401", x_cols=control_names)
# PLIV: partially linear IV regression coefficient
pliv_fit = DoubleMLPLIV(dml_data_p,
ml_l=clone(ml_outcome_reg),
ml_m=clone(ml_other_reg),
ml_r=clone(ml_other_reg),
n_folds=10)
pliv_fit.fit()
# LATE: local average treatment effect (interactive IV model)
iivm_fit = DoubleMLIIVM(dml_data_p,
ml_g=clone(ml_outcome_reg),
ml_m=clone(ml_other_cls),
ml_r=clone(ml_other_cls),
n_folds=10,
trimming_threshold=0.001)
iivm_fit.fit()
print(compare([pliv_fit, iivm_fit], ["PLIV", "LATE"],
"participation", dml_data_p.n_obs))
Output
--------------------------------------------
(1) (2)
PLIV LATE
--------------------------------------------
p401 9896.56*** 9313.53***
(1250.05) (1148.57)
_cons -123.79
(335.92)
--------------------------------------------
Num. obs. 9915 9915
--------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
# A tibble: 2 × 7
model term estimate std.error statistic p.value ensemble_type
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 PLIV D1 10746. 1323. 8.12 4.59e-16 average
2 LATE LATE 9611. 1205. 7.98 1.51e-15 average
participation PLIV LATE
estimate 10375.66 9343.24
(SE) 1290.76 1045.90
p-value 0.00 0.00
Num. obs. 9915 9915
We have demonstrated that we can target a wide range of target parameters using DML estimators. However, we shouldn’t take the results too seriously at this stage as we should validate our results by considering alternative nuisance function estimators, change the number of cross-fitting folds and repeat the cross-fitting procedure. We will do this in the next application, see Understanding Cultural Persistence and Change.
-
See Section 3 in the paper. ↩