Understanding Cultural Persistence and Change
In this example, we illustrate DML estimation of the partially linear model using a simple empirical example. The application is drawn from Giuliano and Nunn (2021, henceforth GN), who look at the relationship between climate instability and cultural persistence using a number of different datasets.
GN draw on evolutionary anthropology to argue that when the environment changes slowly across generations, traditions inherited from one’s ancestors carry information that is still useful today, so societies place greater weight on maintaining them. When the environment is volatile, inherited customs are less reliable guides and tradition matters less. Using paleoclimatic temperature and drought data covering 500–1900 CE, GN construct a measure of cross-generational climatic instability — defined as the the standard deviation of 20-year climate averages across 70 generations — and link it to ancestral groups via their pre-industrial locations.
Estimation Strategy
Their first estimation strategy uses cross-country regressions where the dependent variable ($Y$) is a measure of the importance of tradition taken from the World Values Survey, and the causal variable of interest ($D$) is a measure of ancestral climatic instability. The dataset is quite small: only 74 countries.
The example is useful for illustrating how DML works for several reasons: the dataset is small and available online, visualization is easy, reproduction of results is straightforward, and the example shows how DML can be used as a robustness check even in the simplest of settings. The GN dataset used for this demonstration is available here.
GN are concerned about omitted confounders, and include 4 controls to address the issue. The outcome, treatment and control variables are:
| Variable | Description |
|---|---|
A198new | The outcome variable of interest: country-level average of the self-reported importance of tradition. Ranges from 1 to 6 (bigger=more important). |
sd_EE | The causal variable of interest: a measure of ancestral climatic instability (standard deviation of temperature anomaly measure across generations; see GN for details). |
v104_ee | Control #1: distance from the equator. |
settlement_ee | Control #2: early economic development (proxied by complexity of settlements). |
polhierarchies_ee | Control #3: political hierarchies (a measure of political centralization). |
loggdp | Control #4: log GDP per capita in the country of origin at the time of the survey. |
Their model with controls is one where the controls enter linearly and is estimated using OLS. The effect of climatic instability on the importance of tradition is negative, with a coefficient that is different from zero at conventional significance levels.
Replication
We start by reproducing GN’s Figure 5 — a simple scatterplot of the bivariate relationship — and then add the four control variables to reproduce column (4).
GN’s Figure 5 visualizes the unconditional relationship between climatic instability and the importance of tradition. The fitted line is the OLS coefficient from column (3): $-1.92$ ($s.e.=0.523$).
Code
// Figure 5 visualises the bivariate (no-controls) relationship — column (3).
reg A198new sd_EE, robust
predict hat
twoway (scatter A198new sd_EE, msymbol(circle) msize(medium)) ///
(line hat sd_EE, sort lwidth(medium)), ///
xlabel(0(0.25)0.5) ylabel(3(1)6) ///
xtitle("Climatic instability") ytitle("Importance of tradition") ///
legend(off) ///
text(2.85 0 "(coef = -1.92, t = -3.68)", placement(east))
drop hat
# Figure 5 visualises the bivariate (no-controls) relationship — column (3).
model_bivariate <- lm(A198new ~ sd_EE, data = gn)
gn$hat <- fitted(model_bivariate)
p_fig5 <- ggplot(gn, aes(x = sd_EE, y = A198new, label = isocode)) +
geom_point(color = "#1f77b4", size = 2) +
geom_text(color = "#1f77b4", vjust = -0.6, size = 3) +
geom_line(aes(y = hat), linewidth = 1.2, color = "#B1004F") +
scale_x_continuous(breaks = c(0, 0.25, 0.5), limits = c(0, 0.5)) +
scale_y_continuous(breaks = c(3, 4, 5, 6)) +
coord_cartesian(ylim = c(2.8, 6)) +
labs(x = "Climatic instability", y = "Importance of tradition") +
theme_minimal() +
theme(
panel.grid = element_blank(),
legend.position = "none",
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)
) +
annotate("text", x = 0, y = 2.85,
label = "(coef = -1.92, t = -3.68)", hjust = 0, size = 4)
print(p_fig5)
gn$hat <- NULL
# Figure 5 visualises the bivariate (no-controls) relationship — column (3).
X_bi = sm.add_constant(gn[["sd_EE"]], has_constant="add")
model_bivariate = sm.OLS(gn["A198new"].astype(float), X_bi).fit()
gn["hat"] = model_bivariate.predict(X_bi)
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(gn["sd_EE"], gn["A198new"], s=30)
for _, row in gn.iterrows():
ax.text(row["sd_EE"], row["A198new"], str(row["isocode"]),
fontsize=8, va="bottom", ha="center")
tmp = gn.sort_values("sd_EE")
ax.plot(tmp["sd_EE"], tmp["hat"], linewidth=2)
ax.set_xlim(0, 0.5)
ax.set_xticks([0, 0.25, 0.5])
ax.set_ylim(2.8, 6.0)
ax.set_yticks([3, 4, 5, 6])
ax.set_xlabel("Climatic instability")
ax.set_ylabel("Importance of tradition")
ax.grid(False)
ax.text(0, 2.85, "(coef = -1.92, t = -3.68)", fontsize=10, ha="left", va="bottom")
plt.tight_layout()
plt.show()
gn = gn.drop(columns=["hat"])
Figure



Adding the four control variables yields GN’s headline column-(4) estimate of $-1.824$ ($s.e.=0.696$). They interpret the result as follows (p. 155):
Based on the estimates from column 4, a one-standard-deviation increase in cross-generational instability (0.11) is associated with a reduction in the tradition index of 1.824×0.11=0.20, which is 36% of a standard deviation of the tradition variable.
Code
clear all
set more off
set seed 42
// Load data
use "https://dmlguide.github.io/assets/dta/GN2021.dta", clear
// Shorten the name of the causal variable of interest to sd_EE
gen sd_EE = sd_of_gen_mean_temp_anom_EE
// Replicate Table 1, column (4): OLS with the 4 controls, HC1 robust SE.
// One country has loggdp missing; reg drops it automatically (N = 74).
reg A198new sd_EE v104_ee settlement_ee polhierarchies_ee loggdp, robust
set.seed(42)
library(haven)
library(dplyr)
library(sandwich)
library(lmtest)
library(ggplot2)
library(ddml)
library(ranger)
library(glmnet)
# Load data
gn <- read_dta("https://dmlguide.github.io/assets/dta/GN2021.dta")
# Shorten the name of the causal variable of interest to sd_EE
gn$sd_EE <- gn$sd_of_gen_mean_temp_anom_EE
# Replicate Table 1, column (4): OLS with the 4 controls, HC1 robust SE.
# lm() drops the country with missing loggdp automatically (N = 74).
model_full <- lm(A198new ~ sd_EE + v104_ee + settlement_ee +
polhierarchies_ee + loggdp, data = gn)
coeftest(model_full, vcov = vcovHC(model_full, type = "HC1"))
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Load data
gn = pd.read_stata("https://dmlguide.github.io/assets/dta/GN2021.dta")
# Shorten the name of the causal variable of interest to sd_EE
gn["sd_EE"] = gn["sd_of_gen_mean_temp_anom_EE"]
# Replicate Table 1, column (4): OLS with the 4 controls, HC1 robust SE.
# Drop the country with missing loggdp (N goes from 75 to 74).
gn_full = gn.dropna(subset=["loggdp"]).copy()
controls = ["v104_ee", "settlement_ee", "polhierarchies_ee", "loggdp"]
X_full = sm.add_constant(gn_full[["sd_EE"] + controls], has_constant="add")
y_full = gn_full["A198new"].astype(float)
model_full = sm.OLS(y_full, X_full).fit()
model_full_robust = model_full.get_robustcov_results(cov_type="HC1")
print(model_full_robust.summary())
Output
Linear regression Number of obs = 74
F(5, 68) = 7.63
Prob > F = 0.0000
R-squared = 0.3876
Root MSE = .4485
------------------------------------------------------------------------------
| Robust
A198new | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sd_EE | -1.823516 .6958312 -2.62 0.011 -3.212026 -.4350068
v104_ee | .0054164 .0053679 1.01 0.317 -.0052951 .0161279
settlemen~ee | -.0651081 .0347562 -1.87 0.065 -.134463 .0042467
polhierar~ee | .0127799 .0971405 0.13 0.896 -.181061 .2066208
loggdp | -.1648961 .0486385 -3.39 0.001 -.2619527 -.0678394
_cons | 6.576112 .5359645 12.27 0.000 5.506611 7.645613
------------------------------------------------------------------------------
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5761120 0.5359645 12.2697 < 2.2e-16 ***
sd_EE -1.8235164 0.6958312 -2.6206 0.010817 *
v104_ee 0.0054164 0.0053679 1.0090 0.316534
settlement_ee -0.0651081 0.0347562 -1.8733 0.065328 .
polhierarchies_ee 0.0127799 0.0971405 0.1316 0.895720
loggdp -0.1648961 0.0486385 -3.3902 0.001166 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
OLS Regression Results
==============================================================================
Dep. Variable: A198new R-squared: 0.388
Model: OLS Adj. R-squared: 0.343
Method: Least Squares F-statistic: 7.626
Date: Tue, 02 Jun 2026 Prob (F-statistic): 9.77e-06
Time: 22:49:21 Log-Likelihood: -42.536
No. Observations: 74 AIC: 97.07
Df Residuals: 68 BIC: 110.9
Df Model: 5
Covariance Type: HC1
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 6.5761 0.536 12.270 0.000 5.507 7.646
sd_EE -1.8235 0.696 -2.621 0.011 -3.212 -0.435
v104_ee 0.0054 0.005 1.009 0.317 -0.005 0.016
settlement_ee -0.0651 0.035 -1.873 0.065 -0.134 0.004
polhierarchies_ee 0.0128 0.097 0.132 0.896 -0.181 0.207
loggdp -0.1649 0.049 -3.390 0.001 -0.262 -0.068
==============================================================================
Omnibus: 3.680 Durbin-Watson: 1.959
Prob(Omnibus): 0.159 Jarque-Bera (JB): 3.823
Skew: -0.118 Prob(JB): 0.148
Kurtosis: 4.088 Cond. No. 480.
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)
DML as a robustness check
A useful robustness check is to ask whether the results with controls are sensitive to the assumption that the controls enter linearly. DML can help answer this question.
Below we estimate the model with controls using DML and the partially-linear model (PLM). The specification is still very simple: we keep the assumption that the controls enter separately from the treatment, but we drop linearity.
The estimation procedure takes the following steps:
- Specify the target estimand: in this case, the PLM coefficient.
- Specify the learners: choose the learners for the conditional expectation functions for the outcome variable and the causal variable of interest.
- Cross-fitting: obtain cross-fit estimates of the two conditional expectations.
- Estimation of structural parameter: regress the residualized outcome variable on the residualized causal variable of interest.
We have no strong priors on whether a linear or non-linear learner is more appropriate, or whether regularization is needed at all. Hence we use 3 learners in this example:
- unregularized OLS,
- cross-validated Lasso,
- and a random forest.
We also use model averaging to combine the estimated conditional expectations from these 3 learners. Specifically, we use short-stacking which is a computationally cheap and fast way of pairing DML and model averaging (see Ahrens et al. (2025) for a description of the algorithm and other model averaging strategies). Also to make the example run quickly the initial estimation is done just once, i.e., we do not resample (repeat the cross-fit based on different splits).
The dataset is small, and so we choose 10-fold cross-fitting. This means that learners are trained on about 66-67 observations, and OOS predictions are obtained for the remaining 7-8 observations. (If the dataset were larger, we could also consider a smaller number of folds to save computational time.)
Code
// In the GN estimations, GDP is missing for one country.
// For simplicity, we just drop this one country so that it is never used.
drop if missing(loggdp)
// To make the code more readable, define locals for Y, D and X:
local Y_var "A198new"
local D_var "sd_EE"
local X_vars "v104_ee settlement_ee polhierarchies_ee loggdp"
// In Stata, we now use ddml with pystacked for OLS, cross-validated lasso,
// and random forest learners combined via short-stacking (nnls1).
// Step 1: Initialize ddml for partially linear model
ddml init partial, kfolds(10)
// Step 2: Specify the model components
// Outcome equation
ddml E[Y|X]: pystacked `Y_var' `X_vars', ///
type(reg) methods(ols lassocv rf)
// Treatment equation
ddml E[D|X]: pystacked `D_var' `X_vars', ///
type(reg) methods(ols lassocv rf)
// Step 3: Cross-fit with short-stacking
// Set seed for this single split
set seed 12345
ddml crossfit, shortstack nostdstack
// Step 4: Estimate the partially linear model
ddml estimate, robust
# In the GN estimations, GDP is missing for one country.
# For simplicity, we just drop this one country so that it is never used.
gn <- gn |> filter(!is.na(loggdp))
# To make the code more readable, define locals for Y, D and X:
Y_var <- "A198new"
D_var <- "sd_EE"
X_vars <- c("v104_ee", "settlement_ee", "polhierarchies_ee", "loggdp")
# In R, we now use ddml with learners for OLS, cross-validated lasso,
# and random forest combined via short-stacking (nnls1).
# Prepare data matrices
Y <- gn[[Y_var]]
D <- gn[[D_var]]
X <- as.matrix(gn[, X_vars])
# Step 1: Specify learners for partially linear model
learners <- list(
list(fun = ols),
list(fun = mdl_glmnet, args = list(alpha = 1, nfolds = 5)), # lasso (5-fold CV)
list(fun = mdl_ranger, args = list(num.trees = 500, min.node.size = 5))
)
# Step 2: Set seed for this single split
set.seed(12345)
# Step 3: Cross-fit with short-stacking
ddml_fit <- ddml_plm(
y = Y,
D = D,
X = X,
learners = learners,
sample_folds = 10,
ensemble_type = "nnls1",
shortstack = TRUE,
cv_folds = 5,
silent = FALSE
)
# Step 4: Estimate the partially linear model
summary(ddml_fit, type = "HC1")
# DoubleML does not natively implement short-stacking, so we hand-roll
# the cross-fitting with sklearn and combine the per-learner OOF
# predictions via non-negative least squares constrained to sum to one
# (the "nnls1" stacking used by ddml / pystacked).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from scipy.optimize import minimize
def stack_nnls1(A, b):
"""Solve min ||A w - b||^2 subject to w >= 0 and sum(w) = 1."""
k = A.shape[1]
res = minimize(lambda w: np.sum((A @ w - b) ** 2),
x0=np.full(k, 1.0 / k),
method="SLSQP",
bounds=[(0.0, 1.0)] * k,
constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
options={"ftol": 1e-12})
return res.x
# In the GN estimations, GDP is missing for one country.
# For simplicity, we drop this one country so that it is never used.
gn = gn.dropna(subset=["loggdp"]).copy().reset_index(drop=True)
# To make the code more readable, define variables for Y, D and X:
Y_var = "A198new"
D_var = "sd_EE"
X_vars = ["v104_ee", "settlement_ee", "polhierarchies_ee", "loggdp"]
Y = gn[Y_var].astype(float).to_numpy()
D = gn[D_var].astype(float).to_numpy()
X = gn[X_vars].astype(float).to_numpy()
# Set seed for replicability.
np.random.seed(12345)
# Step 1: Define the three candidate learners — OLS, CV-Lasso, RandomForest.
learners = {
"OLS": lambda: LinearRegression(),
"Lasso": lambda: LassoCV(cv=5, random_state=12345, max_iter=10000),
"RF": lambda: RandomForestRegressor(random_state=12345),
}
# Step 2: 10-fold OOF predictions, same fold split for every learner.
kf = KFold(n_splits=10, shuffle=True, random_state=12345)
def cv_predict(make_estimator, target):
preds = np.empty_like(target, dtype=float)
for tr, te in kf.split(X):
est = make_estimator()
est.fit(X[tr], target[tr])
preds[te] = est.predict(X[te])
return preds
Yhat = {name: cv_predict(make, Y) for name, make in learners.items()}
Dhat = {name: cv_predict(make, D) for name, make in learners.items()}
# Step 3: Short-stack via sum-to-one constrained NNLS on the OOF predictions.
Yhat_mat = np.column_stack([Yhat["OLS"], Yhat["Lasso"], Yhat["RF"]])
Dhat_mat = np.column_stack([Dhat["OLS"], Dhat["Lasso"], Dhat["RF"]])
w_Y = stack_nnls1(Yhat_mat, Y)
w_D = stack_nnls1(Dhat_mat, D)
Yhat_ss = Yhat_mat @ w_Y
Dhat_ss = Dhat_mat @ w_D
# Step 4: Estimate theta from the residualised OLS, HC1 robust SE.
resY = Y - Yhat_ss
resD = D - Dhat_ss
fit = sm.OLS(resY, sm.add_constant(resD)).fit().get_robustcov_results(cov_type="HC1")
print(f"theta = {fit.params[1]:.4f} (SE {fit.bse[1]:.4f})")
Output
Model: partial, crossfit folds k=10, resamples r=1
Mata global (mname): m0
Dependent variable (Y): A198new
A198new learners: Y1_pystacked
D equations (1): sd_EE
sd_EE learners: D1_pystacked
DDML estimation results:
spec r Y learner D learner b SE
ss 1 [shortstack] [ss] -2.338 (0.805)
Shortstack DDML model
y-E[y|X] = y-Y_A198new_ss_1 Number of obs = 74
D-E[D|X] = D-D_sd_EE_ss_1
------------------------------------------------------------------------------
| Robust
A198new | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
sd_EE | -2.338472 .8052145 -2.90 0.004 -3.916663 -.7602802
_cons | -.0069149 .0521399 -0.13 0.894 -.1091072 .0952774
------------------------------------------------------------------------------
Stacking final estimator: nnls1
DDML estimation: Partially Linear Model
Obs: 74 Folds: 10 Stacking: short-stack
Estimate Std. Error z value Pr(>|z|)
D1 -2.06604 0.85642 -2.41 0.016 *
(Intercept) 0.00281 0.05339 0.05 0.958
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
theta = -2.3267 (SE 0.7718)
Typically the DML short-stacked results are quite close to the original linear specification used by GN. A natural interpretation is that the GN results still stand in this robustness check. This does not mean, however, that the linear specification was the “right” choice.
The stacking weights (shown below) provide insights on the relative importance of linear vs non-linear learners: Depending on the randomization in the cross-fit split and the learners, the short-stacking weights typically put a low weight on unregularized OLS. The random forest learner also typically get a substantial weight in both of the conditional expectation estimates. This suggests that some nonlinearity is present, but that the assumption of linearity does not substantially affect the results.
Code
// Display the short-stack weights.
ddml extract, show(ssweights)
# Display short-stack weights from the single-split fit.
print(ddml_fit$ensemble_weights)
# w_Y, w_D are the NNLS weights from the hand-rolled short-stack above.
print("Short-stack NNLS weights (Y eq): "
f"OLS={w_Y[0]:.3f}, Lasso={w_Y[1]:.3f}, RF={w_Y[2]:.3f}")
print("Short-stack NNLS weights (D eq): "
f"OLS={w_D[0]:.3f}, Lasso={w_D[1]:.3f}, RF={w_D[2]:.3f}")
Output
short-stacked weights across resamples for A198new
final stacking estimator: nnls1
learner mean_weight rep_1
ols 1 0 0
lassocv 2 .30571302 .30571302
rf 3 .69428698 .69428698
short-stacked weights across resamples for sd_EE
final stacking estimator: nnls1
learner mean_weight rep_1
ols 1 2.901e-18 2.901e-18
lassocv 2 .36439938 .36439938
rf 3 .63560062 .63560062
$y_X
nnls1
[1,] -4.790797e-17
[2,] 2.209719e-01
[3,] 7.790281e-01
$D1_X
nnls1
[1,] -2.759759e-17
[2,] 2.180738e-01
[3,] 7.819262e-01
Short-stack NNLS weights (Y eq): OLS=0.000, Lasso=0.525, RF=0.475
Short-stack NNLS weights (D eq): OLS=0.000, Lasso=0.335, RF=0.665
Results by learner
We now inspect the DML results by learner. We use the same learner for both estimating the outcome and treatment equation. But we could also consider combinations where we use different learners by equation.
Estimation table
Code
// ddml creates conditional expectations for each learner.
// With pystacked, the learners are numbered.
// The final "_1" identifies the resample (the single cross-fit split).
// Globals used by the residual-construction lines below.
global Y A198new
global D sd_EE
// Create residualized Y and D for each learner.
cap drop Y_L1_r Y_L2_r Y_L3_r Y_ss_r D_L1_r D_L2_r D_L3_r D_ss_r
gen Y_L1_r = $Y-Y1_pystacked_L1_1
gen Y_L2_r = $Y-Y1_pystacked_L2_1
gen Y_L3_r = $Y-Y1_pystacked_L3_1
gen Y_ss_r = $Y-Y_A198new_ss_1
gen D_L1_r = $D-D1_pystacked_L1_1
gen D_L2_r = $D-D1_pystacked_L2_1
gen D_L3_r = $D-D1_pystacked_L3_1
gen D_ss_r = $D-D_sd_EE_ss_1
// Per-learner residual regressions; store for the comparison table.
qui reg Y_L1_r D_L1_r, robust
est store ddml_L1_L1, title("OLS learners")
qui reg Y_L2_r D_L2_r, robust
est store ddml_L2_L2, title("Lasso learners")
qui reg Y_L3_r D_L3_r, robust
est store ddml_L3_L3, title("RF learners")
qui reg Y_ss_r D_ss_r, robust
est store ddml_SS_SS, title("SS learners")
// Compare DML estimations for learner combinations:
// nb: Code uses estout by Ben Jann; install from SSC Archives.
set linesize 200
estout ddml_L1_L1 ddml_L2_L2 ddml_L3_L3 ddml_SS_SS, ///
label modelwidth(15) collabels(none) ///
rename (D_L1_r D_r D_L2_r D_r D_L3_r D_r D_ss_r D_r) ///
cells(b(fmt(3)) se(par fmt(3))) ///
varlabels(_cons Constant)
# The ddml_plm fit object exposes per-learner OOF residuals via
# fit$fitted$y_X$cf_resid_bylearner (n x #learners)
# fit$fitted$D1_X$cf_resid_bylearner
# and the short-stacked predictions via $cf_fitted.
Y_resids <- list(
OLS = ddml_fit$fitted$y_X$cf_resid_bylearner[, 1],
Lasso = ddml_fit$fitted$y_X$cf_resid_bylearner[, 2],
RF = ddml_fit$fitted$y_X$cf_resid_bylearner[, 3],
SS = as.numeric(Y - ddml_fit$fitted$y_X$cf_fitted[, "nnls1"])
)
D_resids <- list(
OLS = ddml_fit$fitted$D1_X$cf_resid_bylearner[, 1],
Lasso = ddml_fit$fitted$D1_X$cf_resid_bylearner[, 2],
RF = ddml_fit$fitted$D1_X$cf_resid_bylearner[, 3],
SS = as.numeric(D - ddml_fit$fitted$D1_X$cf_fitted[, "nnls1"])
)
# Comparison table: HC1 robust SE per learner
build_row <- function(key) {
est <- lm(Y_resids[[key]] ~ D_resids[[key]])
ct <- coeftest(est, vcov = vcovHC(est, type = "HC1"))
data.frame(
spec = key,
b = unname(ct[2, "Estimate"]),
se_d = unname(ct[2, "Std. Error"]),
intercept = unname(ct[1, "Estimate"]),
se_int = unname(ct[1, "Std. Error"])
)
}
tab <- do.call(rbind, lapply(c("OLS", "Lasso", "RF", "SS"), build_row))
print(tab, row.names = FALSE)
Yhat["SS"] = Yhat_mat @ w_Y
Dhat["SS"] = Dhat_mat @ w_D
resY_dict = {k: Y - Yhat[k] for k in Yhat}
resD_dict = {k: D - Dhat[k] for k in Dhat}
rows = []
for key in ["OLS", "Lasso", "RF", "SS"]:
Xc = sm.add_constant(resD_dict[key])
f_k = sm.OLS(resY_dict[key], Xc).fit().get_robustcov_results(cov_type="HC1")
rows.append({"spec": key,
"b": f_k.params[1], "se_d": f_k.bse[1],
"intercept": f_k.params[0], "se_int": f_k.bse[0]})
tbl = pd.DataFrame(rows)
print(tbl.to_string(index=False))
Output
------------------------------------------------------------------------------------
OLS learners Lasso learners RF learners SS learners
------------------------------------------------------------------------------------
D_r -1.866 -1.892 -2.265 -2.338
(0.705) (0.694) (0.741) (0.805)
Constant 0.002 0.003 -0.011 -0.007
(0.056) (0.055) (0.053) (0.052)
------------------------------------------------------------------------------------
DML estimates by learner (resid_Y ~ resid_D, HC1 robust SE):
spec b se_d intercept se_int
OLS -1.744850 0.7226074 -0.010256185 0.05786292
Lasso -1.713238 0.7147755 0.001572231 0.05691008
RF -2.126607 0.8484537 0.002987867 0.05330342
SS -2.066040 0.8564242 0.002813695 0.05338584
DML estimates by learner (resid_Y ~ resid_D, HC1 robust SE):
spec b se_d intercept se_int
OLS -1.851188 0.684268 -0.005177 0.054723
Lasso -1.871904 0.689664 -0.004357 0.053601
RF -2.353444 0.740676 -0.003409 0.052612
SS -2.326678 0.771804 -0.004310 0.051630
Scatterplots
The GN dataset is very small. This makes it easy to visualize the conditional relationship by plotting the residualized outcome and residualized treatment, where residualization uses one of the learners.
Code
// Restore each stored estimate so predict() uses the right regression,
// then plot residualized Y vs residualized D with the fitted line.
estimates restore ddml_L1_L1
cap drop hat_1_1
predict hat_1_1
twoway (scatter Y_L1_r D_L1_r, msymbol(o) mlabel(isocode)) ///
(line hat_1_1 D_L1_r, sort lwidth(thick)), ///
legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid) ///
ylabel(-2 -1 0 1 2, nogrid) ///
title("Y and D: Unregularized OLS") ///
ytitle("Importance of tradition") ///
xtitle("Climatic instability") name(OLS, replace)
estimates restore ddml_L2_L2
cap drop hat_2_2
predict hat_2_2
twoway (scatter Y_L2_r D_L2_r, msymbol(o) mlabel(isocode)) ///
(line hat_2_2 D_L2_r, sort lwidth(thick)), ///
legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid) ///
ylabel(-2 -1 0 1 2, nogrid) ///
title("Y and D: Lasso") ///
ytitle("Importance of tradition") ///
xtitle("Climatic instability") name(Lasso, replace)
estimates restore ddml_L3_L3
cap drop hat_3_3
predict hat_3_3
twoway (scatter Y_L3_r D_L3_r, msymbol(o) mlabel(isocode)) ///
(line hat_3_3 D_L3_r, sort lwidth(thick)), ///
legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid) ///
ylabel(-2 -1 0 1 2, nogrid) ///
title("Y and D: Random Forest") ///
ytitle("Importance of tradition") ///
xtitle("Climatic instability") name(RF, replace)
estimates restore ddml_SS_SS
cap drop hat_ss_ss
predict hat_ss_ss
twoway (scatter Y_ss_r D_ss_r, msymbol(o) mlabel(isocode)) ///
(line hat_ss_ss D_ss_r, sort lwidth(thick)), ///
legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid) ///
ylabel(-2 -1 0 1 2, nogrid) ///
title("Y and D: Short-stacked") ///
ytitle("Importance of tradition") ///
xtitle("Climatic instability") name(SS, replace)
graph combine OLS Lasso RF SS
library(patchwork)
iso <- gn$isocode
panel_titles <- c(OLS = "Y and D: Unregularized OLS",
Lasso = "Y and D: Lasso",
RF = "Y and D: Random Forest",
SS = "Y and D: Short-stacked")
make_panel <- function(key) {
df <- data.frame(y = Y_resids[[key]], d = D_resids[[key]], iso = iso)
est <- lm(y ~ d, data = df)
df$hat <- fitted(est)
ggplot(df, aes(x = d, y = y, label = iso)) +
geom_point(color = "#1f77b4", size = 1.6) +
geom_text(color = "#1f77b4", vjust = -0.6, size = 2.6) +
geom_line(aes(y = hat), linewidth = 1.0, color = "#B1004F") +
scale_x_continuous(breaks = seq(-0.3, 0.3, 0.1), limits = c(-0.3, 0.3)) +
scale_y_continuous(breaks = seq(-2, 2, 1), limits = c(-2, 2)) +
labs(title = panel_titles[[key]],
x = "Climatic instability",
y = "Importance of tradition") +
theme_minimal() +
theme(panel.grid = element_blank(),
plot.title = element_text(size = 10, face = "bold"),
axis.title = element_text(size = 9),
axis.text = element_text(size = 8))
}
p_all <- (make_panel("OLS") | make_panel("Lasso")) /
(make_panel("RF") | make_panel("SS"))
print(p_all)
import matplotlib.pyplot as plt
iso = gn["isocode"].to_numpy()
fig, axes = plt.subplots(2, 2, figsize=(9, 7))
panels = [
("OLS", "Y and D: Unregularized OLS"),
("Lasso", "Y and D: Lasso"),
("RF", "Y and D: Random Forest"),
("SS", "Y and D: Short-stacked"),
]
for ax, (key, title) in zip(axes.flat, panels):
rY, rD = resY_dict[key], resD_dict[key]
ax.scatter(rD, rY, color="#1f77b4", s=14)
for x, y, lab in zip(rD, rY, iso):
ax.text(x, y, str(lab), color="#1f77b4",
fontsize=7, va="bottom", ha="center")
Xc = sm.add_constant(rD)
f_k = sm.OLS(rY, Xc).fit()
xs = np.linspace(rD.min(), rD.max(), 50)
ax.plot(xs, f_k.params[0] + f_k.params[1] * xs,
color="#B1004F", linewidth=1.5)
ax.set_xlim(-0.3, 0.3); ax.set_ylim(-2, 2)
ax.set_title(title, fontsize=10, fontweight="bold")
ax.set_xlabel("Climatic instability", fontsize=9)
ax.set_ylabel("Importance of tradition", fontsize=9)
ax.grid(False)
plt.tight_layout()
plt.show()
Figure



Final model
The procedure above is suitable for “work-in-progress”.
For “final” results (e.g., for publication), a researcher should:
- Set the random number seed(s) for replicability.
- Consider using multiple cross-fit splits and aggregate them.
- Check if results are robust to the number of cross-fitting folds.
- Validate the choice of machine learner, e.g., by inspecting cross-validate loss measures or through model averaging approaches such as short-stacking.1
Cross-fitting introduces randomness by using a random split into cross-fit folds. A straightfoward way to reduce the sensitivity of the results to the split is to re-estimate using different cross-fit splits and aggregate the results. Aggregation can use the median or the mean.
The example below illustrates.
Code
// "Final" results using ddml (multiple resamples, median aggregation)
// 1. Set seed for replicability.
// 2. Use 11 separate cross-fit splits (reps = 11).
// 3. Use 10-fold cross-fitting in each rep.
// 4. Short-stacking via nnls1 in each rep (automatic in ddml).
// 5. Aggregate coefficients (mean and median).
// 6. Report stacking weights across reps.
// Reload data for fresh start
use "https://dmlguide.github.io/assets/dta/GN2021.dta", clear
gen sd_EE = sd_of_gen_mean_temp_anom_EE
drop if missing(loggdp)
// Step 1: Set seed for replicability
set seed 42
// Step 2: Initialize ddml with 10 folds and 11 reps
ddml init partial, kfolds(10) reps(11)
// Specify the model components
ddml E[Y|X]: pystacked `Y_var' `X_vars', ///
type(reg) methods(ols lassocv rf)
ddml E[D|X]: pystacked `D_var' `X_vars', ///
type(reg) methods(ols lassocv rf)
// Step 3: Cross-fit with short-stacking across all 11 reps
ddml crossfit, shortstack nostdstack
// Step 4: Estimate with median aggregation
ddml estimate, robust rep(median)
# "Final" results using ddml (multiple resamples, median aggregation)
# 1. Set seed for replicability.
# 2. Use 11 separate cross-fit splits (reps = 11).
# 3. Use 10-fold cross-fitting in each rep.
# 4. Short-stacking via nnls1 in each rep (automatic in ddml).
# 5. Aggregate coefficients (mean and median).
# 6. Report stacking weights across reps.
# Reload data for fresh start
gn <- read_dta("https://dmlguide.github.io/assets/dta/GN2021.dta")
gn$sd_EE <- gn$sd_of_gen_mean_temp_anom_EE
gn <- gn |> filter(!is.na(loggdp))
Y <- gn[[Y_var]]
D <- gn[[D_var]]
X <- as.matrix(gn[, X_vars])
# Step 1: Set seed for replicability
set.seed(42)
# Step 2: Run ddml_plm 11 times with different random splits
# Note: the R ddml package does not have a built-in reps() argument,
# so we loop manually and aggregate following Chernozhukov et al.
K <- 10
R <- 11
coef_vec <- numeric(R)
se_vec <- numeric(R)
weights_Y <- matrix(NA_real_, nrow = 3, ncol = R)
weights_D <- matrix(NA_real_, nrow = 3, ncol = R)
for (r in 1:R) {
cat(sprintf("Processing resample %d of %d...\n", r, R))
fit_r <- ddml_plm(
y = Y,
D = D,
X = X,
learners = learners,
sample_folds = 10,
ensemble_type = "nnls1",
shortstack = TRUE,
cv_folds = 5,
silent = TRUE
)
# Residualise Y and D using the short-stacked predictions, then
# estimate theta and its HC1-robust SE from the standard PLR moment.
resY <- as.numeric(Y - fit_r$fitted$y_X$cf_fitted[, "nnls1"])
resD <- as.numeric(D - fit_r$fitted$D1_X$cf_fitted[, "nnls1"])
fit_resid <- lm(resY ~ resD)
coef_vec[r] <- coef(fit_resid)["resD"]
se_vec[r] <- sqrt(diag(vcovHC(fit_resid, type = "HC1")))["resD"]
# Extract stacking weights (current ddml uses $ensemble_weights)
weights_Y[, r] <- fit_r$ensemble_weights$y_X[, "nnls1"]
weights_D[, r] <- fit_r$ensemble_weights$D1_X[, "nnls1"]
}
# Aggregate following DoubleML's median-of-CI formula
cv <- 1.96
b_med <- median(coef_vec)
se_med <- (median(coef_vec + cv*se_vec) - median(coef_vec - cv*se_vec)) / (2*cv)
b_mean <- mean(coef_vec)
se_mean <- mean(se_vec)
agg_summary <- data.frame(
spec = c("shortstack mean", "shortstack median"),
b = c(b_mean, b_med),
se = c(se_mean, se_med)
)
cat("\nMean/median aggregated short-stack results:\n")
print(agg_summary, row.names = FALSE)
# "Final" results: 11 cross-fit splits with sum-to-one NNLS short-stacking
# (OLS + CV-Lasso + RF) inside each split, then median aggregation
# following Chernozhukov et al. (2018):
# se_med = sqrt( median( se^2 + (theta - theta_med)^2 ) )
# Reload data for a fresh start.
gn = pd.read_stata("https://dmlguide.github.io/assets/dta/GN2021.dta")
gn["sd_EE"] = gn["sd_of_gen_mean_temp_anom_EE"]
gn = gn.dropna(subset=["loggdp"]).copy().reset_index(drop=True)
Y = gn[Y_var].astype(float).to_numpy()
D = gn[D_var].astype(float).to_numpy()
X = gn[X_vars].astype(float).to_numpy()
np.random.seed(42)
R = 11
def make_learners(seed):
return {
"OLS": LinearRegression(),
"Lasso": LassoCV(cv=5, random_state=seed, max_iter=10000),
"RF": RandomForestRegressor(random_state=seed),
}
def cv_predict_full(estimator, target, kf):
preds = np.empty_like(target, dtype=float)
for tr, te in kf.split(X):
est_clone = type(estimator)(**estimator.get_params())
est_clone.fit(X[tr], target[tr])
preds[te] = est_clone.predict(X[te])
return preds
coef_vec = np.empty(R)
se_vec = np.empty(R)
wY_all = np.zeros((3, R))
wD_all = np.zeros((3, R))
for r in range(R):
seed_r = np.random.randint(0, 2**31 - 1)
kf_r = KFold(n_splits=10, shuffle=True, random_state=seed_r)
lrn = make_learners(seed_r)
Yhat_r = {k: cv_predict_full(est, Y, kf_r) for k, est in lrn.items()}
Dhat_r = {k: cv_predict_full(est, D, kf_r) for k, est in lrn.items()}
Yhat_mat = np.column_stack([Yhat_r["OLS"], Yhat_r["Lasso"], Yhat_r["RF"]])
Dhat_mat = np.column_stack([Dhat_r["OLS"], Dhat_r["Lasso"], Dhat_r["RF"]])
w_Y = stack_nnls1(Yhat_mat, Y)
w_D = stack_nnls1(Dhat_mat, D)
resY = Y - Yhat_mat @ w_Y
resD = D - Dhat_mat @ w_D
fit_r = sm.OLS(resY, sm.add_constant(resD)).fit().get_robustcov_results(cov_type="HC1")
coef_vec[r] = fit_r.params[1]
se_vec[r] = fit_r.bse[1]
wY_all[:, r] = w_Y
wD_all[:, r] = w_D
# Chernozhukov median aggregation
theta_med = float(np.median(coef_vec))
se_med = float(np.sqrt(np.median(se_vec**2 + (coef_vec - theta_med)**2)))
theta_mean, se_mean = float(np.mean(coef_vec)), float(np.mean(se_vec))
agg = pd.DataFrame({"spec": ["shortstack mean", "shortstack median"],
"b": [theta_mean, theta_med],
"se": [se_mean, se_med]})
print("\nMean/median aggregated short-stack results:")
print(agg.to_string(index=False))
Output
Model: partial, crossfit folds k=10, resamples r=11
Mata global (mname): m0
Dependent variable (Y): A198new
A198new learners: Y1_pystacked
D equations (1): sd_EE
sd_EE learners: D1_pystacked
DDML estimation results:
spec r Y learner D learner b SE
ss 1 [shortstack] [ss] -2.326 (0.802)
ss 2 [shortstack] [ss] -1.907 (0.769)
ss 3 [shortstack] [ss] -2.141 (0.723)
ss 4 [shortstack] [ss] -2.058 (0.790)
ss 5 [shortstack] [ss] -2.265 (0.797)
ss 6 [shortstack] [ss] -2.108 (0.764)
ss 7 [shortstack] [ss] -2.252 (0.735)
ss 8 [shortstack] [ss] -1.892 (0.733)
ss 9 [shortstack] [ss] -1.941 (0.749)
ss 10 [shortstack] [ss] -2.231 (0.735)
ss 11 [shortstack] [ss] -1.911 (0.763)
Mean/med Y learner D learner b SE
ss mn [shortstack] [ss] -2.094 (0.774)
ss md [shortstack] [ss] -2.108 (0.767)
Shortstack DDML model (median over 11 resamples)
y-E[y|X] = y-Y_A198new_ss Number of obs = 74
D-E[D|X] = D-D_sd_EE_ss
------------------------------------------------------------------------------
| Robust
A198new | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
sd_EE | -2.108401 .7674957 -2.75 0.006 -3.612665 -.6041373
------------------------------------------------------------------------------
Stacking final estimator: nnls1
Summary over 11 resamples:
D eqn mean min p25 p50 p75 max
sd_EE -2.0937 -2.3256 -2.2521 -2.1084 -1.9105 -1.8915
Mean/median aggregated short-stack results:
spec b se
shortstack mean -2.066484 0.7916755
shortstack median -2.080807 0.7847372
Mean/median aggregated short-stack results:
spec b se
shortstack mean -2.174145 0.771204
shortstack median -2.180015 0.766703
The median results from the 11 DML estimations are again similar to the original GN results. The conclusion is again that their specification is robust to dropping the linearity assumption.
Note that the variation across the 11 refits introduced by the randomization of the cross-fit split is not trivial - the DML coefficient estimates range from about -1.7 to about -2.3 - but not enough to overturn the conclusion above. Increasing the number of refits is worth considering here.
It’s also still the case that there is evidence of nonlinearity: all 3 stacking procedures tend to put a low weight on unregularized OLS. Now the nonlinear random forest learner typically gets the biggest weight. Again, this is evidence of some nonlinearity, but not enough to overturn the OLS-based results. (Below we only show the short-stacking weights for the sake of brevity.)
Output: short-stack weights across the 11 resamples
short-stacked weights across resamples for A198new
final stacking estimator: nnls1
learner mean_weight rep_1 rep_2 rep_3 rep_4 rep_5 rep_6 rep_7 rep_8 rep_9 rep_10 rep_11
ols 1 .27806656 .25021557 .44605087 .15316251 .68824075 0 6.253e-18 .11562621 .08584223 .34507732 .49642475 .47809191
lassocv 2 .15544023 3.903e-18 .112465 .31531427 9.758e-19 .4976324 .31023966 .17962807 .28355977 0 2.602e-18 .01100336
rf 3 .56649321 .74978443 .44148413 .53152322 .31175925 .5023676 .68976034 .70474572 .630598 .65492268 .50357525 .51090473
short-stacked weights across resamples for sd_EE
final stacking estimator: nnls1
learner mean_weight rep_1 rep_2 rep_3 rep_4 rep_5 rep_6 rep_7 rep_8 rep_9 rep_10 rep_11
ols 1 2.650e-18 8.376e-18 4.498e-18 0 3.175e-18 0 2.129e-20 5.571e-18 7.513e-18 0 0 0
lassocv 2 .3317023 .37136171 .3452035 .32768063 .27627495 .3283087 .38245885 .40774363 .37568257 .26729688 .30843848 .25827544
rf 3 .6682977 .62863829 .6547965 .67231937 .72372505 .6716913 .61754115 .59225637 .62431743 .73270312 .69156152 .74172456
short-stacked weights across resamples for A198new (11 reps)
learner mean_weight rep_1 rep_2 rep_3 rep_4 rep_5 rep_6
ols ols 0.1175242 0.0000000 0.1152903 0.0000000 0.00000 0.0000000 0.2886355
lassocv lassocv 0.1290396 0.3513153 0.0000000 0.4379394 0.00000 0.0000000 0.0000000
rf rf 0.7534362 0.1435329 0.1280935 0.1165952 0.22433 0.0000000 0.3811269
rep_7 rep_8 rep_9 rep_10 rep_11
ols 0.0775156 0.2935283 0.8564671 0.8719065 0.7547676
lassocv 0.1286372 0.7756691 0.8847097 0.6188731 0.9743389
rf 0.0256611 0.6486847 0.7113645 0.4845450 0.7064717
short-stacked weights across resamples for sd_EE (11 reps)
learner mean_weight rep_1 rep_2 rep_3 rep_4 rep_5 rep_6
ols ols 0.01031469 0.0000000 0.0000000 0.0000000 0.0000000 0.0963622 0.0833967
lassocv lassocv 0.15198133 0.0000000 0.1134616 0.0000000 0.0000000 0.1697869 0.1144639
rf rf 0.83770398 0.0000000 0.0000000 0.0000000 0.1445204 0.1756197 0.2032019
rep_7 rep_8 rep_9 rep_10 rep_11
ols 0.1236532 0.1996723 0.8302131 0.8855361 0.8000375
lassocv 0.1999625 0.8554796 0.8243803 0.7967981 0.8388451
rf 0.1611549 0.9036378 0.8031417 0.8763468 0.8003277
short-stack NNLS weights across resamples for A198new (11 reps)
mean_weight rep_1 rep_2 rep_3 rep_4 rep_5 rep_6 rep_7 rep_8 rep_9 rep_10 rep_11
OLS 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Lasso 0.4637 0.5669 0.3860 0.2935 0.3451 0.4739 0.6330 0.4728 0.5867 0.3705 0.3524 0.6203
RF 0.5363 0.4331 0.6140 0.7065 0.6549 0.5261 0.3670 0.5272 0.4133 0.6295 0.6476 0.3797
short-stack NNLS weights across resamples for sd_EE (11 reps)
mean_weight rep_1 rep_2 rep_3 rep_4 rep_5 rep_6 rep_7 rep_8 rep_9 rep_10 rep_11
OLS 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Lasso 0.3369 0.3817 0.3668 0.4787 0.3582 0.3828 0.3059 0.2707 0.1738 0.2787 0.4149 0.2942
RF 0.6631 0.6183 0.6332 0.5213 0.6418 0.6172 0.6941 0.7293 0.8262 0.7213 0.5851 0.7058
-
Short-stacking stacking is computationally appealing but there are other options. Standard stacking - stacking separately for each cross-fit estimation - is also a possibility. “Pooled stacking” is similar to standard stacking except that the weights for combining learners are based on the OOS predictions for the entire sample (rather than for each cross-fit fold separately). See the discussion in Ahrens et al. (2025). ↩