Understanding Cultural Persistence and Change

In this example, we illustrate DML estimation of the partially linear model using a simple empirical example. The application is drawn from Giuliano and Nunn (2021, henceforth GN), who look at the relationship between climate instability and cultural persistence using a number of different datasets.

GN draw on evolutionary anthropology to argue that when the environment changes slowly across generations, traditions inherited from one’s ancestors carry information that is still useful today, so societies place greater weight on maintaining them. When the environment is volatile, inherited customs are less reliable guides and tradition matters less. Using paleoclimatic temperature and drought data covering 500–1900 CE, GN construct a measure of cross-generational climatic instability — defined as the the standard deviation of 20-year climate averages across 70 generations — and link it to ancestral groups via their pre-industrial locations.

Estimation Strategy

Their first estimation strategy uses cross-country regressions where the dependent variable ($Y$) is a measure of the importance of tradition taken from the World Values Survey, and the causal variable of interest ($D$) is a measure of ancestral climatic instability. The dataset is quite small: only 74 countries.

The example is useful for illustrating how DML works for several reasons: the dataset is small and available online, visualization is easy, reproduction of results is straightforward, and the example shows how DML can be used as a robustness check even in the simplest of settings. The GN dataset used for this demonstration is available here.

GN are concerned about omitted confounders, and include 4 controls to address the issue. The outcome, treatment and control variables are:

Variable Description
A198new The outcome variable of interest: country-level average of the self-reported importance of tradition. Ranges from 1 to 6 (bigger=more important).
sd_EE The causal variable of interest: a measure of ancestral climatic instability (standard deviation of temperature anomaly measure across generations; see GN for details).
v104_ee Control #1: distance from the equator.
settlement_ee Control #2: early economic development (proxied by complexity of settlements).
polhierarchies_ee Control #3: political hierarchies (a measure of political centralization).
loggdp Control #4: log GDP per capita in the country of origin at the time of the survey.

Their model with controls is one where the controls enter linearly and is estimated using OLS. The effect of climatic instability on the importance of tradition is negative, with a coefficient that is different from zero at conventional significance levels.

Replication

We start by reproducing GN’s Figure 5 — a simple scatterplot of the bivariate relationship — and then add the four control variables to reproduce column (4).

GN’s Figure 5 visualizes the unconditional relationship between climatic instability and the importance of tradition. The fitted line is the OLS coefficient from column (3): $-1.92$ ($s.e.=0.523$).

Code
// Figure 5 visualises the bivariate (no-controls) relationship — column (3).
reg A198new sd_EE, robust
predict hat
twoway (scatter A198new sd_EE, msymbol(circle) msize(medium)) ///
       (line hat sd_EE, sort lwidth(medium)), ///
       xlabel(0(0.25)0.5) ylabel(3(1)6) ///
       xtitle("Climatic instability") ytitle("Importance of tradition") ///
       legend(off) ///
       text(2.85 0 "(coef = -1.92, t = -3.68)", placement(east))

drop hat

# Figure 5 visualises the bivariate (no-controls) relationship — column (3).
model_bivariate <- lm(A198new ~ sd_EE, data = gn)
gn$hat <- fitted(model_bivariate)

p_fig5 <- ggplot(gn, aes(x = sd_EE, y = A198new, label = isocode)) +
  geom_point(color = "#1f77b4", size = 2) +
  geom_text(color = "#1f77b4", vjust = -0.6, size = 3) +
  geom_line(aes(y = hat), linewidth = 1.2, color = "#B1004F") +
  scale_x_continuous(breaks = c(0, 0.25, 0.5), limits = c(0, 0.5)) +
  scale_y_continuous(breaks = c(3, 4, 5, 6)) +
  coord_cartesian(ylim = c(2.8, 6)) +
  labs(x = "Climatic instability", y = "Importance of tradition") +
  theme_minimal() +
  theme(
    panel.grid      = element_blank(),
    legend.position = "none",
    axis.title      = element_text(size = 12),
    axis.text       = element_text(size = 10)
  ) +
  annotate("text", x = 0, y = 2.85,
           label = "(coef = -1.92, t = -3.68)", hjust = 0, size = 4)

print(p_fig5)
gn$hat <- NULL

# Figure 5 visualises the bivariate (no-controls) relationship — column (3).
X_bi = sm.add_constant(gn[["sd_EE"]], has_constant="add")
model_bivariate = sm.OLS(gn["A198new"].astype(float), X_bi).fit()
gn["hat"] = model_bivariate.predict(X_bi)

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(gn["sd_EE"], gn["A198new"], s=30)

for _, row in gn.iterrows():
    ax.text(row["sd_EE"], row["A198new"], str(row["isocode"]),
            fontsize=8, va="bottom", ha="center")

tmp = gn.sort_values("sd_EE")
ax.plot(tmp["sd_EE"], tmp["hat"], linewidth=2)

ax.set_xlim(0, 0.5)
ax.set_xticks([0, 0.25, 0.5])
ax.set_ylim(2.8, 6.0)
ax.set_yticks([3, 4, 5, 6])
ax.set_xlabel("Climatic instability")
ax.set_ylabel("Importance of tradition")
ax.grid(False)
ax.text(0, 2.85, "(coef = -1.92, t = -3.68)", fontsize=10, ha="left", va="bottom")

plt.tight_layout()
plt.show()
gn = gn.drop(columns=["hat"])

Figure

GN Figure 5

GN Figure 5 (R)

GN Figure 5 (Python)

Adding the four control variables yields GN’s headline column-(4) estimate of $-1.824$ ($s.e.=0.696$). They interpret the result as follows (p. 155):

Based on the estimates from column 4, a one-standard-deviation increase in cross-generational instability (0.11) is associated with a reduction in the tradition index of 1.824×0.11=0.20, which is 36% of a standard deviation of the tradition variable.

Code
clear all
set more off
set seed 42

// Load data
use "https://dmlguide.github.io/assets/dta/GN2021.dta", clear

// Shorten the name of the causal variable of interest to sd_EE
gen sd_EE = sd_of_gen_mean_temp_anom_EE

// Replicate Table 1, column (4): OLS with the 4 controls, HC1 robust SE.
// One country has loggdp missing; reg drops it automatically (N = 74).
reg A198new sd_EE v104_ee settlement_ee polhierarchies_ee loggdp, robust

set.seed(42)

library(haven)
library(dplyr)
library(sandwich)
library(lmtest)
library(ggplot2)
library(ddml)
library(ranger)
library(glmnet)

# Load data
gn <- read_dta("https://dmlguide.github.io/assets/dta/GN2021.dta")

# Shorten the name of the causal variable of interest to sd_EE
gn$sd_EE <- gn$sd_of_gen_mean_temp_anom_EE

# Replicate Table 1, column (4): OLS with the 4 controls, HC1 robust SE.
# lm() drops the country with missing loggdp automatically (N = 74).
model_full <- lm(A198new ~ sd_EE + v104_ee + settlement_ee +
                          polhierarchies_ee + loggdp, data = gn)
coeftest(model_full, vcov = vcovHC(model_full, type = "HC1"))

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load data
gn = pd.read_stata("https://dmlguide.github.io/assets/dta/GN2021.dta")

# Shorten the name of the causal variable of interest to sd_EE
gn["sd_EE"] = gn["sd_of_gen_mean_temp_anom_EE"]

# Replicate Table 1, column (4): OLS with the 4 controls, HC1 robust SE.
# Drop the country with missing loggdp (N goes from 75 to 74).
gn_full = gn.dropna(subset=["loggdp"]).copy()
controls = ["v104_ee", "settlement_ee", "polhierarchies_ee", "loggdp"]

X_full = sm.add_constant(gn_full[["sd_EE"] + controls], has_constant="add")
y_full = gn_full["A198new"].astype(float)

model_full        = sm.OLS(y_full, X_full).fit()
model_full_robust = model_full.get_robustcov_results(cov_type="HC1")

print(model_full_robust.summary())

Output
Linear regression                               Number of obs     =         74
                                                F(5, 68)          =       7.63
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3876
                                                Root MSE          =      .4485

------------------------------------------------------------------------------
             |               Robust
     A198new | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       sd_EE |  -1.823516   .6958312    -2.62   0.011    -3.212026   -.4350068
     v104_ee |   .0054164   .0053679     1.01   0.317    -.0052951    .0161279
settlemen~ee |  -.0651081   .0347562    -1.87   0.065     -.134463    .0042467
polhierar~ee |   .0127799   .0971405     0.13   0.896     -.181061    .2066208
      loggdp |  -.1648961   .0486385    -3.39   0.001    -.2619527   -.0678394
       _cons |   6.576112   .5359645    12.27   0.000     5.506611    7.645613
------------------------------------------------------------------------------

t test of coefficients:

                    Estimate Std. Error t value  Pr(>|t|)
(Intercept)        6.5761120  0.5359645 12.2697 < 2.2e-16 ***
sd_EE             -1.8235164  0.6958312 -2.6206  0.010817 *
v104_ee            0.0054164  0.0053679  1.0090  0.316534
settlement_ee     -0.0651081  0.0347562 -1.8733  0.065328 .
polhierarchies_ee  0.0127799  0.0971405  0.1316  0.895720
loggdp            -0.1648961  0.0486385 -3.3902  0.001166 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                            OLS Regression Results
==============================================================================
Dep. Variable:                A198new   R-squared:                       0.388
Model:                            OLS   Adj. R-squared:                  0.343
Method:                 Least Squares   F-statistic:                     7.626
Date:                Tue, 02 Jun 2026   Prob (F-statistic):           9.77e-06
Time:                        22:49:21   Log-Likelihood:                -42.536
No. Observations:                  74   AIC:                             97.07
Df Residuals:                      68   BIC:                             110.9
Df Model:                           5
Covariance Type:                  HC1
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 6.5761      0.536     12.270      0.000       5.507       7.646
sd_EE                -1.8235      0.696     -2.621      0.011      -3.212      -0.435
v104_ee               0.0054      0.005      1.009      0.317      -0.005       0.016
settlement_ee        -0.0651      0.035     -1.873      0.065      -0.134       0.004
polhierarchies_ee     0.0128      0.097      0.132      0.896      -0.181       0.207
loggdp               -0.1649      0.049     -3.390      0.001      -0.262      -0.068
==============================================================================
Omnibus:                        3.680   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.159   Jarque-Bera (JB):                3.823
Skew:                          -0.118   Prob(JB):                        0.148
Kurtosis:                       4.088   Cond. No.                         480.
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

DML as a robustness check

A useful robustness check is to ask whether the results with controls are sensitive to the assumption that the controls enter linearly. DML can help answer this question.

Below we estimate the model with controls using DML and the partially-linear model (PLM). The specification is still very simple: we keep the assumption that the controls enter separately from the treatment, but we drop linearity.

The estimation procedure takes the following steps:

  1. Specify the target estimand: in this case, the PLM coefficient.
  2. Specify the learners: choose the learners for the conditional expectation functions for the outcome variable and the causal variable of interest.
  3. Cross-fitting: obtain cross-fit estimates of the two conditional expectations.
  4. Estimation of structural parameter: regress the residualized outcome variable on the residualized causal variable of interest.

We have no strong priors on whether a linear or non-linear learner is more appropriate, or whether regularization is needed at all. Hence we use 3 learners in this example:

  • unregularized OLS,
  • cross-validated Lasso,
  • and a random forest.

We also use model averaging to combine the estimated conditional expectations from these 3 learners. Specifically, we use short-stacking which is a computationally cheap and fast way of pairing DML and model averaging (see Ahrens et al. (2025) for a description of the algorithm and other model averaging strategies). Also to make the example run quickly the initial estimation is done just once, i.e., we do not resample (repeat the cross-fit based on different splits).

The dataset is small, and so we choose 10-fold cross-fitting. This means that learners are trained on about 66-67 observations, and OOS predictions are obtained for the remaining 7-8 observations. (If the dataset were larger, we could also consider a smaller number of folds to save computational time.)

Code
// In the GN estimations, GDP is missing for one country.
// For simplicity, we just drop this one country so that it is never used.
drop if missing(loggdp)

// To make the code more readable, define locals for Y, D and X:
local Y_var "A198new"
local D_var "sd_EE"
local X_vars "v104_ee settlement_ee polhierarchies_ee loggdp"

// In Stata, we now use ddml with pystacked for OLS, cross-validated lasso,
// and random forest learners combined via short-stacking (nnls1).

// Step 1: Initialize ddml for partially linear model
ddml init partial, kfolds(10)

// Step 2: Specify the model components
// Outcome equation
ddml E[Y|X]: pystacked `Y_var' `X_vars', ///
    type(reg) methods(ols lassocv rf)

// Treatment equation
ddml E[D|X]: pystacked `D_var' `X_vars', ///
    type(reg) methods(ols lassocv rf)

// Step 3: Cross-fit with short-stacking
// Set seed for this single split
set seed 12345
ddml crossfit, shortstack nostdstack

// Step 4: Estimate the partially linear model
ddml estimate, robust

# In the GN estimations, GDP is missing for one country.
# For simplicity, we just drop this one country so that it is never used.
gn <- gn |> filter(!is.na(loggdp))

# To make the code more readable, define locals for Y, D and X:
Y_var  <- "A198new"
D_var  <- "sd_EE"
X_vars <- c("v104_ee", "settlement_ee", "polhierarchies_ee", "loggdp")

# In R, we now use ddml with learners for OLS, cross-validated lasso,
# and random forest combined via short-stacking (nnls1).

# Prepare data matrices
Y <- gn[[Y_var]]
D <- gn[[D_var]]
X <- as.matrix(gn[, X_vars])

# Step 1: Specify learners for partially linear model
learners <- list(
  list(fun = ols),
  list(fun = mdl_glmnet, args = list(alpha = 1, nfolds = 5)),  # lasso (5-fold CV)
  list(fun = mdl_ranger, args = list(num.trees = 500, min.node.size = 5))
)

# Step 2: Set seed for this single split
set.seed(12345)

# Step 3: Cross-fit with short-stacking
ddml_fit <- ddml_plm(
  y             = Y,
  D             = D,
  X             = X,
  learners      = learners,
  sample_folds  = 10,
  ensemble_type = "nnls1",
  shortstack    = TRUE,
  cv_folds      = 5,
  silent        = FALSE
)

# Step 4: Estimate the partially linear model
summary(ddml_fit, type = "HC1")

# DoubleML does not natively implement short-stacking, so we hand-roll
# the cross-fitting with sklearn and combine the per-learner OOF
# predictions via non-negative least squares constrained to sum to one
# (the "nnls1" stacking used by ddml / pystacked).

import numpy as np
import pandas as pd
import statsmodels.api as sm

from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from scipy.optimize import minimize


def stack_nnls1(A, b):
    """Solve min ||A w - b||^2 subject to w >= 0 and sum(w) = 1."""
    k = A.shape[1]
    res = minimize(lambda w: np.sum((A @ w - b) ** 2),
                   x0=np.full(k, 1.0 / k),
                   method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   options={"ftol": 1e-12})
    return res.x


# In the GN estimations, GDP is missing for one country.
# For simplicity, we drop this one country so that it is never used.
gn = gn.dropna(subset=["loggdp"]).copy().reset_index(drop=True)

# To make the code more readable, define variables for Y, D and X:
Y_var  = "A198new"
D_var  = "sd_EE"
X_vars = ["v104_ee", "settlement_ee", "polhierarchies_ee", "loggdp"]

Y = gn[Y_var].astype(float).to_numpy()
D = gn[D_var].astype(float).to_numpy()
X = gn[X_vars].astype(float).to_numpy()

# Set seed for replicability.
np.random.seed(12345)

# Step 1: Define the three candidate learners — OLS, CV-Lasso, RandomForest.
learners = {
    "OLS":   lambda: LinearRegression(),
    "Lasso": lambda: LassoCV(cv=5, random_state=12345, max_iter=10000),
    "RF":    lambda: RandomForestRegressor(random_state=12345),
}

# Step 2: 10-fold OOF predictions, same fold split for every learner.
kf = KFold(n_splits=10, shuffle=True, random_state=12345)


def cv_predict(make_estimator, target):
    preds = np.empty_like(target, dtype=float)
    for tr, te in kf.split(X):
        est = make_estimator()
        est.fit(X[tr], target[tr])
        preds[te] = est.predict(X[te])
    return preds


Yhat = {name: cv_predict(make, Y) for name, make in learners.items()}
Dhat = {name: cv_predict(make, D) for name, make in learners.items()}

# Step 3: Short-stack via sum-to-one constrained NNLS on the OOF predictions.
Yhat_mat = np.column_stack([Yhat["OLS"], Yhat["Lasso"], Yhat["RF"]])
Dhat_mat = np.column_stack([Dhat["OLS"], Dhat["Lasso"], Dhat["RF"]])
w_Y = stack_nnls1(Yhat_mat, Y)
w_D = stack_nnls1(Dhat_mat, D)
Yhat_ss = Yhat_mat @ w_Y
Dhat_ss = Dhat_mat @ w_D

# Step 4: Estimate theta from the residualised OLS, HC1 robust SE.
resY = Y - Yhat_ss
resD = D - Dhat_ss
fit  = sm.OLS(resY, sm.add_constant(resD)).fit().get_robustcov_results(cov_type="HC1")

print(f"theta = {fit.params[1]:.4f}  (SE {fit.bse[1]:.4f})")

Output
Model:                  partial, crossfit folds k=10, resamples r=1
Mata global (mname):    m0
Dependent variable (Y): A198new
 A198new learners:      Y1_pystacked
D equations (1):        sd_EE
 sd_EE learners:        D1_pystacked

DDML estimation results:
spec  r     Y learner     D learner         b        SE 
  ss  1  [shortstack]          [ss]    -2.338    (0.805)

Shortstack DDML model
y-E[y|X]  = y-Y_A198new_ss_1                       Number of obs   =        74
D-E[D|X]  = D-D_sd_EE_ss_1 
------------------------------------------------------------------------------
             |               Robust
     A198new | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       sd_EE |  -2.338472   .8052145    -2.90   0.004    -3.916663   -.7602802
       _cons |  -.0069149   .0521399    -0.13   0.894    -.1091072    .0952774
------------------------------------------------------------------------------
Stacking final estimator: nnls1

DDML estimation: Partially Linear Model
Obs: 74   Folds: 10  Stacking: short-stack

            Estimate Std. Error z value Pr(>|z|)
D1          -2.06604    0.85642   -2.41    0.016 *
(Intercept)  0.00281    0.05339    0.05    0.958
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

theta = -2.3267  (SE 0.7718)

Typically the DML short-stacked results are quite close to the original linear specification used by GN. A natural interpretation is that the GN results still stand in this robustness check. This does not mean, however, that the linear specification was the “right” choice.

The stacking weights (shown below) provide insights on the relative importance of linear vs non-linear learners: Depending on the randomization in the cross-fit split and the learners, the short-stacking weights typically put a low weight on unregularized OLS. The random forest learner also typically get a substantial weight in both of the conditional expectation estimates. This suggests that some nonlinearity is present, but that the assumption of linearity does not substantially affect the results.

Code
// Display the short-stack weights.
ddml extract, show(ssweights)

# Display short-stack weights from the single-split fit.
print(ddml_fit$ensemble_weights)

# w_Y, w_D are the NNLS weights from the hand-rolled short-stack above.
print("Short-stack NNLS weights (Y eq):  "
      f"OLS={w_Y[0]:.3f}, Lasso={w_Y[1]:.3f}, RF={w_Y[2]:.3f}")
print("Short-stack NNLS weights (D eq):  "
      f"OLS={w_D[0]:.3f}, Lasso={w_D[1]:.3f}, RF={w_D[2]:.3f}")

Output
short-stacked weights across resamples for A198new
final stacking estimator: nnls1
             learner  mean_weight        rep_1
    ols            1            0            0
lassocv            2    .30571302    .30571302
     rf            3    .69428698    .69428698

short-stacked weights across resamples for sd_EE
final stacking estimator: nnls1
             learner  mean_weight        rep_1
    ols            1    2.901e-18    2.901e-18
lassocv            2    .36439938    .36439938
     rf            3    .63560062    .63560062

$y_X
             nnls1
[1,] -4.790797e-17
[2,]  2.209719e-01
[3,]  7.790281e-01

$D1_X
             nnls1
[1,] -2.759759e-17
[2,]  2.180738e-01
[3,]  7.819262e-01

Short-stack NNLS weights (Y eq):  OLS=0.000, Lasso=0.525, RF=0.475
Short-stack NNLS weights (D eq):  OLS=0.000, Lasso=0.335, RF=0.665

Results by learner

We now inspect the DML results by learner. We use the same learner for both estimating the outcome and treatment equation. But we could also consider combinations where we use different learners by equation.

Estimation table

Code
// ddml creates conditional expectations for each learner.
// With pystacked, the learners are numbered.
// The final "_1" identifies the resample (the single cross-fit split).
// Globals used by the residual-construction lines below.
global Y A198new
global D sd_EE

// Create residualized Y and D for each learner.
cap drop Y_L1_r Y_L2_r Y_L3_r Y_ss_r D_L1_r D_L2_r D_L3_r D_ss_r
gen Y_L1_r = $Y-Y1_pystacked_L1_1
gen Y_L2_r = $Y-Y1_pystacked_L2_1
gen Y_L3_r = $Y-Y1_pystacked_L3_1
gen Y_ss_r = $Y-Y_A198new_ss_1
gen D_L1_r = $D-D1_pystacked_L1_1
gen D_L2_r = $D-D1_pystacked_L2_1
gen D_L3_r = $D-D1_pystacked_L3_1
gen D_ss_r = $D-D_sd_EE_ss_1

// Per-learner residual regressions; store for the comparison table.
qui reg Y_L1_r D_L1_r, robust
est store ddml_L1_L1, title("OLS learners")
qui reg Y_L2_r D_L2_r, robust
est store ddml_L2_L2, title("Lasso learners")
qui reg Y_L3_r D_L3_r, robust
est store ddml_L3_L3, title("RF learners")
qui reg Y_ss_r D_ss_r, robust
est store ddml_SS_SS, title("SS learners")

// Compare DML estimations for learner combinations:
// nb: Code uses estout by Ben Jann; install from SSC Archives.
set linesize 200
estout ddml_L1_L1 ddml_L2_L2 ddml_L3_L3 ddml_SS_SS,					///
	label modelwidth(15) collabels(none)							///
	rename (D_L1_r D_r D_L2_r D_r D_L3_r D_r D_ss_r D_r)			///
	cells(b(fmt(3)) se(par fmt(3)))									///
	varlabels(_cons Constant)

# The ddml_plm fit object exposes per-learner OOF residuals via
#   fit$fitted$y_X$cf_resid_bylearner    (n x #learners)
#   fit$fitted$D1_X$cf_resid_bylearner
# and the short-stacked predictions via $cf_fitted.

Y_resids <- list(
  OLS   = ddml_fit$fitted$y_X$cf_resid_bylearner[, 1],
  Lasso = ddml_fit$fitted$y_X$cf_resid_bylearner[, 2],
  RF    = ddml_fit$fitted$y_X$cf_resid_bylearner[, 3],
  SS    = as.numeric(Y - ddml_fit$fitted$y_X$cf_fitted[, "nnls1"])
)
D_resids <- list(
  OLS   = ddml_fit$fitted$D1_X$cf_resid_bylearner[, 1],
  Lasso = ddml_fit$fitted$D1_X$cf_resid_bylearner[, 2],
  RF    = ddml_fit$fitted$D1_X$cf_resid_bylearner[, 3],
  SS    = as.numeric(D - ddml_fit$fitted$D1_X$cf_fitted[, "nnls1"])
)

# Comparison table: HC1 robust SE per learner
build_row <- function(key) {
  est <- lm(Y_resids[[key]] ~ D_resids[[key]])
  ct  <- coeftest(est, vcov = vcovHC(est, type = "HC1"))
  data.frame(
    spec      = key,
    b         = unname(ct[2, "Estimate"]),
    se_d      = unname(ct[2, "Std. Error"]),
    intercept = unname(ct[1, "Estimate"]),
    se_int    = unname(ct[1, "Std. Error"])
  )
}
tab <- do.call(rbind, lapply(c("OLS", "Lasso", "RF", "SS"), build_row))
print(tab, row.names = FALSE)

Yhat["SS"] = Yhat_mat @ w_Y
Dhat["SS"] = Dhat_mat @ w_D

resY_dict = {k: Y - Yhat[k] for k in Yhat}
resD_dict = {k: D - Dhat[k] for k in Dhat}

rows = []
for key in ["OLS", "Lasso", "RF", "SS"]:
    Xc  = sm.add_constant(resD_dict[key])
    f_k = sm.OLS(resY_dict[key], Xc).fit().get_robustcov_results(cov_type="HC1")
    rows.append({"spec": key,
                 "b": f_k.params[1], "se_d": f_k.bse[1],
                 "intercept": f_k.params[0], "se_int": f_k.bse[0]})

tbl = pd.DataFrame(rows)
print(tbl.to_string(index=False))

Output
------------------------------------------------------------------------------------
                        OLS learners  Lasso learners     RF learners     SS learners
------------------------------------------------------------------------------------
D_r                           -1.866          -1.892          -2.265          -2.338
                             (0.705)         (0.694)         (0.741)         (0.805)
Constant                       0.002           0.003          -0.011          -0.007
                             (0.056)         (0.055)         (0.053)         (0.052)
------------------------------------------------------------------------------------

DML estimates by learner (resid_Y ~ resid_D, HC1 robust SE):
  spec         b      se_d    intercept     se_int
   OLS -1.744850 0.7226074 -0.010256185 0.05786292
 Lasso -1.713238 0.7147755  0.001572231 0.05691008
    RF -2.126607 0.8484537  0.002987867 0.05330342
    SS -2.066040 0.8564242  0.002813695 0.05338584

DML estimates by learner (resid_Y ~ resid_D, HC1 robust SE):
 spec         b     se_d  intercept   se_int
  OLS -1.851188 0.684268  -0.005177 0.054723
Lasso -1.871904 0.689664  -0.004357 0.053601
   RF -2.353444 0.740676  -0.003409 0.052612
   SS -2.326678 0.771804  -0.004310 0.051630

Scatterplots

The GN dataset is very small. This makes it easy to visualize the conditional relationship by plotting the residualized outcome and residualized treatment, where residualization uses one of the learners.

Code
// Restore each stored estimate so predict() uses the right regression,
// then plot residualized Y vs residualized D with the fitted line.
estimates restore ddml_L1_L1
cap drop hat_1_1
predict hat_1_1
twoway (scatter Y_L1_r D_L1_r, msymbol(o) mlabel(isocode))			///
	(line hat_1_1 D_L1_r, sort lwidth(thick)),					 	///
	legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid)		///
	ylabel(-2 -1 0 1 2, nogrid)										///
	title("Y and D: Unregularized OLS")								///
	ytitle("Importance of tradition")								///
	xtitle("Climatic instability") name(OLS, replace)

estimates restore ddml_L2_L2
cap drop hat_2_2
predict hat_2_2
twoway (scatter Y_L2_r D_L2_r, msymbol(o) mlabel(isocode))			///
	(line hat_2_2 D_L2_r, sort lwidth(thick)),					 	///
	legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid)		///
	ylabel(-2 -1 0 1 2, nogrid)										///
	title("Y and D: Lasso")											///
	ytitle("Importance of tradition")								///
	xtitle("Climatic instability") name(Lasso, replace)

estimates restore ddml_L3_L3
cap drop hat_3_3
predict hat_3_3
twoway (scatter Y_L3_r D_L3_r, msymbol(o) mlabel(isocode))			///
	(line hat_3_3 D_L3_r, sort lwidth(thick)),					 	///
	legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid)		///
	ylabel(-2 -1 0 1 2, nogrid)										///
	title("Y and D: Random Forest")									///
	ytitle("Importance of tradition")								///
	xtitle("Climatic instability") name(RF, replace)

estimates restore ddml_SS_SS
cap drop hat_ss_ss
predict hat_ss_ss
twoway (scatter Y_ss_r D_ss_r, msymbol(o) mlabel(isocode))			///
	(line hat_ss_ss D_ss_r, sort lwidth(thick)),				 	///
	legend(off) xlabel(-0.3 -0.2 -0.1 0 0.1 0.2 0.3, nogrid)		///
	ylabel(-2 -1 0 1 2, nogrid)										///
	title("Y and D: Short-stacked")									///
	ytitle("Importance of tradition")								///
	xtitle("Climatic instability") name(SS, replace)

graph combine OLS Lasso RF SS

library(patchwork)

iso <- gn$isocode

panel_titles <- c(OLS = "Y and D: Unregularized OLS",
                  Lasso = "Y and D: Lasso",
                  RF = "Y and D: Random Forest",
                  SS = "Y and D: Short-stacked")

make_panel <- function(key) {
  df <- data.frame(y = Y_resids[[key]], d = D_resids[[key]], iso = iso)
  est <- lm(y ~ d, data = df)
  df$hat <- fitted(est)
  ggplot(df, aes(x = d, y = y, label = iso)) +
    geom_point(color = "#1f77b4", size = 1.6) +
    geom_text(color = "#1f77b4", vjust = -0.6, size = 2.6) +
    geom_line(aes(y = hat), linewidth = 1.0, color = "#B1004F") +
    scale_x_continuous(breaks = seq(-0.3, 0.3, 0.1), limits = c(-0.3, 0.3)) +
    scale_y_continuous(breaks = seq(-2, 2, 1), limits = c(-2, 2)) +
    labs(title = panel_titles[[key]],
         x = "Climatic instability",
         y = "Importance of tradition") +
    theme_minimal() +
    theme(panel.grid = element_blank(),
          plot.title = element_text(size = 10, face = "bold"),
          axis.title = element_text(size = 9),
          axis.text  = element_text(size = 8))
}

p_all <- (make_panel("OLS") | make_panel("Lasso")) /
         (make_panel("RF")  | make_panel("SS"))
print(p_all)

import matplotlib.pyplot as plt

iso = gn["isocode"].to_numpy()

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
panels = [
    ("OLS",   "Y and D: Unregularized OLS"),
    ("Lasso", "Y and D: Lasso"),
    ("RF",    "Y and D: Random Forest"),
    ("SS",    "Y and D: Short-stacked"),
]
for ax, (key, title) in zip(axes.flat, panels):
    rY, rD = resY_dict[key], resD_dict[key]
    ax.scatter(rD, rY, color="#1f77b4", s=14)
    for x, y, lab in zip(rD, rY, iso):
        ax.text(x, y, str(lab), color="#1f77b4",
                fontsize=7, va="bottom", ha="center")
    Xc  = sm.add_constant(rD)
    f_k = sm.OLS(rY, Xc).fit()
    xs  = np.linspace(rD.min(), rD.max(), 50)
    ax.plot(xs, f_k.params[0] + f_k.params[1] * xs,
            color="#B1004F", linewidth=1.5)
    ax.set_xlim(-0.3, 0.3); ax.set_ylim(-2, 2)
    ax.set_title(title, fontsize=10, fontweight="bold")
    ax.set_xlabel("Climatic instability", fontsize=9)
    ax.set_ylabel("Importance of tradition", fontsize=9)
    ax.grid(False)

plt.tight_layout()
plt.show()

Figure

GN Four learners

GN four learners (R)

GN four learners (Python)

Final model

The procedure above is suitable for “work-in-progress”.

For “final” results (e.g., for publication), a researcher should:

  1. Set the random number seed(s) for replicability.
  2. Consider using multiple cross-fit splits and aggregate them.
  3. Check if results are robust to the number of cross-fitting folds.
  4. Validate the choice of machine learner, e.g., by inspecting cross-validate loss measures or through model averaging approaches such as short-stacking.1

Cross-fitting introduces randomness by using a random split into cross-fit folds. A straightfoward way to reduce the sensitivity of the results to the split is to re-estimate using different cross-fit splits and aggregate the results. Aggregation can use the median or the mean.

The example below illustrates.

Code
// "Final" results using ddml (multiple resamples, median aggregation)
// 1. Set seed for replicability.
// 2. Use 11 separate cross-fit splits (reps = 11).
// 3. Use 10-fold cross-fitting in each rep.
// 4. Short-stacking via nnls1 in each rep (automatic in ddml).
// 5. Aggregate coefficients (mean and median).
// 6. Report stacking weights across reps.

// Reload data for fresh start
use "https://dmlguide.github.io/assets/dta/GN2021.dta", clear
gen sd_EE = sd_of_gen_mean_temp_anom_EE
drop if missing(loggdp)

// Step 1: Set seed for replicability
set seed 42

// Step 2: Initialize ddml with 10 folds and 11 reps
ddml init partial, kfolds(10) reps(11)

// Specify the model components
ddml E[Y|X]: pystacked `Y_var' `X_vars', ///
    type(reg) methods(ols lassocv rf)

ddml E[D|X]: pystacked `D_var' `X_vars', ///
    type(reg) methods(ols lassocv rf)

// Step 3: Cross-fit with short-stacking across all 11 reps
ddml crossfit, shortstack nostdstack

// Step 4: Estimate with median aggregation
ddml estimate, robust rep(median)

# "Final" results using ddml (multiple resamples, median aggregation)
# 1. Set seed for replicability.
# 2. Use 11 separate cross-fit splits (reps = 11).
# 3. Use 10-fold cross-fitting in each rep.
# 4. Short-stacking via nnls1 in each rep (automatic in ddml).
# 5. Aggregate coefficients (mean and median).
# 6. Report stacking weights across reps.

# Reload data for fresh start
gn <- read_dta("https://dmlguide.github.io/assets/dta/GN2021.dta")
gn$sd_EE <- gn$sd_of_gen_mean_temp_anom_EE
gn <- gn |> filter(!is.na(loggdp))

Y <- gn[[Y_var]]
D <- gn[[D_var]]
X <- as.matrix(gn[, X_vars])

# Step 1: Set seed for replicability
set.seed(42)

# Step 2: Run ddml_plm 11 times with different random splits
# Note: the R ddml package does not have a built-in reps() argument,
# so we loop manually and aggregate following Chernozhukov et al.

K <- 10
R <- 11

coef_vec  <- numeric(R)
se_vec    <- numeric(R)
weights_Y <- matrix(NA_real_, nrow = 3, ncol = R)
weights_D <- matrix(NA_real_, nrow = 3, ncol = R)

for (r in 1:R) {
  cat(sprintf("Processing resample %d of %d...\n", r, R))

  fit_r <- ddml_plm(
    y             = Y,
    D             = D,
    X             = X,
    learners      = learners,
    sample_folds  = 10,
    ensemble_type = "nnls1",
    shortstack    = TRUE,
    cv_folds      = 5,
    silent        = TRUE
  )

  # Residualise Y and D using the short-stacked predictions, then
  # estimate theta and its HC1-robust SE from the standard PLR moment.
  resY <- as.numeric(Y - fit_r$fitted$y_X$cf_fitted[, "nnls1"])
  resD <- as.numeric(D - fit_r$fitted$D1_X$cf_fitted[, "nnls1"])
  fit_resid    <- lm(resY ~ resD)
  coef_vec[r]  <- coef(fit_resid)["resD"]
  se_vec[r]    <- sqrt(diag(vcovHC(fit_resid, type = "HC1")))["resD"]

  # Extract stacking weights (current ddml uses $ensemble_weights)
  weights_Y[, r] <- fit_r$ensemble_weights$y_X[, "nnls1"]
  weights_D[, r] <- fit_r$ensemble_weights$D1_X[, "nnls1"]
}

# Aggregate following DoubleML's median-of-CI formula
cv        <- 1.96
b_med     <- median(coef_vec)
se_med    <- (median(coef_vec + cv*se_vec) - median(coef_vec - cv*se_vec)) / (2*cv)
b_mean    <- mean(coef_vec)
se_mean   <- mean(se_vec)

agg_summary <- data.frame(
  spec = c("shortstack mean", "shortstack median"),
  b    = c(b_mean, b_med),
  se   = c(se_mean, se_med)
)
cat("\nMean/median aggregated short-stack results:\n")
print(agg_summary, row.names = FALSE)

# "Final" results: 11 cross-fit splits with sum-to-one NNLS short-stacking
# (OLS + CV-Lasso + RF) inside each split, then median aggregation
# following Chernozhukov et al. (2018):
#   se_med = sqrt( median( se^2 + (theta - theta_med)^2 ) )

# Reload data for a fresh start.
gn = pd.read_stata("https://dmlguide.github.io/assets/dta/GN2021.dta")
gn["sd_EE"] = gn["sd_of_gen_mean_temp_anom_EE"]
gn = gn.dropna(subset=["loggdp"]).copy().reset_index(drop=True)

Y = gn[Y_var].astype(float).to_numpy()
D = gn[D_var].astype(float).to_numpy()
X = gn[X_vars].astype(float).to_numpy()

np.random.seed(42)
R = 11


def make_learners(seed):
    return {
        "OLS":   LinearRegression(),
        "Lasso": LassoCV(cv=5, random_state=seed, max_iter=10000),
        "RF":    RandomForestRegressor(random_state=seed),
    }


def cv_predict_full(estimator, target, kf):
    preds = np.empty_like(target, dtype=float)
    for tr, te in kf.split(X):
        est_clone = type(estimator)(**estimator.get_params())
        est_clone.fit(X[tr], target[tr])
        preds[te] = est_clone.predict(X[te])
    return preds


coef_vec = np.empty(R)
se_vec   = np.empty(R)
wY_all   = np.zeros((3, R))
wD_all   = np.zeros((3, R))

for r in range(R):
    seed_r = np.random.randint(0, 2**31 - 1)
    kf_r = KFold(n_splits=10, shuffle=True, random_state=seed_r)
    lrn  = make_learners(seed_r)

    Yhat_r = {k: cv_predict_full(est, Y, kf_r) for k, est in lrn.items()}
    Dhat_r = {k: cv_predict_full(est, D, kf_r) for k, est in lrn.items()}

    Yhat_mat = np.column_stack([Yhat_r["OLS"], Yhat_r["Lasso"], Yhat_r["RF"]])
    Dhat_mat = np.column_stack([Dhat_r["OLS"], Dhat_r["Lasso"], Dhat_r["RF"]])
    w_Y = stack_nnls1(Yhat_mat, Y)
    w_D = stack_nnls1(Dhat_mat, D)

    resY = Y - Yhat_mat @ w_Y
    resD = D - Dhat_mat @ w_D
    fit_r = sm.OLS(resY, sm.add_constant(resD)).fit().get_robustcov_results(cov_type="HC1")

    coef_vec[r] = fit_r.params[1]
    se_vec[r]   = fit_r.bse[1]
    wY_all[:, r] = w_Y
    wD_all[:, r] = w_D

# Chernozhukov median aggregation
theta_med = float(np.median(coef_vec))
se_med    = float(np.sqrt(np.median(se_vec**2 + (coef_vec - theta_med)**2)))
theta_mean, se_mean = float(np.mean(coef_vec)), float(np.mean(se_vec))

agg = pd.DataFrame({"spec": ["shortstack mean", "shortstack median"],
                    "b":    [theta_mean, theta_med],
                    "se":   [se_mean,    se_med]})
print("\nMean/median aggregated short-stack results:")
print(agg.to_string(index=False))

Output
Model:                  partial, crossfit folds k=10, resamples r=11
Mata global (mname):    m0
Dependent variable (Y): A198new
 A198new learners:      Y1_pystacked
D equations (1):        sd_EE
 sd_EE learners:        D1_pystacked

DDML estimation results:
spec  r     Y learner     D learner         b        SE 
  ss  1  [shortstack]          [ss]    -2.326    (0.802)
  ss  2  [shortstack]          [ss]    -1.907    (0.769)
  ss  3  [shortstack]          [ss]    -2.141    (0.723)
  ss  4  [shortstack]          [ss]    -2.058    (0.790)
  ss  5  [shortstack]          [ss]    -2.265    (0.797)
  ss  6  [shortstack]          [ss]    -2.108    (0.764)
  ss  7  [shortstack]          [ss]    -2.252    (0.735)
  ss  8  [shortstack]          [ss]    -1.892    (0.733)
  ss  9  [shortstack]          [ss]    -1.941    (0.749)
  ss 10  [shortstack]          [ss]    -2.231    (0.735)
  ss 11  [shortstack]          [ss]    -1.911    (0.763)

Mean/med    Y learner     D learner         b        SE 
  ss mn  [shortstack]          [ss]    -2.094    (0.774)
  ss md  [shortstack]          [ss]    -2.108    (0.767)

Shortstack DDML model (median over 11 resamples)
y-E[y|X]  = y-Y_A198new_ss                         Number of obs   =        74
D-E[D|X]  = D-D_sd_EE_ss 
------------------------------------------------------------------------------
             |               Robust
     A198new | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       sd_EE |  -2.108401   .7674957    -2.75   0.006    -3.612665   -.6041373
------------------------------------------------------------------------------
Stacking final estimator: nnls1

Summary over 11 resamples:
       D eqn      mean       min       p25       p50       p75       max
       sd_EE     -2.0937   -2.3256   -2.2521   -2.1084   -1.9105   -1.8915

Mean/median aggregated short-stack results:
              spec         b        se
   shortstack mean -2.066484 0.7916755
 shortstack median -2.080807 0.7847372

Mean/median aggregated short-stack results:
             spec         b       se
  shortstack mean -2.174145 0.771204
shortstack median -2.180015 0.766703

The median results from the 11 DML estimations are again similar to the original GN results. The conclusion is again that their specification is robust to dropping the linearity assumption.

Note that the variation across the 11 refits introduced by the randomization of the cross-fit split is not trivial - the DML coefficient estimates range from about -1.7 to about -2.3 - but not enough to overturn the conclusion above. Increasing the number of refits is worth considering here.

It’s also still the case that there is evidence of nonlinearity: all 3 stacking procedures tend to put a low weight on unregularized OLS. Now the nonlinear random forest learner typically gets the biggest weight. Again, this is evidence of some nonlinearity, but not enough to overturn the OLS-based results. (Below we only show the short-stacking weights for the sake of brevity.)

Output: short-stack weights across the 11 resamples
short-stacked weights across resamples for A198new
final stacking estimator: nnls1
             learner  mean_weight        rep_1        rep_2        rep_3        rep_4        rep_5        rep_6        rep_7        rep_8        rep_9       rep_10       rep_11
    ols            1    .27806656    .25021557    .44605087    .15316251    .68824075            0    6.253e-18    .11562621    .08584223    .34507732    .49642475    .47809191
lassocv            2    .15544023    3.903e-18      .112465    .31531427    9.758e-19     .4976324    .31023966    .17962807    .28355977            0    2.602e-18    .01100336
     rf            3    .56649321    .74978443    .44148413    .53152322    .31175925     .5023676    .68976034    .70474572      .630598    .65492268    .50357525    .51090473

short-stacked weights across resamples for sd_EE
final stacking estimator: nnls1
             learner  mean_weight        rep_1        rep_2        rep_3        rep_4        rep_5        rep_6        rep_7        rep_8        rep_9       rep_10       rep_11
    ols            1    2.650e-18    8.376e-18    4.498e-18            0    3.175e-18            0    2.129e-20    5.571e-18    7.513e-18            0            0            0
lassocv            2     .3317023    .37136171     .3452035    .32768063    .27627495     .3283087    .38245885    .40774363    .37568257    .26729688    .30843848    .25827544
     rf            3     .6682977    .62863829     .6547965    .67231937    .72372505     .6716913    .61754115    .59225637    .62431743    .73270312    .69156152    .74172456

short-stacked weights across resamples for A198new (11 reps)
        learner mean_weight     rep_1     rep_2     rep_3   rep_4     rep_5     rep_6
ols         ols   0.1175242 0.0000000 0.1152903 0.0000000 0.00000 0.0000000 0.2886355
lassocv lassocv   0.1290396 0.3513153 0.0000000 0.4379394 0.00000 0.0000000 0.0000000
rf           rf   0.7534362 0.1435329 0.1280935 0.1165952 0.22433 0.0000000 0.3811269
             rep_7     rep_8     rep_9    rep_10    rep_11
ols     0.0775156 0.2935283 0.8564671 0.8719065 0.7547676
lassocv 0.1286372 0.7756691 0.8847097 0.6188731 0.9743389
rf      0.0256611 0.6486847 0.7113645 0.4845450 0.7064717

short-stacked weights across resamples for sd_EE (11 reps)
        learner mean_weight     rep_1     rep_2     rep_3     rep_4     rep_5     rep_6
ols         ols  0.01031469 0.0000000 0.0000000 0.0000000 0.0000000 0.0963622 0.0833967
lassocv lassocv  0.15198133 0.0000000 0.1134616 0.0000000 0.0000000 0.1697869 0.1144639
rf           rf  0.83770398 0.0000000 0.0000000 0.0000000 0.1445204 0.1756197 0.2032019
            rep_7     rep_8     rep_9    rep_10    rep_11
ols     0.1236532 0.1996723 0.8302131 0.8855361 0.8000375
lassocv 0.1999625 0.8554796 0.8243803 0.7967981 0.8388451
rf      0.1611549 0.9036378 0.8031417 0.8763468 0.8003277

short-stack NNLS weights across resamples for A198new (11 reps)
       mean_weight  rep_1  rep_2  rep_3  rep_4  rep_5  rep_6  rep_7  rep_8  rep_9  rep_10  rep_11
OLS         0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000  0.0000  0.0000
Lasso       0.4637 0.5669 0.3860 0.2935 0.3451 0.4739 0.6330 0.4728 0.5867 0.3705  0.3524  0.6203
RF          0.5363 0.4331 0.6140 0.7065 0.6549 0.5261 0.3670 0.5272 0.4133 0.6295  0.6476  0.3797

short-stack NNLS weights across resamples for sd_EE (11 reps)
       mean_weight  rep_1  rep_2  rep_3  rep_4  rep_5  rep_6  rep_7  rep_8  rep_9  rep_10  rep_11
OLS         0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000  0.0000  0.0000
Lasso       0.3369 0.3817 0.3668 0.4787 0.3582 0.3828 0.3059 0.2707 0.1738 0.2787  0.4149  0.2942
RF          0.6631 0.6183 0.6332 0.5213 0.6418 0.6172 0.6941 0.7293 0.8262 0.7213  0.5851  0.7058

  1. Short-stacking stacking is computationally appealing but there are other options. Standard stacking - stacking separately for each cross-fit estimation - is also a possibility. “Pooled stacking” is similar to standard stacking except that the weights for combining learners are based on the OOS predictions for the entire sample (rather than for each cross-fit fold separately). See the discussion in Ahrens et al. (2025)