Algorithm#
Implementation of Farrell-Liang-Misra influence function-based inference for neural networks.
Overview#
The package provides valid confidence intervals for neural network estimates by:
Training a structural network θ̂(x) via cross-fitting
Computing influence functions ψᵢ to correct for regularization bias
Using √n-consistent variance estimation
Core formula: $\(\psi_i = H(\hat{\theta}_i) - H_\theta \cdot \Lambda(x_i)^{-1} \cdot \ell_\theta(z_i, \hat{\theta}_i)\)$
Algorithm Flow#
For each fold k = 1, ..., K:
1. Train θ̂_k(x) on I_k^c (training set)
2. Compute scores: ℓ_θ(zᵢ, θ̂_k(xᵢ)) for i ∈ I_k
3. Compute Hessians: ℓ_θθ(zᵢ, θ̂_k(xᵢ)) for i ∈ I_k
4. Estimate Λ̂_k(x) from Hessians
5. Assemble ψᵢ = h - H_θ · Λ̂⁻¹ · ℓ_θ
Aggregate:
μ̂ = (1/n) Σᵢ ψᵢ
SE = √(Var(ψ)/n)
“Cross-fitting prevents overfitting: the model is trained on data independent of where it is evaluated. This ensures the influence function correction is unbiased.” — FLM (2021), Algorithm 1
Lambda Estimation (Critical)#
The conditional Hessian Λ(x) = E[ℓ_θθ|X=x] has three estimation regimes:
Remark 4 (FLM 2021): “The conditional Hessian Λ(x) = E[ℓ_θθ|X=x] varies with x through θ(x). When treatment is randomized, Λ can be computed analytically; otherwise it must be estimated, requiring three-way sample splitting.”
Regime A: Randomized Experiments (ComputeLambda)#
When treatment T is randomized and independent of X:
# Monte Carlo integration over treatment distribution
Λ(x) = E_T[ℓ_θθ(Y, T, θ(x)) | X=x]
≈ (1/M) Σₘ ℓ_θθ(y, tₘ, θ(x)) where tₘ ~ P(T)
Use: inference(..., is_randomized=True, treatment_dist=Normal(0, 1))
Regime B: Linear Models (AnalyticLambda)#
For linear models, Λ(x) = E[TT’|X=x] has closed form:
# No θ-dependence, so 2-way splitting suffices
Λ(x) = E[(1,T)(1,T)' | X=x]
Use: inference(..., model='linear') (auto-detected)
Regime C: Observational Nonlinear (EstimateLambda)#
For logit, Poisson, etc. in observational settings:
Method |
Correlation |
Coverage |
Speed |
Use Case |
|---|---|---|---|---|
|
0.508 |
96% |
0.08s |
Recommended (default) |
|
0.000 |
95% |
0.02s |
Ignores heterogeneity |
|
0.978 |
96% |
1.2s |
High accuracy |
|
0.904 |
~90% |
0.3s |
Moderate accuracy |
|
0.997 |
67% |
12.5s |
AVOID - invalid SEs |
Key finding: MLP achieves highest correlation (0.997) but produces invalid standard errors (67% coverage instead of 95%). Ridge is the safe default with validated 96% coverage.
Use: structural_dml(..., lambda_method='ridge') (default) or inference(..., lambda_method='lgbm')
Three-Way Splitting#
When It’s Needed#
Three-way splitting is required when Λ(x) depends on θ(x).
Two-way suffices:
Linear models (Λ = E[TT’|X], no θ-dependence)
Randomized experiments (can compute Λ analytically)
Three-way required:
Logit, Poisson, Gamma, Weibull, etc.
Hessian weights depend on θ: e.g., p(1-p) for logit where p = σ(θ’x)
“When the Hessian weight depends on θ (as in logit where w = p(1-p) and p = σ(θ’x)), separate samples are needed to estimate θ̂ and Λ̂ to avoid overfitting bias.” — FLM (2021), Section 3.2
Auto-Detection#
The package auto-detects θ-dependence:
# From core/algorithm.py
def detect_theta_dependence(family, X, T, theta):
"""Sample 100 observations, perturb θ, check if Hessian changes."""
sample_idx = np.random.choice(len(X), min(100, len(X)))
H1 = family.hessian(Y[sample_idx], T[sample_idx], theta[sample_idx])
H2 = family.hessian(Y[sample_idx], T[sample_idx], theta[sample_idx] + 0.1)
return not np.allclose(H1, H2, rtol=1e-5)
How It Works#
When three_way=True (or auto-detected):
For each fold k:
Training set I_k^c is split:
- 60% for training θ̂_k(x)
- 40% for fitting Λ̂_k(x)
This prevents overfitting: θ̂ and Λ̂ are estimated on independent data.
Regularization#
Lambda Inversion#
Inverting Λ(x) can be unstable when Hessians are near-singular. The package uses Tikhonov regularization:
# From utils/linalg.py
def batch_inverse(Lambda, ridge=1e-4):
"""Invert with ridge regularization for stability."""
Lambda_reg = Lambda + ridge * torch.eye(Lambda.shape[-1])
return torch.linalg.inv(Lambda_reg)
Default: ridge=1e-4
Diagnostics#
The package tracks min_lambda_eigenvalue:
Warning: min_lambda_eigenvalue=1.23e-06 < 1e-4
Consider increasing ridge regularization or checking model fit.
When to increase regularization:
Near-singular Hessians (min eigenvalue < 1e-4)
Logit with p ≈ 0 or p ≈ 1 (separation issues)
Overdispersed count models
Variance Estimation#
Within-Fold Formula#
# From core/algorithm.py
Ψ̂ = (1/K) Σ_k Var_k(ψ)
= (1/K) Σ_k (1/|I_k|) Σ_{i ∈ I_k} (ψᵢ - μ̂_k)²
SE = √(Ψ̂/n)
Where μ̂_k is the fold-specific mean.
Theorem 3 (FLM 2021): “The variance estimator \(\hat{V} = n^{-1}\sum_i \hat{\psi}_i^2\) is consistent: \(\hat{V} \xrightarrow{p} V\). The resulting confidence intervals achieve nominal coverage asymptotically.”
High Correction Variance Warning#
Warning: High correction variance ratio (43.96).
This suggests the influence function correction dominates the estimate variance.
Consider using more cross-fitting folds (K >= 50).
This warns when:
correction_variance / total_variance > 10
Indicates the bias correction term H_θ · Λ⁻¹ · ℓ_θ has high variance, often from:
Insufficient cross-fitting folds
Poor θ̂(x) estimation in some regions
Near-singular Lambda
Two APIs#
Legacy: structural_dml()#
For the 8 built-in GLM families:
from deep_inference import structural_dml
result = structural_dml(
Y=Y, T=T, X=X,
family='logit', # linear, logit, poisson, gamma, ...
# lambda_method='ridge' is default (validated coverage)
n_folds=50,
epochs=100,
)
New: inference()#
For flexible targets and custom models:
from deep_inference import inference
result = inference(
Y=Y, T=T, X=X,
model='logit',
target='ame', # ame, beta, or custom
# lambda_method='ridge' is default (validated coverage)
is_randomized=False,
)
Both use the same core algorithm; inference() adds:
Regime detection (A/B/C)
Custom target functions
Treatment distribution specification
Implementation Files#
File |
Purpose |
|---|---|
|
Main |
|
Modular |
|
Influence function assembly |
|
SE computation |
|
MLP/LGBM Lambda estimation |
|
Regime detection |
|
Score, Hessian, Jacobian computation |
Pseudocode (Actual Implementation)#
def structural_dml_core(Y, T, X, family, lambda_method, n_folds, ...):
# Phase 1: Create folds
kf = KFold(n_splits=n_folds, shuffle=True)
# Detect if 3-way splitting needed
three_way = detect_theta_dependence(family, X, T, initial_theta)
psi_all = []
for fold_idx, (train_idx, eval_idx) in enumerate(kf.split(X)):
# Phase 2a: Train θ̂(x)
if three_way:
theta_train = train_idx[:int(0.6 * len(train_idx))]
lambda_train = train_idx[int(0.6 * len(train_idx)):]
else:
theta_train = train_idx
lambda_train = train_idx
model = train_structural_net(X[theta_train], T[theta_train], Y[theta_train])
theta_hat = model(X[eval_idx])
# Phase 2b: Compute Hessians on lambda_train
hessians = family.hessian(Y[lambda_train], T[lambda_train], theta_hat_train)
# Phase 2c: Fit Lambda
if lambda_method == 'aggregate':
Lambda = hessians.mean(axis=0) # Single matrix for all x
else:
Lambda_model = train_lambda_estimator(X[lambda_train], hessians)
Lambda = Lambda_model(X[eval_idx])
# Phase 3: Assemble influence function
for i in eval_idx:
score = family.gradient(Y[i], T[i], theta_hat[i])
h = target.h(theta_hat[i])
h_jacobian = target.jacobian(theta_hat[i])
Lambda_inv = batch_inverse(Lambda[i], ridge=1e-4)
psi_i = h - h_jacobian @ Lambda_inv @ score
psi_all.append(psi_i)
# Phase 4: Aggregate
mu_hat = np.mean(psi_all)
se = np.sqrt(np.var(psi_all) / n)
return DMLResult(mu_hat=mu_hat, se=se, ...)
References#
Farrell, Liang, Misra (2021): “Deep Neural Networks for Estimation and Inference” Econometrica
Farrell, Liang, Misra (2025): “Deep Learning for Individual Heterogeneity” Working Paper