Algorithm#

Implementation of Farrell-Liang-Misra influence function-based inference for neural networks.


Overview#

The package provides valid confidence intervals for neural network estimates by:

  1. Training a structural network θ̂(x) via cross-fitting

  2. Computing influence functions ψᵢ to correct for regularization bias

  3. Using √n-consistent variance estimation

Core formula: $\(\psi_i = H(\hat{\theta}_i) - H_\theta \cdot \Lambda(x_i)^{-1} \cdot \ell_\theta(z_i, \hat{\theta}_i)\)$


Algorithm Flow#

For each fold k = 1, ..., K:
    1. Train θ̂_k(x) on I_k^c (training set)
    2. Compute scores: ℓ_θ(zᵢ, θ̂_k(xᵢ)) for i ∈ I_k
    3. Compute Hessians: ℓ_θθ(zᵢ, θ̂_k(xᵢ)) for i ∈ I_k
    4. Estimate Λ̂_k(x) from Hessians
    5. Assemble ψᵢ = h - H_θ · Λ̂⁻¹ · ℓ_θ

Aggregate:
    μ̂ = (1/n) Σᵢ ψᵢ
    SE = √(Var(ψ)/n)

“Cross-fitting prevents overfitting: the model is trained on data independent of where it is evaluated. This ensures the influence function correction is unbiased.” — FLM (2021), Algorithm 1


Lambda Estimation (Critical)#

The conditional Hessian Λ(x) = E[ℓ_θθ|X=x] has three estimation regimes:

Remark 4 (FLM 2021): “The conditional Hessian Λ(x) = E[ℓ_θθ|X=x] varies with x through θ(x). When treatment is randomized, Λ can be computed analytically; otherwise it must be estimated, requiring three-way sample splitting.”

Regime A: Randomized Experiments (ComputeLambda)#

When treatment T is randomized and independent of X:

# Monte Carlo integration over treatment distribution
Λ(x) = E_T[ℓ_θθ(Y, T, θ(x)) | X=x]
      (1/M) Σₘ ℓ_θθ(y, tₘ, θ(x))  where tₘ ~ P(T)

Use: inference(..., is_randomized=True, treatment_dist=Normal(0, 1))

Regime B: Linear Models (AnalyticLambda)#

For linear models, Λ(x) = E[TT’|X=x] has closed form:

# No θ-dependence, so 2-way splitting suffices
Λ(x) = E[(1,T)(1,T)' | X=x]

Use: inference(..., model='linear') (auto-detected)

Regime C: Observational Nonlinear (EstimateLambda)#

For logit, Poisson, etc. in observational settings:

Method

Correlation

Coverage

Speed

Use Case

ridge

0.508

96%

0.08s

Recommended (default)

aggregate

0.000

95%

0.02s

Ignores heterogeneity

lgbm

0.978

96%

1.2s

High accuracy

rf

0.904

~90%

0.3s

Moderate accuracy

mlp

0.997

67%

12.5s

AVOID - invalid SEs

Key finding: MLP achieves highest correlation (0.997) but produces invalid standard errors (67% coverage instead of 95%). Ridge is the safe default with validated 96% coverage.

Use: structural_dml(..., lambda_method='ridge') (default) or inference(..., lambda_method='lgbm')


Three-Way Splitting#

When It’s Needed#

Three-way splitting is required when Λ(x) depends on θ(x).

Two-way suffices:

  • Linear models (Λ = E[TT’|X], no θ-dependence)

  • Randomized experiments (can compute Λ analytically)

Three-way required:

  • Logit, Poisson, Gamma, Weibull, etc.

  • Hessian weights depend on θ: e.g., p(1-p) for logit where p = σ(θ’x)

“When the Hessian weight depends on θ (as in logit where w = p(1-p) and p = σ(θ’x)), separate samples are needed to estimate θ̂ and Λ̂ to avoid overfitting bias.” — FLM (2021), Section 3.2

Auto-Detection#

The package auto-detects θ-dependence:

# From core/algorithm.py
def detect_theta_dependence(family, X, T, theta):
    """Sample 100 observations, perturb θ, check if Hessian changes."""
    sample_idx = np.random.choice(len(X), min(100, len(X)))
    H1 = family.hessian(Y[sample_idx], T[sample_idx], theta[sample_idx])
    H2 = family.hessian(Y[sample_idx], T[sample_idx], theta[sample_idx] + 0.1)
    return not np.allclose(H1, H2, rtol=1e-5)

How It Works#

When three_way=True (or auto-detected):

For each fold k:
    Training set I_k^c is split:
    - 60% for training θ̂_k(x)
    - 40% for fitting Λ̂_k(x)

This prevents overfitting: θ̂ and Λ̂ are estimated on independent data.


Regularization#

Lambda Inversion#

Inverting Λ(x) can be unstable when Hessians are near-singular. The package uses Tikhonov regularization:

# From utils/linalg.py
def batch_inverse(Lambda, ridge=1e-4):
    """Invert with ridge regularization for stability."""
    Lambda_reg = Lambda + ridge * torch.eye(Lambda.shape[-1])
    return torch.linalg.inv(Lambda_reg)

Default: ridge=1e-4

Diagnostics#

The package tracks min_lambda_eigenvalue:

Warning: min_lambda_eigenvalue=1.23e-06 < 1e-4
Consider increasing ridge regularization or checking model fit.

When to increase regularization:

  • Near-singular Hessians (min eigenvalue < 1e-4)

  • Logit with p ≈ 0 or p ≈ 1 (separation issues)

  • Overdispersed count models


Variance Estimation#

Within-Fold Formula#

# From core/algorithm.py
Ψ̂ = (1/K) Σ_k Var_k(ψ)
  = (1/K) Σ_k (1/|I_k|) Σ_{i  I_k} (ψᵢ - μ̂_k)²

SE = (Ψ̂/n)

Where μ̂_k is the fold-specific mean.

Theorem 3 (FLM 2021): “The variance estimator \(\hat{V} = n^{-1}\sum_i \hat{\psi}_i^2\) is consistent: \(\hat{V} \xrightarrow{p} V\). The resulting confidence intervals achieve nominal coverage asymptotically.”

High Correction Variance Warning#

Warning: High correction variance ratio (43.96).
This suggests the influence function correction dominates the estimate variance.
Consider using more cross-fitting folds (K >= 50).

This warns when:

correction_variance / total_variance > 10

Indicates the bias correction term H_θ · Λ⁻¹ · ℓ_θ has high variance, often from:

  • Insufficient cross-fitting folds

  • Poor θ̂(x) estimation in some regions

  • Near-singular Lambda


Two APIs#

Legacy: structural_dml()#

For the 8 built-in GLM families:

from deep_inference import structural_dml

result = structural_dml(
    Y=Y, T=T, X=X,
    family='logit',           # linear, logit, poisson, gamma, ...
    # lambda_method='ridge' is default (validated coverage)
    n_folds=50,
    epochs=100,
)

New: inference()#

For flexible targets and custom models:

from deep_inference import inference

result = inference(
    Y=Y, T=T, X=X,
    model='logit',
    target='ame',              # ame, beta, or custom
    # lambda_method='ridge' is default (validated coverage)
    is_randomized=False,
)

Both use the same core algorithm; inference() adds:

  • Regime detection (A/B/C)

  • Custom target functions

  • Treatment distribution specification


Implementation Files#

File

Purpose

core/algorithm.py

Main structural_dml_core() with fold logic

engine/crossfit.py

Modular CrossFitter class

engine/assembler.py

Influence function assembly

engine/variance.py

SE computation

lambda_/estimate.py

MLP/LGBM Lambda estimation

lambda_/selector.py

Regime detection

autodiff/

Score, Hessian, Jacobian computation


Pseudocode (Actual Implementation)#

def structural_dml_core(Y, T, X, family, lambda_method, n_folds, ...):
    # Phase 1: Create folds
    kf = KFold(n_splits=n_folds, shuffle=True)

    # Detect if 3-way splitting needed
    three_way = detect_theta_dependence(family, X, T, initial_theta)

    psi_all = []
    for fold_idx, (train_idx, eval_idx) in enumerate(kf.split(X)):
        # Phase 2a: Train θ̂(x)
        if three_way:
            theta_train = train_idx[:int(0.6 * len(train_idx))]
            lambda_train = train_idx[int(0.6 * len(train_idx)):]
        else:
            theta_train = train_idx
            lambda_train = train_idx

        model = train_structural_net(X[theta_train], T[theta_train], Y[theta_train])
        theta_hat = model(X[eval_idx])

        # Phase 2b: Compute Hessians on lambda_train
        hessians = family.hessian(Y[lambda_train], T[lambda_train], theta_hat_train)

        # Phase 2c: Fit Lambda
        if lambda_method == 'aggregate':
            Lambda = hessians.mean(axis=0)  # Single matrix for all x
        else:
            Lambda_model = train_lambda_estimator(X[lambda_train], hessians)
            Lambda = Lambda_model(X[eval_idx])

        # Phase 3: Assemble influence function
        for i in eval_idx:
            score = family.gradient(Y[i], T[i], theta_hat[i])
            h = target.h(theta_hat[i])
            h_jacobian = target.jacobian(theta_hat[i])

            Lambda_inv = batch_inverse(Lambda[i], ridge=1e-4)
            psi_i = h - h_jacobian @ Lambda_inv @ score
            psi_all.append(psi_i)

    # Phase 4: Aggregate
    mu_hat = np.mean(psi_all)
    se = np.sqrt(np.var(psi_all) / n)

    return DMLResult(mu_hat=mu_hat, se=se, ...)

References#

  • Farrell, Liang, Misra (2021): “Deep Neural Networks for Estimation and Inference” Econometrica

  • Farrell, Liang, Misra (2025): “Deep Learning for Individual Heterogeneity” Working Paper