# Algorithm

Implementation of Farrell-Liang-Misra influence function-based inference for neural networks.

```{toctree}
:hidden:
:maxdepth: 1
:caption: Algorithm
```

---

## Overview

The package provides valid confidence intervals for neural network estimates by:

1. Training a structural network θ̂(x) via cross-fitting
2. Computing influence functions ψᵢ to correct for regularization bias
3. Using √n-consistent variance estimation

**Core formula:**
$$\psi_i = H(\hat{\theta}_i) - H_\theta \cdot \Lambda(x_i)^{-1} \cdot \ell_\theta(z_i, \hat{\theta}_i)$$

---

## Algorithm Flow

```
For each fold k = 1, ..., K:
    1. Train θ̂_k(x) on I_k^c (training set)
    2. Compute scores: ℓ_θ(zᵢ, θ̂_k(xᵢ)) for i ∈ I_k
    3. Compute Hessians: ℓ_θθ(zᵢ, θ̂_k(xᵢ)) for i ∈ I_k
    4. Estimate Λ̂_k(x) from Hessians
    5. Assemble ψᵢ = h - H_θ · Λ̂⁻¹ · ℓ_θ

Aggregate:
    μ̂ = (1/n) Σᵢ ψᵢ
    SE = √(Var(ψ)/n)
```

> "Cross-fitting prevents overfitting: the model is trained on data independent of where it is evaluated. This ensures the influence function correction is unbiased."
> — FLM (2021), Algorithm 1

---

## Lambda Estimation (Critical)

The conditional Hessian Λ(x) = E[ℓ_θθ|X=x] has three estimation regimes:

> **Remark 4 (FLM 2021):** "The conditional Hessian Λ(x) = E[ℓ_θθ|X=x] varies with x through θ(x). When treatment is randomized, Λ can be computed analytically; otherwise it must be estimated, requiring three-way sample splitting."

### Regime A: Randomized Experiments (`ComputeLambda`)

When treatment T is randomized and independent of X:
```python
# Monte Carlo integration over treatment distribution
Λ(x) = E_T[ℓ_θθ(Y, T, θ(x)) | X=x]
     ≈ (1/M) Σₘ ℓ_θθ(y, tₘ, θ(x))  where tₘ ~ P(T)
```

Use: `inference(..., is_randomized=True, treatment_dist=Normal(0, 1))`

### Regime B: Linear Models (`AnalyticLambda`)

For linear models, Λ(x) = E[TT'|X=x] has closed form:
```python
# No θ-dependence, so 2-way splitting suffices
Λ(x) = E[(1,T)(1,T)' | X=x]
```

Use: `inference(..., model='linear')` (auto-detected)

### Regime C: Observational Nonlinear (`EstimateLambda`)

For logit, Poisson, etc. in observational settings:

| Method | Correlation | Coverage | Speed | Use Case |
|--------|-------------|----------|-------|----------|
| `ridge` | 0.508 | **96%** | 0.08s | **Recommended (default)** |
| `aggregate` | 0.000 | 95% | 0.02s | Ignores heterogeneity |
| `lgbm` | 0.978 | 96% | 1.2s | High accuracy |
| `rf` | 0.904 | ~90% | 0.3s | Moderate accuracy |
| `mlp` | 0.997 | **67%** | 12.5s | **AVOID** - invalid SEs |

**Key finding:** MLP achieves highest correlation (0.997) but produces **invalid standard errors** (67% coverage instead of 95%). Ridge is the safe default with validated 96% coverage.

Use: `structural_dml(..., lambda_method='ridge')` (default) or `inference(..., lambda_method='lgbm')`

---

## Three-Way Splitting

### When It's Needed

Three-way splitting is required when Λ(x) depends on θ(x).

**Two-way suffices:**
- Linear models (Λ = E[TT'|X], no θ-dependence)
- Randomized experiments (can compute Λ analytically)

**Three-way required:**
- Logit, Poisson, Gamma, Weibull, etc.
- Hessian weights depend on θ: e.g., p(1-p) for logit where p = σ(θ'x)

> "When the Hessian weight depends on θ (as in logit where w = p(1-p) and p = σ(θ'x)), separate samples are needed to estimate θ̂ and Λ̂ to avoid overfitting bias."
> — FLM (2021), Section 3.2

### Auto-Detection

The package auto-detects θ-dependence:
```python
# From core/algorithm.py
def detect_theta_dependence(family, X, T, theta):
    """Sample 100 observations, perturb θ, check if Hessian changes."""
    sample_idx = np.random.choice(len(X), min(100, len(X)))
    H1 = family.hessian(Y[sample_idx], T[sample_idx], theta[sample_idx])
    H2 = family.hessian(Y[sample_idx], T[sample_idx], theta[sample_idx] + 0.1)
    return not np.allclose(H1, H2, rtol=1e-5)
```

### How It Works

When `three_way=True` (or auto-detected):
```
For each fold k:
    Training set I_k^c is split:
    - 60% for training θ̂_k(x)
    - 40% for fitting Λ̂_k(x)
```

This prevents overfitting: θ̂ and Λ̂ are estimated on independent data.

---

## Regularization

### Lambda Inversion

Inverting Λ(x) can be unstable when Hessians are near-singular. The package uses Tikhonov regularization:

```python
# From utils/linalg.py
def batch_inverse(Lambda, ridge=1e-4):
    """Invert with ridge regularization for stability."""
    Lambda_reg = Lambda + ridge * torch.eye(Lambda.shape[-1])
    return torch.linalg.inv(Lambda_reg)
```

**Default:** `ridge=1e-4`

### Diagnostics

The package tracks `min_lambda_eigenvalue`:
```
Warning: min_lambda_eigenvalue=1.23e-06 < 1e-4
Consider increasing ridge regularization or checking model fit.
```

**When to increase regularization:**
- Near-singular Hessians (min eigenvalue < 1e-4)
- Logit with p ≈ 0 or p ≈ 1 (separation issues)
- Overdispersed count models

---

## Variance Estimation

### Within-Fold Formula

```python
# From core/algorithm.py
Ψ̂ = (1/K) Σ_k Var_k(ψ)
  = (1/K) Σ_k (1/|I_k|) Σ_{i ∈ I_k} (ψᵢ - μ̂_k)²

SE = √(Ψ̂/n)
```

Where μ̂_k is the fold-specific mean.

> **Theorem 3 (FLM 2021):** "The variance estimator $\hat{V} = n^{-1}\sum_i \hat{\psi}_i^2$ is consistent: $\hat{V} \xrightarrow{p} V$. The resulting confidence intervals achieve nominal coverage asymptotically."

### High Correction Variance Warning

```
Warning: High correction variance ratio (43.96).
This suggests the influence function correction dominates the estimate variance.
Consider using more cross-fitting folds (K >= 50).
```

This warns when:
```python
correction_variance / total_variance > 10
```

Indicates the bias correction term H_θ · Λ⁻¹ · ℓ_θ has high variance, often from:
- Insufficient cross-fitting folds
- Poor θ̂(x) estimation in some regions
- Near-singular Lambda

---

## Two APIs

### Legacy: `structural_dml()`

For the 8 built-in GLM families:
```python
from deep_inference import structural_dml

result = structural_dml(
    Y=Y, T=T, X=X,
    family='logit',           # linear, logit, poisson, gamma, ...
    # lambda_method='ridge' is default (validated coverage)
    n_folds=50,
    epochs=100,
)
```

### New: `inference()`

For flexible targets and custom models:
```python
from deep_inference import inference

result = inference(
    Y=Y, T=T, X=X,
    model='logit',
    target='ame',              # ame, beta, or custom
    # lambda_method='ridge' is default (validated coverage)
    is_randomized=False,
)
```

Both use the same core algorithm; `inference()` adds:
- Regime detection (A/B/C)
- Custom target functions
- Treatment distribution specification

---

## Implementation Files

| File | Purpose |
|------|---------|
| `core/algorithm.py` | Main `structural_dml_core()` with fold logic |
| `engine/crossfit.py` | Modular `CrossFitter` class |
| `engine/assembler.py` | Influence function assembly |
| `engine/variance.py` | SE computation |
| `lambda_/estimate.py` | MLP/LGBM Lambda estimation |
| `lambda_/selector.py` | Regime detection |
| `autodiff/` | Score, Hessian, Jacobian computation |

---

## Pseudocode (Actual Implementation)

```python
def structural_dml_core(Y, T, X, family, lambda_method, n_folds, ...):
    # Phase 1: Create folds
    kf = KFold(n_splits=n_folds, shuffle=True)

    # Detect if 3-way splitting needed
    three_way = detect_theta_dependence(family, X, T, initial_theta)

    psi_all = []
    for fold_idx, (train_idx, eval_idx) in enumerate(kf.split(X)):
        # Phase 2a: Train θ̂(x)
        if three_way:
            theta_train = train_idx[:int(0.6 * len(train_idx))]
            lambda_train = train_idx[int(0.6 * len(train_idx)):]
        else:
            theta_train = train_idx
            lambda_train = train_idx

        model = train_structural_net(X[theta_train], T[theta_train], Y[theta_train])
        theta_hat = model(X[eval_idx])

        # Phase 2b: Compute Hessians on lambda_train
        hessians = family.hessian(Y[lambda_train], T[lambda_train], theta_hat_train)

        # Phase 2c: Fit Lambda
        if lambda_method == 'aggregate':
            Lambda = hessians.mean(axis=0)  # Single matrix for all x
        else:
            Lambda_model = train_lambda_estimator(X[lambda_train], hessians)
            Lambda = Lambda_model(X[eval_idx])

        # Phase 3: Assemble influence function
        for i in eval_idx:
            score = family.gradient(Y[i], T[i], theta_hat[i])
            h = target.h(theta_hat[i])
            h_jacobian = target.jacobian(theta_hat[i])

            Lambda_inv = batch_inverse(Lambda[i], ridge=1e-4)
            psi_i = h - h_jacobian @ Lambda_inv @ score
            psi_all.append(psi_i)

    # Phase 4: Aggregate
    mu_hat = np.mean(psi_all)
    se = np.sqrt(np.var(psi_all) / n)

    return DMLResult(mu_hat=mu_hat, se=se, ...)
```

---

## References

- Farrell, Liang, Misra (2021): "Deep Neural Networks for Estimation and Inference" *Econometrica*
- Farrell, Liang, Misra (2025): "Deep Learning for Individual Heterogeneity" *Working Paper*