Tobit Model Tutorial#

The Tobit model handles censored data where outcomes pile up at a boundary.

When to Use#

Use the Tobit model when:

Outcome is censored at a boundary (typically 0)
You observe the censored value, not the latent value
Examples: labor supply (hours >= 0), expenditure, donations

Mathematical Setup#

Data Generating Process#

\[Y^* = \alpha(X) + \beta(X) \cdot T + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)\]

\[Y = \max(0, Y^*)\]

Where \(Y^*\) is the latent (unobserved) variable and \(Y\) is the observed (censored) outcome.
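To make the pile-up at the boundary concrete, here is a minimal sketch that simulates the censoring step for a single fixed latent mean and compares the share of zeros with the theoretical value \(\Phi(-\mu/\sigma)\). The values of `mu`, `sigma`, and `n` are illustrative assumptions, and the sketch uses only the standard library rather than the package itself.

```python
import random
from statistics import NormalDist

random.seed(0)
mu, sigma, n = 0.5, 1.0, 100_000  # illustrative values, not from the tutorial

# Latent outcome Y* ~ N(mu, sigma^2) and observed outcome Y = max(0, Y*)
y_star = [random.gauss(mu, sigma) for _ in range(n)]
y = [max(0.0, v) for v in y_star]

# Share of observations sitting exactly at the boundary
simulated_rate = sum(v == 0.0 for v in y) / n

# Theoretical censoring probability P(Y* <= 0) = Phi(-mu / sigma)
theoretical_rate = NormalDist().cdf(-mu / sigma)

print(f"simulated share at zero:    {simulated_rate:.4f}")
print(f"theoretical Phi(-mu/sigma): {theoretical_rate:.4f}")
```

The two numbers should agree up to simulation noise, which is why the censoring rate carries information about \(\mu/\sigma\).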

Estimand#

\[\mu^* = E[\beta(X)]\]

This is the average effect of \(T\) on the latent outcome \(Y^*\).

Loss Function#

The loss is the negative Tobit log-likelihood, which has two parts:

Censored observations (\(Y = 0\)): Probability of \(Y^* \leq 0\)
Uncensored observations (\(Y > 0\)): Normal density

\[L = -\sum_{Y_i=0} \log \Phi\left(-\frac{\mu_i}{\sigma}\right) - \sum_{Y_i>0} \left[\log \phi\left(\frac{Y_i - \mu_i}{\sigma}\right) - \log \sigma\right]\]
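The loss above can be sketched in a few lines of plain Python. This is an illustration of the formula under the assumption of left-censoring at zero and a common scalar `sigma`; the function name `tobit_nll` is hypothetical and is not claimed to match the package's internal implementation.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def tobit_nll(y, mu, sigma):
    """Negative log-likelihood for a Tobit model left-censored at zero."""
    total = 0.0
    for y_i, mu_i in zip(y, mu):
        if y_i <= 0.0:
            # Censored: contributes -log P(Y* <= 0) = -log Phi(-mu / sigma)
            total -= math.log(_STD_NORMAL.cdf(-mu_i / sigma))
        else:
            # Uncensored: contributes -[log phi((y - mu) / sigma) - log sigma]
            z = (y_i - mu_i) / sigma
            total -= math.log(_STD_NORMAL.pdf(z)) - math.log(sigma)
    return total

# One censored and one uncensored observation
nll = tobit_nll(y=[0.0, 1.0], mu=[0.0, 1.0], sigma=1.0)
print(f"{nll:.4f}")
```

For very negative arguments \(\Phi\) underflows to zero, so a production implementation would typically use a log-CDF routine instead of `math.log` on the CDF.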

Influence Score Components#

Component	Formula
Residual \(r\)	Mills ratio (censored) or \((Y-\mu)/\sigma\) (uncensored)
Hessian weight \(W\)	\(1 - \Phi(-\mu/\sigma)\)
Score \(\nabla\ell\)	Varies by censoring status

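The Mills ratio here is the inverse Mills ratio \(\phi(z)/\Phi(z)\), evaluated at \(z = -\mu/\sigma\). It arises from differentiating the censored term \(\log \Phi(-\mu/\sigma)\) with respect to \(\mu\).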
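The components in the table can be sketched as follows, together with a finite-difference check that the ratio really is the derivative of the censored log-likelihood term (up to the factor \(-1/\sigma\)). The function names are hypothetical, and the sign and scaling conventions used inside the package may differ.

```python
import math
from statistics import NormalDist

N01 = NormalDist()

def inverse_mills(z):
    """phi(z) / Phi(z), the ratio that appears for censored observations."""
    return N01.pdf(z) / N01.cdf(z)

def residual(y, mu, sigma):
    """Residual from the table: Mills ratio if censored, else standardized error."""
    if y <= 0.0:
        return inverse_mills(-mu / sigma)
    return (y - mu) / sigma

def hessian_weight(mu, sigma):
    """1 - Phi(-mu / sigma), the probability of being uncensored."""
    return 1.0 - N01.cdf(-mu / sigma)

# Finite-difference check: d/dmu log Phi(-mu/sigma) = -inverse_mills(-mu/sigma) / sigma
mu, sigma, eps = 0.3, 1.5, 1e-6
log_cdf = lambda m: math.log(N01.cdf(-m / sigma))
numeric = (log_cdf(mu + eps) - log_cdf(mu - eps)) / (2 * eps)
analytic = -inverse_mills(-mu / sigma) / sigma

print(f"numeric  derivative: {numeric:.6f}")
print(f"analytic derivative: {analytic:.6f}")
```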

Complete Example#

import numpy as np
from deep_inference import structural_dml

# Generate censored data
np.random.seed(42)
n = 2000
X = np.random.randn(n, 10)
T = np.random.randn(n)

# True structural functions
alpha_true = 0.5 + 0.3 * X[:, 0]
beta_true = 0.3 + 0.2 * X[:, 0]
sigma = 1.0
Y_star = alpha_true + beta_true * T + sigma * np.random.randn(n)
Y = np.maximum(0, Y_star)  # Censoring at 0
mu_true = beta_true.mean()

# Check censoring rate
censored_pct = (Y == 0).mean() * 100
print(f"True mu* = {mu_true:.6f}")
print(f"Censored at 0: {censored_pct:.1f}%")

# Run inference
result = structural_dml(
    Y=Y, T=T, X=X,
    family='tobit',
    hidden_dims=[64, 32],
    epochs=100,
    n_folds=50,
    lr=0.01
)

print(result.summary())

Alternative Targets#

Latent Effect (Default)#

# Default target is effect on latent Y*
result = structural_dml(Y, T, X, family='tobit')
# mu* = E[beta(X)] = effect on latent Y*

Observed Effect#

To target the effect on the observed outcome, use the TobitFamily class directly:

from deep_inference import TobitFamily, structural_dml

# Create family with observed target
family = TobitFamily(target='observed')
result = structural_dml(Y, T, X, family=family)
# mu* = E[beta(X) * Phi(mu/sigma)] = effect on observed E[Y]

The observed effect scales \(\beta(X)\) by \(\Phi(\mu/\sigma)\), the probability of being uncensored, so it is smaller in magnitude than the latent effect.
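The factor \(\Phi(\mu/\sigma)\) follows from the mean of a normal variable censored at zero, \(E[Y \mid X, T] = \mu\,\Phi(\mu/\sigma) + \sigma\,\phi(\mu/\sigma)\) with \(\mu = \alpha(X) + \beta(X) T\). The sketch below checks numerically that the derivative of this expression with respect to \(T\) equals \(\beta\,\Phi(\mu/\sigma)\). The parameter values are illustrative assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist

N01 = NormalDist()

def expected_observed(alpha, beta, t, sigma):
    """E[max(0, Y*)] when Y* ~ N(alpha + beta * t, sigma^2)."""
    mu = alpha + beta * t
    return mu * N01.cdf(mu / sigma) + sigma * N01.pdf(mu / sigma)

alpha, beta, sigma, t, eps = 0.5, 0.3, 1.0, 0.2, 1e-6  # illustrative values

# Central finite difference of E[Y] with respect to the treatment T
numeric = (
    expected_observed(alpha, beta, t + eps, sigma)
    - expected_observed(alpha, beta, t - eps, sigma)
) / (2 * eps)

# Closed form: latent effect times the probability of being uncensored
mu = alpha + beta * t
analytic = beta * N01.cdf(mu / sigma)

print(f"numeric  dE[Y]/dT: {numeric:.6f}")
print(f"analytic dE[Y]/dT: {analytic:.6f}")
print(f"latent effect beta: {beta:.6f}")
```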

Parameter Structure#

The Tobit model estimates three parameters per observation:

\[\theta(X) = [\alpha(X), \beta(X), \gamma(X)]\]

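Where \(\sigma(X) = \exp(\gamma(X))\) is the conditional standard deviation of \(\varepsilon\) (not the variance, which is \(\sigma(X)^2\)). The exponential keeps it positive. A constant \(\sigma\), as in the data generating process above, is the special case where \(\gamma(X)\) does not vary with \(X\).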
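As a small illustration of this parameterization, the sketch below maps one observation's parameter vector to the latent mean and the scale. The helper `unpack_theta` is hypothetical and only mirrors the formulas in this section.

```python
import math

def unpack_theta(theta, t):
    """Map (alpha, beta, gamma) and a treatment value t to (mu, sigma)."""
    alpha, beta, gamma = theta
    mu = alpha + beta * t      # latent mean alpha(X) + beta(X) * T
    sigma = math.exp(gamma)    # exp keeps the scale strictly positive
    return mu, sigma

mu, sigma = unpack_theta((0.5, 0.3, 0.0), t=2.0)
print(mu, sigma)

# Even a very negative gamma yields a valid (small, positive) scale
print(unpack_theta((0.0, 0.0, -5.0), t=0.0)[1])
```

Working with \(\gamma\) rather than \(\sigma\) lets the optimizer move freely over the real line without a positivity constraint.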

Real-World Applications#

Labor Supply#

# Y = hours worked (>= 0)
# T = wage rate
# X = (education, family size, non-labor income, ...)
# Target: E[beta(X)] = average effect of the wage rate on latent hours

result = structural_dml(Y, T, X, family='tobit')

Charitable Donations#

# Y = donation amount (>= 0)
# T = match rate offered
# X = (income, past giving, solicitation type, ...)
# Target: E[beta(X)] = average matching effect

result = structural_dml(Y, T, X, family='tobit')

Durable Goods Expenditure#

# Y = spending on cars (many zeros)
# T = income change
# X = (current car age, household size, ...)
# Target: E[beta(X)] = average income effect on car spending

result = structural_dml(Y, T, X, family='tobit')

Handling Different Censoring#

Left-censoring at 0 (Default)#

result = structural_dml(Y, T, X, family='tobit')  # Assumes Y >= 0

Right-censoring#

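# For data censored from above (e.g., top-coded income)
# Transform: Y_new = upper_bound - Y, so that Y_new is left-censored at 0
# Then use standard Tobit on Y_new
# Note: the latent slope on T becomes -beta(X), so negate the estimated
# effect (and swap the confidence bounds) to recover the effect on Y*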
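A minimal sketch of the reflection trick on simulated data, using only the standard library. The bound `upper_bound`, the slope `beta`, and the helper `flip_back` are illustrative assumptions. The point is that the reflected outcome equals \(\max(0, \text{upper\_bound} - Y^*)\), so an effect estimated on it has the opposite sign.

```python
import random

random.seed(1)
upper_bound = 3.0   # hypothetical top-code value
beta = 0.8          # illustrative latent effect of T

t = [random.gauss(0.0, 1.0) for _ in range(1000)]
y_star = [2.0 + beta * t_i + random.gauss(0.0, 1.0) for t_i in t]
y = [min(upper_bound, v) for v in y_star]   # right-censored (top-coded) outcome

# Reflect around the bound: y_new = max(0, upper_bound - Y*), left-censored at 0
y_new = [upper_bound - v for v in y]

def flip_back(estimate, ci_low, ci_high):
    """Map an effect estimated on y_new back to the original outcome scale."""
    return -estimate, -ci_high, -ci_low

print(f"min of reflected outcome: {min(y_new)}")
print(f"share censored at zero:   {sum(v == 0.0 for v in y_new) / len(y_new):.3f}")
print(flip_back(-0.8, -1.0, -0.6))
```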

Two-sided censoring#

# For data censored at both ends
# Requires custom implementation
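For orientation, a custom two-sided loss would usually follow the textbook two-limit Tobit likelihood: one probability term per boundary and the normal density in between. The sketch below is an assumption about what such an implementation could look like, not part of the package, and `two_limit_tobit_nll` is a hypothetical name.

```python
import math
from statistics import NormalDist

N01 = NormalDist()

def two_limit_tobit_nll(y, mu, sigma, lower, upper):
    """Negative log-likelihood with censoring at both `lower` and `upper`."""
    total = 0.0
    for y_i, mu_i in zip(y, mu):
        if y_i <= lower:
            # Censored from below: P(Y* <= lower)
            total -= math.log(N01.cdf((lower - mu_i) / sigma))
        elif y_i >= upper:
            # Censored from above: P(Y* >= upper) = Phi(-(upper - mu) / sigma)
            total -= math.log(N01.cdf(-(upper - mu_i) / sigma))
        else:
            # Interior observation: normal density
            z = (y_i - mu_i) / sigma
            total -= math.log(N01.pdf(z)) - math.log(sigma)
    return total

# With mu halfway between the limits, both boundaries are equally likely
low_term = two_limit_tobit_nll([0.0], [1.0], 1.0, lower=0.0, upper=2.0)
high_term = two_limit_tobit_nll([2.0], [1.0], 1.0, lower=0.0, upper=2.0)
print(f"{low_term:.4f} {high_term:.4f}")
```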

Key Takeaways#

Latent vs observed: Choose target based on research question
Mills ratio: Key ingredient for censored observations
Joint sigma estimation: The model estimates the conditional standard deviation \(\sigma(X)\) alongside \(\alpha(X)\) and \(\beta(X)\)
Check censoring rate: Very high (>50%) or low (<10%) censoring can cause issues
Three parameters: alpha, beta, and gamma (log-sigma)
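The censoring-rate check from the takeaways can be wrapped in a small helper. The thresholds of 10% and 50% are the rule-of-thumb values quoted above, and the function name `censoring_report` is hypothetical.

```python
def censoring_report(y, low=0.10, high=0.50):
    """Return the share of observations at or below zero and a coarse flag."""
    rate = sum(v <= 0 for v in y) / len(y)
    if rate > high:
        status = "high"   # most of the sample sits at the boundary
    elif rate < low:
        status = "low"    # very few censored observations
    else:
        status = "ok"
    return rate, status

print(censoring_report([0, 0, 0, 1]))
print(censoring_report([0, 1, 2, 3]))
print(censoring_report([1] * 19 + [0]))
```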