Multinomial Logit (Conditional Logit) Tutorial#

The multinomial logit model handles discrete choice among \(J \geq 3\) alternatives with heterogeneous preferences.

Training Configuration#

Multinomial logit needs a more patient training schedule than the binary models, because 3-way cross-fitting leaves less data for each network fit:

result = structural_dml(Y, T, X, family='multinomial_logit',
                        n_alternatives=3, n_attributes=2,
                        patience=50, epochs=300)

Key settings:

  • patience=50 (default 10 is too aggressive for 3-way splitting)

  • epochs=300 (needs more epochs to converge)

  • n >= 8000 recommended for valid coverage
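To see why a small patience can cut training short, the sketch below applies a generic early-stopping rule to a made-up validation curve that plateaus before improving again. Both the rule (stop after `patience` consecutive epochs without a new best validation loss) and the curve are assumptions for illustration; the library's actual stopping logic may differ.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which a generic patience rule stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)   # ran through every epoch

# Hypothetical curve: 30 improving epochs, a 20-epoch plateau, then more progress
val = ([1.0 - 0.01 * e for e in range(30)]
       + [0.72] * 20
       + [0.70 - 0.002 * e for e in range(50)])

stop_10 = early_stop_epoch(val, patience=10)
stop_50 = early_stop_epoch(val, patience=50)
print(stop_10, stop_50)
```

On this curve the shorter patience gives up inside the plateau, while the longer one survives it and reaches the later improvement.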

When to Use#

Use the multinomial logit model when:

  • Individual chooses one from \(J \geq 3\) discrete alternatives

  • Alternatives have measurable attributes (price, quality, etc.)

  • Parameters vary across individuals (heterogeneous preferences)

  • Examples: transportation mode choice, brand selection, product choice

Mathematical Setup#

Data Generating Process#

\[V_{ij} = \alpha_j(W) + X'_{ij} \cdot \beta(W)\]
\[P(Y=j \mid W, X) = \frac{\exp(V_{ij})}{\sum_m \exp(V_{im})}\]

Where:

  • \(W\) = individual characteristics (covariates, NN input)

  • \(X_{ij}\) = attributes of alternative \(j\) for individual \(i\)

  • \(\alpha_j(W)\) = alternative-specific intercepts (\(\alpha_0 = 0\), normalized)

  • \(\beta(W)\) = attribute coefficients (common across alternatives)
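As a minimal sketch of the two formulas above, the snippet below computes the utilities and softmax choice probabilities for one individual with NumPy. The helper name `choice_probabilities` and the numeric values are illustrative and not part of the library.

```python
import numpy as np

def choice_probabilities(alpha, beta, x_alt):
    """Softmax choice probabilities for a single individual.

    alpha: (J,) intercepts, with alpha[0] = 0 for the reference alternative
    beta:  (K,) attribute coefficients shared by all alternatives
    x_alt: (J, K) attributes of each alternative
    """
    v = alpha + x_alt @ beta      # utilities V_ij, shape (J,)
    v = v - v.max()               # softmax is shift-invariant; this avoids overflow
    expv = np.exp(v)
    return expv / expv.sum()

alpha = np.array([0.0, 0.5, -0.3])
beta = np.array([-0.8, 0.5])
x_alt = np.array([[1.0, 0.0],
                  [0.5, 1.0],
                  [2.0, 0.5]])

p = choice_probabilities(alpha, beta, x_alt)
print(p.round(3), p.sum())
```

Because only utility differences matter, adding a constant to every \(\alpha_j\) leaves the probabilities unchanged, which is why \(\alpha_0\) is normalized to zero.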

Estimand#

\[\mu^* = E[\beta_k(W)]\]

The average attribute coefficient across the population. For example, the average price sensitivity.

Loss Function#

\[L(Y, X, \theta) = -\log P(Y=y \mid W, X) = -\log \text{softmax}(V)[y]\]

Categorical cross-entropy loss.
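The loss can be evaluated stably as a log-sum-exp minus the chosen alternative's utility. The sketch below is one way to write it with `scipy.special.logsumexp`; the function name `multinomial_nll` is hypothetical, and the library's internal implementation is not shown in this tutorial.

```python
import numpy as np
from scipy.special import logsumexp

def multinomial_nll(V, y):
    """Per-observation loss -log softmax(V)[y].

    V: (n, J) utilities; y: (n,) chosen alternative indices (int or float).
    """
    y = np.asarray(y, dtype=int)
    return logsumexp(V, axis=1) - V[np.arange(len(y)), y]

V = np.array([[0.0, 1.0, -1.0],
              [2.0, 0.0, 0.0]])
y = np.array([1.0, 0.0])   # float-coded choices, as in the Y encoding of this tutorial

loss = multinomial_nll(V, y)
print(loss)
```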

Influence Score Components#

| Component | Formula |
| --- | --- |
| Score \(\partial\ell/\partial\alpha_j\) | \(p_j - \mathbf{1}\{Y=j\}\) (for \(j=1,\ldots,J-1\)) |
| Score \(\partial\ell/\partial\beta_k\) | \(\sum_j (p_j - \mathbf{1}\{Y=j\}) x_{jk}\) |
| Hessian \(H_{\alpha\alpha}\) (Fisher) | \(H_{\alpha\alpha}[j,m] = p_{j+1}(\delta_{jm} - p_{m+1})\) (zero-based array indices \(j, m = 0,\ldots,J-2\)) |
| Hessian \(H_{\beta\beta}\) | \(\sum_j p_j (x_j - \bar{x}_p)(x_j - \bar{x}_p)'\), where \(\bar{x}_p = \sum_j p_j x_j\) |

Note: For this loss the Hessian does not depend on \(Y\), so it coincides with the Fisher information. It still depends on \(\theta\) through the softmax probabilities.
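The score and Hessian formulas can be checked numerically. The sketch below builds them for a single observation with \(\theta = [\alpha_1, \ldots, \alpha_{J-1}, \beta_1, \ldots, \beta_K]\); the function names are illustrative, not the library's. The full Hessian is written as \(\sum_j p_j (d_j - \bar{d}_p)(d_j - \bar{d}_p)'\), where \(d_j = \partial V_j / \partial\theta\), so it also contains the \(\alpha\)-\(\beta\) cross block that the table does not list.

```python
import numpy as np

def probs(theta, x_alt):
    """Softmax probabilities; theta = [alpha_1..alpha_{J-1}, beta_1..beta_K]."""
    J = x_alt.shape[0]
    alpha = np.concatenate([[0.0], theta[:J - 1]])   # alpha_0 = 0 (reference)
    beta = theta[J - 1:]
    v = alpha + x_alt @ beta
    e = np.exp(v - v.max())
    return e / e.sum()

def loss(theta, x_alt, y):
    return -np.log(probs(theta, x_alt)[y])

def score(theta, x_alt, y):
    resid = probs(theta, x_alt)
    resid[y] -= 1.0                                   # p_j - 1{Y=j}
    return np.concatenate([resid[1:], resid @ x_alt])

def hessian(theta, x_alt):
    J, K = x_alt.shape
    p = probs(theta, x_alt)
    D = np.hstack([np.eye(J)[:, 1:], x_alt])          # row j is dV_j/dtheta
    Dc = D - p @ D                                    # centre rows at their p-weighted mean
    return (Dc * p[:, None]).T @ Dc

theta = np.array([0.5, -0.3, -0.8, 0.5])
x_alt = np.array([[1.0, 0.0], [0.5, 1.0], [2.0, 0.5]])
y = 2

g = score(theta, x_alt, y)
H = hessian(theta, x_alt)
print(g.round(4))
print(H.round(4))
```

Note that `hessian` takes no `y` argument, which mirrors the statement that the Hessian does not depend on the observed choice.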

Data Encoding#

Multinomial logit has a unique data layout compared to other families:

| Variable | Shape | Description |
| --- | --- | --- |
| W (passed as X) | (n, d_w) | Individual characteristics (NN input) |
| T | (n, J*K) | Packed alternative attributes |
| Y | (n,) | Chosen alternative index (float: 0, 1, …, J-1) |

Parameter vector: \(\theta = [\alpha_1, \ldots, \alpha_{J-1}, \beta_1, \ldots, \beta_K]\), so theta_dim = (J-1) + K.

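Treatment encoding: Alternative attributes are packed as (n, J*K). Each row concatenates the \(K\) attributes of alternative 0, then those of alternative 1, and so on through alternative \(J-1\) (for example, [cost_car, time_car, cost_bus, time_bus, cost_train, time_train]). Inside the model, this is reshaped to (n, J, K).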
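The packing can be illustrated with plain NumPy reshapes. This sketch assumes the model's internal reshape is the inverse of the row-major packing shown here, so attribute \(k\) of alternative \(j\) sits in column `j*K + k` of T.

```python
import numpy as np

n, J, K = 4, 3, 2
rng = np.random.default_rng(0)

X_alt = rng.normal(size=(n, J, K))   # attributes per individual and alternative
T = X_alt.reshape(n, J * K)          # pack: (n, J, K) -> (n, J*K)
X_back = T.reshape(n, J, K)          # unpack: (n, J*K) -> (n, J, K)

j, k = 2, 1
col = j * K + k                      # column of T holding attribute k of alternative j
theta_dim = (J - 1) + K              # J-1 free intercepts plus K coefficients

print(T.shape, col, theta_dim)
```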

Complete Example#

import numpy as np
import torch
from deep_inference import structural_dml

# === DGP Setup ===
J = 3   # alternatives
K = 2   # attributes per alternative
d_w = 3 # individual characteristics
n = 8000

np.random.seed(42)

# Individual characteristics W ~ N(0, I)
W = np.random.normal(0, 1, (n, d_w))

# True parameter functions (heterogeneous in W[0])
# alpha_0 = 0 (reference), alpha_1 = 0.5 + 0.2*W[0], alpha_2 = -0.3 - 0.1*W[0]
# beta_1 = -0.8 - 0.2*W[0], beta_2 = 0.5 + 0.1*W[0]
a0 = [0.0, 0.5, -0.3]
a1 = [0.0, 0.2, -0.1]
b0 = [-0.8, 0.5]
b1 = [-0.2, 0.1]

alphas = np.column_stack([a0[j] + a1[j] * W[:, 0] for j in range(J)])
betas = np.column_stack([b0[k] + b1[k] * W[:, 0] for k in range(K)])

# Alternative attributes X ~ N(0, 1)
X_alt = np.random.normal(0, 1, (n, J, K))

# Utilities V_ij = alpha_j + x'_ij * beta
from scipy.special import softmax
V = alphas.copy()
for j in range(J):
    V[:, j] += np.sum(X_alt[:, j, :] * betas, axis=1)
probs = softmax(V, axis=1)

# Sample choices
Y = np.array([np.random.choice(J, p=probs[i]) for i in range(n)]).astype(float)

# Pack T: (n, J, K) -> (n, J*K)
T = X_alt.reshape(n, -1)

# True target: mu* = E[beta_1(W)] = b0[0] + b1[0]*E[W[0]] = b0[0] = -0.8, since E[W[0]] = 0
mu_true = b0[0]
print(f"True mu* = {mu_true}")
print(f"Y distribution: {[f'{(Y==j).mean():.1%}' for j in range(J)]}")
print(f"Shapes: Y={Y.shape}, T={T.shape}, W={W.shape}")

# === Run Inference ===
result = structural_dml(
    Y=Y, T=T, X=W,
    family='multinomial_logit',
    n_alternatives=J,
    n_attributes=K,
    hidden_dims=[64, 32],
    epochs=300,
    patience=50,
    n_folds=50,
    lr=0.01
)

print(result.summary())

Expected Results#

From Eval 09: Multinomial Logit:

Parameter Recovery#

| Component | RMSE | Correlation | Status |
| --- | --- | --- | --- |
| \(\alpha_1\) | 0.08 | 0.90 | PASS |
| \(\alpha_2\) | 0.12 | 0.78 | PASS |
| \(\beta_1\) | 0.09 | 0.88 | PASS |
| \(\beta_2\) | 0.10 | 0.85 | PASS |
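For readers who want to compute similar diagnostics on their own simulations, here is a minimal sketch of RMSE and correlation between true and estimated parameter functions. It assumes both are available as (n, theta_dim) arrays; `theta_hat` below is a noisy stand-in, since this tutorial does not show how to extract fitted parameter functions from the result object.

```python
import numpy as np

def recovery_metrics(theta_true, theta_hat):
    """Column-wise RMSE and Pearson correlation between two (n, d) arrays."""
    rmse = np.sqrt(np.mean((theta_hat - theta_true) ** 2, axis=0))
    corr = np.array([
        np.corrcoef(theta_true[:, c], theta_hat[:, c])[0, 1]
        for c in range(theta_true.shape[1])
    ])
    return rmse, corr

rng = np.random.default_rng(0)
w0 = rng.normal(size=2000)

# True functions from the DGP above: [alpha_1, alpha_2, beta_1, beta_2]
theta_true = np.column_stack([
    0.5 + 0.2 * w0, -0.3 - 0.1 * w0, -0.8 - 0.2 * w0, 0.5 + 0.1 * w0,
])
# Stand-in for network output: truth plus noise of equal size in every column
theta_hat = theta_true + rng.normal(scale=0.1, size=theta_true.shape)

rmse, corr = recovery_metrics(theta_true, theta_hat)
print(rmse.round(3), corr.round(3))
```

With the same noise level in every column, the components with the smaller slopes show lower correlation at a similar RMSE, because correlation depends on how much the true function varies across individuals.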

Coverage (M=50)#

| Metric | Value | Status |
| --- | --- | --- |
| Coverage | 98% | PASS |
| SE Ratio | 0.97 | PASS |
| z-mean | 0.14 | PASS |
| z-std | 0.96 | PASS |
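The sketch below shows one common way to compute these four diagnostics from M Monte Carlo replications. The definitions are assumptions, since the evaluation code is not shown here: coverage is the share of 95% intervals containing the truth, SE ratio is the mean estimated SE over the empirical standard deviation of the estimates, and z is the standardized error. The replications are simulated stand-ins, not library output.

```python
import numpy as np

def coverage_metrics(mu_hat, se_hat, mu_true, z_crit=1.96):
    """Summarise M replications of (estimate, standard error)."""
    z = (mu_hat - mu_true) / se_hat
    return {
        "coverage": float(np.mean(np.abs(z) <= z_crit)),
        "se_ratio": float(se_hat.mean() / mu_hat.std(ddof=1)),
        "z_mean": float(z.mean()),
        "z_std": float(z.std(ddof=1)),
    }

rng = np.random.default_rng(1)
M = 50
mu_true = -0.8
se_hat = np.full(M, 0.05)                              # hypothetical reported SEs
mu_hat = mu_true + rng.normal(scale=0.05, size=M)      # hypothetical estimates

m = coverage_metrics(mu_hat, se_hat, mu_true)
print(m)
```

With M=50, the Monte Carlo standard error of a coverage rate near 95% is about 3 percentage points (\(\sqrt{0.95 \cdot 0.05 / 50} \approx 0.031\)), so an observed 98% is within sampling noise of the nominal level.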

Alternative Targets#

Average Attribute Coefficient (Default)#

# Default target: E[beta_1(W)], the average of the first attribute coefficient
result = structural_dml(Y, T, X, family='multinomial_logit',
                        n_alternatives=3, n_attributes=2, patience=50)

Choice Probability#

Using the inference() API with ChoiceProbabilityTarget:

from deep_inference import inference
from deep_inference.targets.choice_probability import ChoiceProbabilityTarget

# P(Y=j | W, X) for alternative j=1
target = ChoiceProbabilityTarget(alternative=1, n_alternatives=3, n_attributes=2)
result = inference(Y, T, X, model='multinomial_logit', target=target)

Multinomial Average Marginal Effect#

from deep_inference.targets.choice_probability import MultinomialAME

# dP(Y=j)/dx_{jk}: marginal effect of attribute k on probability of choosing j
target = MultinomialAME(alternative=1, attribute=0, n_alternatives=3, n_attributes=2)
result = inference(Y, T, X, model='multinomial_logit', target=target)
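For the softmax model above, the own-attribute marginal effect has the closed form \(\partial P(Y=j)/\partial x_{jk} = \beta_k \, p_j (1 - p_j)\). The sketch below evaluates it with known parameter functions and averages over individuals. It illustrates the quantity only; how `MultinomialAME` computes and debiases it internally is not shown in this tutorial, and the helper names are hypothetical.

```python
import numpy as np

def softmax_rows(V):
    e = np.exp(V - V.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def utilities(alphas, betas, X_alt):
    # V_ij = alpha_j(W_i) + x_ij' beta(W_i)
    return alphas + np.einsum("njk,nk->nj", X_alt, betas)

def own_marginal_effect(alphas, betas, X_alt, j, k):
    """Per-individual dP(Y=j)/dx_jk = beta_k * p_j * (1 - p_j)."""
    p = softmax_rows(utilities(alphas, betas, X_alt))
    return betas[:, k] * p[:, j] * (1.0 - p[:, j])

rng = np.random.default_rng(0)
n, J, K = 1000, 3, 2
w0 = rng.normal(size=n)
alphas = np.column_stack([np.zeros(n), 0.5 + 0.2 * w0, -0.3 - 0.1 * w0])
betas = np.column_stack([-0.8 - 0.2 * w0, 0.5 + 0.1 * w0])
X_alt = rng.normal(size=(n, J, K))

me = own_marginal_effect(alphas, betas, X_alt, j=1, k=0)
ame = me.mean()
print(f"AME of attribute 0 on P(Y=1): {ame:.3f}")
```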

Real-World Applications#

Transportation Mode Choice#

# Y = chosen mode (0=car, 1=bus, 2=train)
# T = packed attributes: [cost_car, time_car, cost_bus, time_bus, cost_train, time_train]
# X = individual characteristics (income, age, ...)
# Target: E[beta_cost(W)] = average price sensitivity

result = structural_dml(Y, T, X, family='multinomial_logit',
                        n_alternatives=3, n_attributes=2, patience=50)

Brand Choice#

# Y = chosen brand (0, 1, ..., J-1)
# T = packed attributes: [price, quality] for each brand
# X = consumer demographics
# Target: E[beta_price(W)] = average price sensitivity (a utility coefficient, not an elasticity)

result = structural_dml(Y, T, X, family='multinomial_logit',
                        n_alternatives=J, n_attributes=2, patience=50)

Market Entry#

# Y = chosen market (0, 1, ..., J-1)
# T = packed market characteristics: [size, competition] per market
# X = firm characteristics
# Target: E[beta_size(W)] = average market size sensitivity

result = structural_dml(Y, T, X, family='multinomial_logit',
                        n_alternatives=J, n_attributes=2, patience=50)

Key Takeaways#

  1. patience=50 is essential: The default patience=10 triggers early stopping too aggressively with 3-way splitting (only 60% of data for training)

  2. n >= 8000 for coverage: 3-way splitting combined with a 4-dimensional theta (theta_dim = (J-1) + K = 4 for J=3, K=2) requires more data than binary logit; n=5000 gives only ~88% coverage

  3. 3-way splitting (Regime C): The Fisher Hessian depends on \(\theta\) (through softmax probabilities), requiring 3-way cross-fitting

  4. Fisher information Hessian: It depends only on the softmax probabilities, not on \(Y\), which is what theory requires of an expected Hessian

  5. correction_ratio ~70-90 is normal: Much larger than binary logit (~2) due to higher-dimensional theta and more complex loss surface

  6. \(\alpha_2\) is hardest to recover: The weakest signal (slope -0.1) needs n >= 10000 for reliable correlation > 0.7

Reference#

Hetzenecker, S. & Osterhaus, C. (2024). “Deep Learning for Heterogeneous Parameters in Discrete Choice Models.” arXiv:2408.09560.