# Multinomial Logit (Conditional Logit) Tutorial

The multinomial logit model handles discrete choice among $J \geq 3$ alternatives with heterogeneous preferences.

## Training Configuration

Multinomial logit requires more patient training than binary models because of 3-way cross-fitting:

```python
result = structural_dml(
    Y, T, X,
    family='multinomial_logit',
    n_alternatives=3,
    n_attributes=2,
    patience=50,
    epochs=300
)
```

**Key settings:**

- `patience=50` (the default of 10 is too aggressive for 3-way splitting)
- `epochs=300` (needs more epochs to converge)
- `n >= 8000` recommended for valid coverage

## When to Use

Use the multinomial logit model when:

- An individual chooses one of $J \geq 3$ discrete alternatives
- Alternatives have measurable attributes (price, quality, etc.)
- Parameters vary across individuals (heterogeneous preferences)
- Examples: transportation mode choice, brand selection, product choice

## Mathematical Setup

### Data Generating Process

$$V_{ij} = \alpha_j(W) + X'_{ij} \cdot \beta(W)$$

$$P(Y=j \mid W, X) = \frac{\exp(V_{ij})}{\sum_m \exp(V_{im})}$$

Where:

- $W$ = individual characteristics (covariates, NN input)
- $X_{ij}$ = attributes of alternative $j$ for individual $i$
- $\alpha_j(W)$ = alternative-specific intercepts ($\alpha_0 = 0$, normalized)
- $\beta(W)$ = attribute coefficients (common across alternatives)

### Estimand

$$\mu^* = E[\beta_k(W)]$$

The average attribute coefficient across the population, for example the average price sensitivity.

### Loss Function

$$L(Y, X, \theta) = -\log P(Y=y \mid W, X) = -\log \text{softmax}(V)[y]$$

This is the categorical cross-entropy loss.
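The utility, choice-probability, and loss formulas from the Mathematical Setup can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the library's internal implementation: the helper names `choice_probabilities` and `cross_entropy` are hypothetical, and the parameter values are held constant across individuals for simplicity.

```python
import numpy as np

def choice_probabilities(alpha, beta, x_alt):
    """Softmax choice probabilities (hypothetical helper).

    alpha: (n, J) intercepts, with alpha[:, 0] = 0 for the reference alternative
    beta:  (n, K) attribute coefficients, shared across alternatives
    x_alt: (n, J, K) attributes of each alternative
    """
    # V_ij = alpha_j + x_ij' beta
    v = alpha + np.einsum("njk,nk->nj", x_alt, beta)
    # Subtract the row maximum for numerical stability; softmax is unchanged.
    v = v - v.max(axis=1, keepdims=True)
    exp_v = np.exp(v)
    return exp_v / exp_v.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    """Per-observation loss -log P(Y = y), with y stored as floats 0..J-1."""
    idx = np.asarray(y).astype(int)
    return -np.log(probs[np.arange(idx.shape[0]), idx])

rng = np.random.default_rng(0)
n, J, K = 5, 3, 2
alpha = np.tile([0.0, 0.5, -0.3], (n, 1))   # alpha_0 = 0 (normalized)
beta = np.tile([-0.8, 0.5], (n, 1))
x_alt = rng.normal(size=(n, J, K))
y = np.array([0.0, 1.0, 2.0, 1.0, 0.0])

probs = choice_probabilities(alpha, beta, x_alt)
loss = cross_entropy(probs, y)
print(probs.sum(axis=1))  # each row sums to 1
print(loss.mean())        # average categorical cross-entropy
```

In the actual model, `alpha` and `beta` would be the outputs of the network evaluated at $W$ rather than constants.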

### Influence Score Components

| Component | Formula |
|-----------|---------|
| Score $\partial\ell/\partial\alpha_j$ | $p_j - \mathbf{1}\{Y=j\}$ (for $j=1,\ldots,J-1$) |
| Score $\partial\ell/\partial\beta_k$ | $\sum_j (p_j - \mathbf{1}\{Y=j\}) x_{jk}$ |
| Hessian (Fisher) | $H_{\alpha\alpha}[j,m] = p_{j+1}(\delta_{jm} - p_{m+1})$ |
| Hessian $H_{\beta\beta}$ | $\sum_j p_j (x_j - \bar{x}_p)(x_j - \bar{x}_p)'$ |

Note: The Hessian is the **Fisher information** (does not depend on $Y$), but depends on $\theta$ through the softmax probabilities.

## Data Encoding

Multinomial logit has a unique data layout compared to other families:

| Variable | Shape | Description |
|----------|-------|-------------|
| `W` (passed as `X`) | `(n, d_w)` | Individual characteristics (NN input) |
| `T` | `(n, J*K)` | Packed alternative attributes |
| `Y` | `(n,)` | Chosen alternative index (float: 0, 1, ..., J-1) |

**Parameter vector:** $\theta = [\alpha_1, \ldots, \alpha_{J-1}, \beta_1, \ldots, \beta_K]$, so `theta_dim = (J-1) + K`.

**Treatment encoding:** Alternative attributes are packed as `(n, J*K)`: each row contains the $K$ attributes for all $J$ alternatives concatenated. Inside the model, this is reshaped to `(n, J, K)`.
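
The packed layout described under Data Encoding can be illustrated with a small round trip. This is a minimal sketch that assumes the model's internal reshape uses NumPy's default row-major order, so that attribute `k` of alternative `j` sits in column `j*K + k`; the variable names are illustrative only.

```python
import numpy as np

n, J, K = 4, 3, 2
rng = np.random.default_rng(0)

# Attributes in their natural layout: one (J, K) block per individual.
x_alt = rng.normal(size=(n, J, K))

# Pack to the (n, J*K) layout expected for T.
T = x_alt.reshape(n, J * K)

# Unpack again (assumed to mirror what happens inside the model).
x_back = T.reshape(n, J, K)

# Under row-major packing, attribute k of alternative j is column j*K + k.
j, k = 2, 1
same_value = T[1, j * K + k] == x_alt[1, j, k]

# Parameter vector: J-1 free intercepts plus K attribute coefficients.
theta_dim = (J - 1) + K

print(T.shape, x_back.shape, same_value, theta_dim)
```

If the attributes start out in a long-format table instead, they need to be ordered by alternative first and attribute second before flattening, otherwise the columns of `T` will not line up with this layout.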

## Complete Example

```python
import numpy as np
import torch
from scipy.special import softmax
from deep_inference import structural_dml

# === DGP Setup ===
J = 3    # alternatives
K = 2    # attributes per alternative
d_w = 3  # individual characteristics
n = 8000

np.random.seed(42)

# Individual characteristics W ~ N(0, I)
W = np.random.normal(0, 1, (n, d_w))

# True parameter functions (heterogeneous in W[0])
# alpha_0 = 0 (reference), alpha_1 = 0.5 + 0.2*W[0], alpha_2 = -0.3 - 0.1*W[0]
# beta_1 = -0.8 - 0.2*W[0], beta_2 = 0.5 + 0.1*W[0]
a0 = [0.0, 0.5, -0.3]
a1 = [0.0, 0.2, -0.1]
b0 = [-0.8, 0.5]
b1 = [-0.2, 0.1]

alphas = np.column_stack([a0[j] + a1[j] * W[:, 0] for j in range(J)])
betas = np.column_stack([b0[k] + b1[k] * W[:, 0] for k in range(K)])

# Alternative attributes X ~ N(0, 1)
X_alt = np.random.normal(0, 1, (n, J, K))

# Utilities V_ij = alpha_j + x'_ij * beta
V = alphas.copy()
for j in range(J):
    V[:, j] += np.sum(X_alt[:, j, :] * betas, axis=1)
probs = softmax(V, axis=1)

# Sample choices
Y = np.array([np.random.choice(J, p=probs[i]) for i in range(n)]).astype(float)

# Pack T: (n, J, K) -> (n, J*K)
T = X_alt.reshape(n, -1)

# True target: mu* = E[beta_1(W)] = b0[0] = -0.8 (since E[W[0]] = 0)
mu_true = b0[0]

print(f"True mu* = {mu_true}")
print(f"Y distribution: {[f'{(Y==j).mean():.1%}' for j in range(J)]}")
print(f"Shapes: Y={Y.shape}, T={T.shape}, W={W.shape}")

# === Run Inference ===
result = structural_dml(
    Y=Y, T=T, X=W,
    family='multinomial_logit',
    n_alternatives=J,
    n_attributes=K,
    hidden_dims=[64, 32],
    epochs=300,
    patience=50,
    n_folds=50,
    lr=0.01
)

print(result.summary())
```

## Expected Results

From [Eval 09: Multinomial Logit](../validation/eval_09.md):

### Parameter Recovery

| Component | RMSE | Correlation | Status |
|-----------|------|-------------|--------|
| $\alpha_1$ | 0.08 | 0.90 | PASS |
| $\alpha_2$ | 0.12 | 0.78 | PASS |
| $\beta_1$ | 0.09 | 0.88 | PASS |
| $\beta_2$ | 0.10 | 0.85 | PASS |

### Coverage (M=50)

| Metric | Value | Status |
|--------|-------|--------|
| Coverage | 98% | PASS |
| SE Ratio | 0.97 | PASS |
| z-mean | 0.14 | PASS |
| z-std | 0.96 | PASS |

## Alternative Targets

### Average Attribute Coefficient (Default)

```python
# Default: E[beta_k(W)] for the first beta
result = structural_dml(
    Y, T, X,
    family='multinomial_logit',
    n_alternatives=3,
    n_attributes=2,
    patience=50
)
```

### Choice Probability

Using the `inference()` API with `ChoiceProbabilityTarget`:

```python
from deep_inference import inference
from deep_inference.targets.choice_probability import ChoiceProbabilityTarget

# P(Y=j | W, X) for alternative j=1
target = ChoiceProbabilityTarget(alternative=1, n_alternatives=3, n_attributes=2)
result = inference(Y, T, X, model='multinomial_logit', target=target)
```

### Multinomial Average Marginal Effect

```python
from deep_inference.targets.choice_probability import MultinomialAME

# dP(Y=j)/dx_{jk}: marginal effect of attribute k on probability of choosing j
target = MultinomialAME(alternative=1, attribute=0, n_alternatives=3, n_attributes=2)
result = inference(Y, T, X, model='multinomial_logit', target=target)
```

## Real-World Applications

### Transportation Mode Choice

```python
# Y = chosen mode (0=car, 1=bus, 2=train)
# T = packed attributes: [cost_car, time_car, cost_bus, time_bus, cost_train, time_train]
# X = individual characteristics (income, age, ...)
# Target: E[beta_cost(W)] = average price sensitivity
result = structural_dml(
    Y, T, X,
    family='multinomial_logit',
    n_alternatives=3,
    n_attributes=2,
    patience=50
)
```

### Brand Choice

```python
# Y = chosen brand (0, 1, ..., J-1)
# T = packed attributes: [price, quality] for each brand
# X = consumer demographics
# Target: E[beta_price(W)] = average price sensitivity
result = structural_dml(
    Y, T, X,
    family='multinomial_logit',
    n_alternatives=J,
    n_attributes=2,
    patience=50
)
```

### Market Entry

```python
# Y = chosen market (0, 1, ..., J-1)
# T = packed market characteristics: [size, competition] per market
# X = firm characteristics
# Target: E[beta_size(W)] = average market size sensitivity
result = structural_dml(
    Y, T, X,
    family='multinomial_logit',
    n_alternatives=J,
    n_attributes=2,
    patience=50
)
```

## Key Takeaways

1. **patience=50 is essential**: The default patience=10 triggers early stopping too aggressively with 3-way splitting (only 60% of data for training)
2. **n >= 8000 for coverage**: 3-way splitting + 4D theta requires more data than binary logit; n=5000 gives only ~88% coverage
3. **3-way splitting (Regime C)**: The Fisher Hessian depends on $\theta$ (through softmax probabilities), requiring 3-way cross-fitting
4. **Fisher information Hessian**: Does not depend on $Y$, only on the softmax probabilities; this is theoretically correct for the expected Hessian
5. **correction_ratio ~70-90 is normal**: Much larger than binary logit (~2) due to higher-dimensional theta and more complex loss surface
6. **$\alpha_2$ is hardest to recover**: The weakest signal (slope -0.1) needs n >= 10000 for reliable correlation > 0.7

## Reference

Hetzenecker, S. & Osterhaus, C. (2024). "Deep Learning for Heterogeneous Parameters in Discrete Choice Models." *arXiv:2408.09560*.