Multinomial regression

regression
bayesian
multinomial
Author

maj

Published

October 4, 2024

Modified

October 6, 2024

Multinomial distribution

The multinomial distribution is a generalisation of the binomial distribution [1]. The binomial distribution counts the successes in a fixed number of trials, each classed as success or failure. The multinomial distribution keeps track of trials whose outcomes fall into multiple categories, e.g. agree, neutral, disagree.

We have \(N\) objects, each of which is independently placed into one of \(K\) categories. An object is placed in category \(k\) with probability \(p_k\), where \(\sum_{k=1}^K p_k = 1\) and \(p_k \ge 0\) for all \(k\). If we let \(Y_1\) be the count of category 1 objects, \(Y_2\) the count for category 2, and so on, so that \(Y_1 + \dots + Y_K = N\), then \(\mathbf{Y} = (Y_1, \dots, Y_K)\) is said to have a multinomial distribution with parameters \(N\) and \(\mathbf{p} = (p_1, \dots, p_K)\). This is written \(\mathbf{Y} \sim \text{Mult}_K(N, \mathbf{p})\), where \(\mathbf{Y}\) is referred to as a random vector as it is a vector of random variables.

If \(\mathbf{Y} \sim \text{Mult}_K(N, \mathbf{p})\) then the joint PMF is:

\[ \begin{aligned} \text{Pr}(Y_1 = y_1, \dots, Y_K = y_K) &= \frac{N!}{y_1!y_2!\dots y_K!} p_1^{y_1}p_2^{y_2}\dots p_K^{y_K} \\ &= \begin{pmatrix} N \\ y_1, \dots, y_K \end{pmatrix} \prod_{k=1}^K p_k^{y_k} \end{aligned} \]
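
As a quick numerical check, the PMF can be evaluated by hand and compared against base R's dmultinom (the counts and probabilities below are just illustrative):

# evaluate the multinomial PMF by hand and via base R's dmultinom
y <- c(2, 3, 5)            # counts in each of K = 3 categories
p <- c(0.2, 0.3, 0.5)      # category probabilities (sum to 1)
N <- sum(y)

c(
  manual  = factorial(N) / prod(factorial(y)) * prod(p^y),
  builtin = dmultinom(y, size = N, prob = p)
)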

The marginal distributions of a multinomial are all binomial, with \(Y_k \sim \text{Bin}(N, p_k)\).

Lumping categories together forms another multinomial distribution over the revised (smaller) set of \(K^\prime\) categories, with both the counts and the probabilities for the lumped groups being additive.
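
Both properties are straightforward to check by simulation; a small sketch using arbitrary values:

# simulate multinomial draws: marginals are binomial, and lumping categories
# simply adds counts and probabilities (a simulation sketch, not a proof)
set.seed(1)
N <- 20
p <- c(0.07, 0.33, 0.60)
Y <- rmultinom(1e5, size = N, prob = p)   # K x n_draws matrix of counts

# marginal of Y_3 vs Bin(N, p_3) moments
c(mean_emp = mean(Y[3, ]), mean_bin = N * p[3],
  var_emp  = var(Y[3, ]),  var_bin  = N * p[3] * (1 - p[3]))

# lump categories 2 and 3: counts and probabilities are additive
y_lump <- Y[2, ] + Y[3, ]
c(mean_emp = mean(y_lump), mean_bin = N * (p[2] + p[3]))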

Multinomial regression

Multinomial regression is also commonly known as multi-logit, multicategory logit, softmax, or categorical regression [2].

In a logistic regression, we pick either of the two outcome states and then produce a single linear predictor for the log-odds. In a categorical regression, there are multiple linear predictors. Typically, the form of the linear predictors is kept the same and the parameters differ for each equation. For example, assuming a single continuous predictor variable, \(x\), we might have:

\[ \begin{aligned} \lambda_k = \beta_{0, k} + \beta_{1,k} x \end{aligned} \]

then convert the \(\lambda_k\) from each linear predictor to a simplex using the softmax function

\[ \begin{aligned} p_k = \frac{\exp(\lambda_k)}{\sum_{i=1}^K \exp(\lambda_i)}, \quad k = 1, \dots, K \end{aligned} \]

which is then used in the multinomial likelihood. This clearly implies the existence of \(K\) linear predictors, but there is an inherent indeterminacy in these equations because a constant \(C_0\) can be added to every \(\beta_{0, k}\) and a constant \(C_1\) can be added to every \(\beta_{1, k}\) and the softmax function will produce the same set of probabilities.
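
The indeterminacy is easy to verify numerically; a minimal sketch with arbitrary values of \(\lambda\):

# softmax maps the K linear predictors to a simplex; adding a constant to
# every lambda_k leaves the resulting probabilities unchanged
softmax <- function(lambda) exp(lambda) / sum(exp(lambda))

lambda <- c(0.5, -1.2, 2.0)
softmax(lambda)
all.equal(softmax(lambda), softmax(lambda + 10))   # TRUE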

The standard way to address this is by introducing a constraint that selects one of the categories as a pivot and then other categories are modeled relative to the pivot, [3]. In other words, there are still \(K\) linear predictors, but one of them is fixed to have a zero value. Thus, the linear predictor has the form:

\[ \begin{aligned} \log \left( \frac{p_k}{p_r} \right) = \beta_{0, k} + \beta_{1,k} x \end{aligned} \]

where \(p_k\) is the probability of the \(k^{\text{th}}\) category and \(p_r\) is the probability of the reference category¹. That is, each linear predictor is the log-odds (logit) of category \(k\) relative to \(r\), and so each logit is associated with a different set of effects. The direct interpretation of the parameters is therefore always conditional on the reference category. For example, given \(\log(p_k/p_r)\), a positive coefficient \(\beta_{1,k}\) indicates that the log-odds of being in category \(k\) versus category \(r\) increases with increasing values of \(x\).
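
To make the pivot parameterisation concrete, a small sketch with arbitrary values: fixing the reference logit at zero and applying softmax gives the probabilities, and taking log probability ratios recovers the free logits.

# pivot parameterisation: K - 1 free logits defined relative to the reference
# category (here category 3); softmax of the padded logits gives the simplex
softmax <- function(lambda) exp(lambda) / sum(exp(lambda))

eta <- c(-2.1, -0.6)           # log(p1/p3) and log(p2/p3)
p   <- softmax(c(eta, 0))      # reference logit fixed at zero
p
log(p[1:2] / p[3])             # recovers eta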

In a Bayesian context, the ‘fix to zero’ constraint isn’t strictly required, but some level of constraint will strengthen identifiability. Alternative approaches include (1) using informative priors, (2) using sum-to-zero constraints, and (3) QR decomposition.

Analyses

Setup

Assume that, for a given city, it is well known that the dominant modes of travel to work are walk/bike, public transport and car, with the distribution given in Table 1. Note that these are mutually exclusive and exhaustive of the possible modes of transport. Now say we want to investigate ways to move people away from cars. Interventions may range from city-wide information on the benefits of using public transport, to financial discounts, to restrictions on parking or taxes. Assume we want to evaluate whether an information mail-out versus financial discounts on public transport achieves a greater transition to public transport.

Table 1: Distribution of dominant mode of transport used to get to work

Mode of transport    Proportion
Walk/bike            0.07
Public transport     0.33
Car                  0.60
library(data.table)

# simulate N units randomised 1:1 to control (trt = 0) and treatment (trt = 1);
# the dominant mode of transport y is sampled from group-specific probabilities
get_data <- function(
    N = 250,
    p0 = c(0.07, 0.33, 0.6),
    p1 = c(0.10, 0.50, 0.4)){
  
  d <- data.table(id = 1:N)
  d[, trt := rep(0:1, each = N/2)]
  
  d[trt == 0, y := sample(seq_along(p0), .N, replace = TRUE, prob = p0)]
  d[trt == 1, y := sample(seq_along(p1), .N, replace = TRUE, prob = p1)]
  
  d
}

We take a random sample of adult working residents from the city. The sample is randomised 1:1 to either a monthly email, sent over 3 months, detailing the benefits of using public transport, or a 3-month discount across all public transport networks within the city. At 3 months, the sample cohort is surveyed to determine their dominant mode of transport over the last month.

Contingency table

One approach to the analysis is via a contingency table, posing the research question: do the dominant modes of transport differ under the two interventions? The chi-squared test of independence is run on the observed counts in each cell as follows:

\[ \begin{aligned} X^2 &= \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - e_{ij})^2}{e_{ij}} \end{aligned} \]

where the expected counts are based on the row and column totals (i.e. assuming independence across groups):

\[ \begin{aligned} e_{ij} = \frac{O_{i.} O_{.j}}{O_{..}} \end{aligned} \]

where \(O_{i.}\) denotes the row totals, \(O_{.j}\) the column totals and \(O_{..}\) denotes the grand total.

If the null hypothesis is true then the observed and the expected frequencies will be similar to one another and \(X^2\) will be small. If \(X^2\) is too large then the null would be rejected, suggesting that the treatment and dominant modes of travel are not independent. Specifically, we would reject the null if the test statistic is greater than or equal to the critical value of a \(\chi^2\) distribution with \((r-1)(c-1)\) degrees of freedom, i.e. reject the null if \(X^2 \ge \chi^2_{(r-1)(c-1);\alpha}\) where \(\alpha\) is the significance level.

A sketch of the approach is shown below.

set.seed(1)
d <- get_data()

# observed
m_obs <- as.matrix(dcast(d[, .N, keyby = .(trt, y)],
                      trt ~ y, value.var = "N"))

tot_cols <- colSums(m_obs[, 2:4])
tot_rows <- rowSums(m_obs[, 2:4])

# expected
m_e <- rbind(
  tot_cols  * tot_rows[1] / sum(tot_rows),
  tot_cols  * tot_rows[2] / sum(tot_rows)
)

# chi-squared statistic vs critical value (df = (r-1)(c-1) = 2)

v_stats <- c(
  chisq_obs = sum(  ((m_obs[, -1] - m_e)^2)/m_e ) ,
  chisq_crit = qchisq(0.95, 2 * 1)
)

round(c(
  v_stats, 
  p_value = pchisq(v_stats[1], 2 * 1, lower.tail = F)), 3)
        chisq_obs        chisq_crit p_value.chisq_obs 
            6.053             5.991             0.048 

Since the observed test statistic is greater than the critical value, we would reject the null in this case. The above can be computed automatically with the built-in function:

res <- chisq.test(m_obs[, -1])
res

    Pearson's Chi-squared test

data:  m_obs[, -1]
X-squared = 6.0528, df = 2, p-value = 0.04849
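
The object returned by chisq.test also carries the expected counts and standardised residuals, which indicate which cells drive the statistic:

# expected counts under independence and standardised (Pearson) residuals
res$expected
res$stdres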

Multi-logit regression

To implement the multi-logit regression, a long format is usually adopted so that each unit \(i\) has an outcome \(Y_i\) equal to one of the possible categories.
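
For the simulated data, the long-format outcome and the corresponding design matrix (intercept plus treatment indicator) can be inspected directly:

# one row per unit: outcome y is the category index (1 = walk/bike,
# 2 = public transport, 3 = car); the design matrix has intercept and trt
head(d[, .(id, trt, y)])
head(cbind(1, d$trt))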

A simple sum-to-zero implementation is shown below.

data {
  // Each unit has an outcome corresponding to one of K categories
  int K;
  // Number of units
  int N;
  // Number of terms in linear predictor (all lp have the 
  // same terms here). Includes intercept.
  int D;
  // Outcome variable (one of the categories)
  array[N] int y;
  // design matrix
  matrix[N, D] x;
}
parameters {
  vector[K-1] a_raw;
  vector[K-1] b_raw;
}
transformed parameters {
  //            k1  k2  k3 etc
  // intercept  -   -   -
  //         x  -   -   -
  matrix[D, K] b;
  
  b[1, ] = append_row(a_raw, -sum(a_raw))';
  b[2, ] = append_row(b_raw, -sum(b_raw))';
}
model {
  matrix[N, K] x_beta = x * b;

  to_vector(a_raw) ~ normal(0, 5);
  to_vector(b_raw) ~ normal(0, 5);

  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');
  }
}
generated quantities{
  
  matrix[K, 2] p;
  
  vector[K] l0 = to_vector(b[1, ]);
  vector[K] l1 = to_vector(b[1, ] + b[2, ]);
  
  p[, 1] = softmax(l0);
  p[, 2] = softmax(l1);
  
}

And a fixed pivot implementation is

data {
  // Each unit has an outcome corresponding to one of K categories
  int K;
  // Number of units
  int N;
  // Number of terms in linear predictor (all lp have the 
  // same terms here). Includes intercept.
  int D;
  // Outcome variable (one of the categories)
  array[N] int y;
  // design matrix
  matrix[N, D] x;
}
parameters {
  //            k1  k2  k3 etc
  // intercept  -   -   -
  //         x  -   -   -
  matrix[D, K-1] b_raw;
}
transformed parameters {
  //            k1  k2  k3 etc
  // intercept  -   -   -
  //         x  -   -   -
  matrix[D, K] b;
  
  b[, 1:(K-1)] = b_raw;
  b[, K] = rep_vector(0.0, D);
}
model {
  matrix[N, K] x_beta = x * b;

  to_vector(b_raw) ~ normal(0, 5);

  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');
  }
}
generated quantities{
  
  matrix[K, 2] p;
  
  vector[K] l0 = to_vector(b[1, ]);
  vector[K] l1 = to_vector(b[1, ] + b[2, ]);
  
  p[, 1] = softmax(l0);
  p[, 2] = softmax(l1);
  
}

Fit both models to the data

m1 <- cmdstanr::cmdstan_model("stan/multi-logit-01.stan")
m2 <- cmdstanr::cmdstan_model("stan/multi-logit-02.stan")

ld <- list(
  N = nrow(d),
  K = length(unique(d$y)),
  y = d$y,
  D = 2,
  x = cbind(1, d$trt)
)

f1 <- m1$sample(data = ld, chains = 1, iter_sampling = 1000, refresh = 0)
f2 <- m2$sample(data = ld, chains = 1, iter_sampling = 1000, refresh = 0)
Running MCMC with 1 chain...

Chain 1 finished in 0.9 seconds.
Running MCMC with 1 chain...

Chain 1 finished in 0.6 seconds.

Both approaches faithfully recover the empirical proportions for the two treatment groups as shown in Table 2. The pivot model gives the most direct path to interpreting the model parameters. However, note that the models are structurally different and cannot be considered completely equivalent since different prior weights enter the models.

Table 2: Observed and modeled distribution for mode of transport

trt  y  N   tot  p      mu_f1  mu_f2
0    1  7   125  0.056  0.057  0.056
0    2  46  125  0.368  0.366  0.370
0    3  72  125  0.576  0.577  0.574
1    1  12  125  0.096  0.097  0.096
1    2  60  125  0.480  0.479  0.479
1    3  53  125  0.424  0.424  0.426
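
The mu_f1 and mu_f2 columns in Table 2 can be reproduced from the posterior means of the generated quantity p (column 1 is the control arm, column 2 the treatment arm); a sketch:

# posterior means of the cell probabilities from each fit
f1$summary(variables = "p", "mean")
f2$summary(variables = "p", "mean")

# empirical proportions by treatment group for comparison
dt_obs <- d[, .N, keyby = .(trt, y)]
dt_obs[, p := N / sum(N), by = trt]
dt_obs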

The parameter estimates from the linear predictors of the two models are summarised below. The terms (treatment effects) of interest are those associated with the second model f2, specifically b[2,1] and b[2,2]. These suggest that (1) units in the treatment group are more likely to have walk/bike rather than car as their dominant mode of transport and (2) units in the treatment group are more likely to have public transport rather than car as their dominant mode of transport.

f1$summary(variables = "b")
# A tibble: 6 × 10
  variable    mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
  <chr>      <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 b[1,1]   -1.43   -1.43  0.271 0.258 -1.88  -0.994 1.00      201.     327.
2 b[2,1]    0.382   0.391 0.339 0.345 -0.165  0.925 1.00      195.     366.
3 b[1,2]    0.488   0.483 0.162 0.158  0.227  0.759 1.00      198.     232.
4 b[2,2]    0.0985  0.102 0.213 0.208 -0.237  0.432 1.01      230.     367.
5 b[1,3]    0.946   0.941 0.163 0.156  0.683  1.21  0.999     416.     625.
6 b[2,3]   -0.481  -0.481 0.214 0.216 -0.830 -0.123 1.00      388.     375.
f2$summary(variables = "b")
# A tibble: 6 × 10
  variable   mean median    sd   mad      q5     q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 b[1,1]   -2.40  -2.38  0.419 0.423 -3.17   -1.76    1.00     353.     491.
2 b[2,1]    0.875  0.868 0.538 0.549  0.0177  1.76    1.01     369.     376.
3 b[1,2]   -0.443 -0.445 0.196 0.181 -0.769  -0.0980  1.00     491.     608.
4 b[2,2]    0.562  0.555 0.275 0.277  0.112   1.02    1.00     506.     600.
5 b[1,3]    0      0     0     0      0       0      NA         NA       NA 
6 b[2,3]    0      0     0     0      0       0      NA         NA       NA 
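
Since the f2 coefficients are log-odds relative to the reference (car) category, exponentiating the treatment terms gives posterior odds ratios; a sketch using the posterior package:

library(posterior)

# treatment effects from the pivot model on the odds-ratio scale
# (walk/bike vs car and public transport vs car)
drws <- as_draws_matrix(f2$draws(variables = c("b[2,1]", "b[2,2]")))
apply(exp(drws), 2, quantile, probs = c(0.05, 0.5, 0.95))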

These results are consistent with the results from the contingency table analysis in that they suggest the public transport discount intervention may increase the likelihood of people taking this form of transport to work.

Unlike the contingency analysis, the multi-logit approach offers the usual regression benefit of being able to adjust for covariates that may be relevant.

Poisson regression

Multinomial logit models can also be fit using an equivalent log-linear model and a series of Poisson likelihoods. This is mathematically equivalent to the multinomial and, computationally, the Poisson approach can be easier, see [3] (p. 365).
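
The equivalence can be sketched as follows: if the category counts are modelled as independent Poissons with log-rates equal to the linear predictors, then conditional on their total they are multinomial with softmax probabilities.

\[ \begin{aligned} Y_k &\sim \text{Pois}(\mu_k), \quad \log \mu_k = a_k + b_k x \\ (Y_1, \dots, Y_K) \mid \textstyle\sum_k Y_k = N &\sim \text{Mult}_K(N, \mathbf{p}), \quad p_k = \frac{\mu_k}{\sum_{j=1}^K \mu_j} = \frac{\exp(a_k + b_k x)}{\sum_{j=1}^K \exp(a_j + b_j x)} \end{aligned} \]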

An implementation is shown below

data {
  // Each unit has an outcome corresponding to one of K categories
  int K;
  // Number of units
  int N;
  // Outcome variable (one of the categories)
  // y1= walk/bike, y2=public transport, y3=car
  array[N] int y1;
  array[N] int y2;
  array[N] int y3;
  // design matrix
  vector[N] x;
}
parameters {
  vector[K] a;
  vector[K] b;
}
model {

  to_vector(a) ~ normal(0, 5);
  to_vector(b) ~ normal(0, 5);
  
  target += poisson_log_lpmf(y1 | a[1] + x * b[1]);
  target += poisson_log_lpmf(y2 | a[2] + x * b[2]);
  target += poisson_log_lpmf(y3 | a[3] + x * b[3]);
  
}
generated quantities{
  
  matrix[K, 2] p;
  
  p[1, 1] = exp(a[1]) * inv( exp(a[1]) + exp(a[2]) + exp(a[3]));
  p[2, 1] = exp(a[2]) * inv( exp(a[1]) + exp(a[2]) + exp(a[3]));
  p[3, 1] = exp(a[3]) * inv( exp(a[1]) + exp(a[2]) + exp(a[3]));
  
  p[1, 2] = exp(a[1] + b[1]) * inv( exp(a[1] + b[1]) + exp(a[2] + b[2]) + exp(a[3] + b[3]));
  p[2, 2] = exp(a[2] + b[2]) * inv( exp(a[1] + b[1]) + exp(a[2] + b[2]) + exp(a[3] + b[3]));
  p[3, 2] = exp(a[3] + b[3]) * inv( exp(a[1] + b[1]) + exp(a[2] + b[2]) + exp(a[3] + b[3]));

}

In order to fit the model, it is necessary to first create an indicator variable for the occurrence of each category.

# need to create a binary indicator for each category:
d[, y1 := ifelse(y == 1, 1, 0)]
d[, y2 := ifelse(y == 2, 1, 0)]
d[, y3 := ifelse(y == 3, 1, 0)]

m3 <- cmdstanr::cmdstan_model("stan/poisson-01.stan")

ld <- list(
  N = nrow(d),
  K = length(unique(d$y)),
  y1 = d$y1, 
  y2 = d$y2,
  y3 = d$y3,
  x = d$trt
)

f3 <- m3$sample(data = ld, chains = 1, iter_sampling = 1000, refresh = 0)
Running MCMC with 1 chain...

Chain 1 finished in 0.4 seconds.
f3$summary(variables = "b")
# A tibble: 3 × 10
  variable   mean median    sd   mad      q5     q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 b[1]      0.528  0.532 0.482 0.445 -0.280   1.36   1.00      630.     485.
2 b[2]      0.258  0.257 0.196 0.200 -0.0624  0.586  1.00      640.     662.
3 b[3]     -0.309 -0.302 0.180 0.173 -0.608  -0.0170 0.999     683.     607.
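
As a rough consistency check, the contrasts b[k] - b[3] from the Poisson fit should approximate the pivot-model treatment effects b[2,1] and b[2,2] from f2, since car (category 3) is the reference in both parameterisations; a sketch:

library(posterior)

# contrasts of the Poisson slopes against the car category (k = 3)
b_drws <- as_draws_matrix(f3$draws(variables = "b"))
walk_vs_car <- b_drws[, "b[1]"] - b_drws[, "b[3]"]
pt_vs_car   <- b_drws[, "b[2]"] - b_drws[, "b[3]"]
c(walk_vs_car = mean(walk_vs_car), pt_vs_car = mean(pt_vs_car))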

The combined set of results is shown below; again, the empirical proportions for each group are captured by the Poisson implementation of the model.

Table 3: Observed and modeled distribution for mode of transport (multi-logit and Poisson)

trt  y  N   tot  p      mu_f1  mu_f2  mu_f3
0    1  7   125  0.056  0.057  0.056  0.057
0    2  46  125  0.368  0.366  0.370  0.370
0    3  72  125  0.576  0.577  0.574  0.573
1    1  12  125  0.096  0.097  0.096  0.095
1    2  60  125  0.480  0.479  0.479  0.480
1    3  53  125  0.424  0.424  0.426  0.424

Extensions

One of the natural extensions of multinomial regression is the use of cluster level effects that are correlated across categories. For more detail, see [4].

References

1. Blitzstein J. Introduction to probability. London: CRC Press; 2019.
2. Agresti A. Introduction to categorical data analysis. New Jersey: Wiley; 2019.
3. McElreath R. Statistical rethinking. London: CRC Press; 2020.
4. Koster J, McElreath R. Multinomial analysis of behavior: Statistical methods. Behav Ecol Sociobiol. 2017;71.

Footnotes

  1. Note that if we have three categories and assign category 3 as the reference then even though our logits are in terms of \(\log(p_1/p_3)\) and \(\log(p_2/p_3)\), we can still compute a logit for \(\log(p_1/p_2)\) by subtracting the logit for \(\log(p_2/p_3)\) from that for \(\log(p_1/p_3)\).