and \(\mu\) is the expected number of occurrences, which also equals the variance. Sometimes, you think about \(\mu\) as a rate, e.g. average number of car crashes per 1000 population, per 1000 licensed drivers, per 10000 km travelled etc. In the latter view, rate is in terms of an exposure.
If \(y_i\) denote the number of events over exposure \(n_i\) for the \(i^\text{th}\) covariate pattern, the expected value is
\[
\begin{align*}
\log(\mu_i) &= \log(n_i) + X \beta
\end{align*}
\]
The inclusion of \(\log(n_i)\) is the offset, a known constant whereas the \(X\) and \(\beta\) represent the usual covariate pattern and parameters.
When over-dispersion is present (can indicate something is missing from the linear predictor or independence between observations is questionable) we can switch to a Negative binomial distribution for the likelihood. This has a number of parameterisations, one of which uses the mean directly then controls the overdispersion relative to the square of the mean.
Simulate overdispersed counts associated with 3 vaccine groups and where counts are represented per 35k but our interest is per million.
Simulate overdispersed counts
# we have 20 observations per vax, each having a count and a somewhat distinct # covariate pattern (which I am ignoring for now)n_per_group <-50cells_per_well <-35000# arbitrary valuestrue_rates <-c(vaxA =5, vaxB =10, vaxC =15) / cells_per_well # overdispersion (smaller => more dispersion)phi <-2# Simulate datad_sim <-rbindlist(lapply(names(true_rates), function(vaccine) { mu <- true_rates[vaccine] * cells_per_welldata.table(vaccine = vaccine,response =rnbinom(n_per_group, mu = mu, size = phi),# technically this could be varying for each obscells = cells_per_well )}))d_sim[, group :=as.integer(factor(vaccine))]d_sim[, log_cells :=log(cells)]head(d_sim)
data {int<lower=1> N;int<lower=1> G;array[N] int<lower=1, upper=G> group;array[N] int<lower=0> y;vector[N] log_cells;}parameters {vector[G] b0; // log(rate per cell for each group)real<lower=0> phi; // overdispersion parameter}model {target += normal_lpdf(b0 | 0, 2);target += gamma_lpdf(phi | 2, 0.1);for (i in1:N){target += neg_binomial_2_log_lpmf(y[i] | b0[group[i]] + log_cells[i], phi); } }generated quantities {vector[N] y_rep;vector[G] mu;for (i in1:G){ mu[i] = exp(b0[i]) * pow(10,6); }for (i in1:N){ y_rep[i] = neg_binomial_2_log_rng(b0[group[i]] + log(pow(10,6)), phi); }}
Fit with a stan model, compute the expected values per million in the generated quantities block. The reference values are 143, 286, 429 for groups A, B and C.
Represent the posterior view of the means (per million cells). These are the expected counts, i.e. on average (central tendency) we expect blah from vaccine A, B, C.
Posterior means
d_post <-data.table(f1$draws(variables =c("mu"), format ="matrix"))d_post <-melt(d_post, measure.vars =names(d_post))d_fig_1 <-copy(d_post)d_fig_1[, group :=gsub("mu\\[", "", variable)]d_fig_1[, group :=as.factor(gsub("\\]", "", group))]d_fig_2 <-data.table(group =factor(1:3),tru = true_rates *1e6)ggplot(d_fig_1, aes(x = value, group = group, col = group)) +geom_vline(data = d_fig_2, aes(xintercept = tru, group = group, col = group),lwd =0.2 ) +geom_density() +theme_bw()
Figure 1: Posterior means per million cells by group