```r
library(data.table)
library(ggplot2)
```
Hamiltonian Monte Carlo is usually used to estimate continuous parameters. However, sometimes we are interested in discrete parameters.
HMC runs on log-probabilities. Therefore, as long as we can increment the log-probability associated with our model (whatever the model is), we can code it up, even if the model involves discrete parameters. For models with discrete parameters, we have to 'marginalise out' the discrete parameter(s) and increment the log density by the marginal log density.
Ben Lambert has an example:
Consider a series of \(K\) experiments in which someone flips a coin a fixed number of times, \(n\), per experiment, but we are never told \(n\). The individual uses the same coin across all the experiments, and the probability of heads remains constant, \(\text{Pr}(X = \text{heads}) = \theta\). We define \(X_k\) as the number of heads obtained in experiment \(k\), yielding \((X_1, X_2, \dots, X_K)\). Again, both \(\theta\) and \(n\) are unknown to us; \(\theta\) is continuous, but \(n\) is discrete.
We want to be able to make inferences on both \(\theta\) and \(n\), i.e. talk about probability intervals for \(\theta\) and the probability that \(n\) takes on certain values.
Assume that we know \(K = 10\) experiments were run and that the following data were observed: \(X_k = (2, 4, 3, 3, 3, 3, 3, 3, 4, 4)\).
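In R, these data can be set up directly (a minimal sketch; the names `X` and `K` match the data block of the Stan program further down):

```r
# observed number of heads in each of the K = 10 experiments
X <- c(2, 4, 3, 3, 3, 3, 3, 3, 4, 4)
K <- length(X)
stan_data <- list(K = K, X = X)
```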
We adopt independent priors on \(\theta\) and \(n\): a discrete uniform prior on \(n\) over the permissible values \(\{5, 6, 7, 8\}\), so that \(\text{Pr}(n) = 1/4\) for each, and a standard uniform prior on \(\theta\), i.e. \(\theta \sim \text{Uniform}(0, 1)\).
We can write down the joint posterior of \(\theta\) and \(n\), i.e. \(\text{Pr}(\theta, n \mid X)\), and then marginalise out the discrete parameter, \(n\). Once we have an expression that excludes \(n\), we can get Stan to use that expression to conduct the sampling we want it to do. On the log scale, up to an additive constant, we have:

\[
\log \text{Pr}(\theta, n \mid X) = \log \text{Pr}(X \mid n, \theta) + \log \text{Pr}(n) + \log \text{Pr}(\theta)
\]
The first term on the RHS is the log-likelihood, for which we use the binomial distribution, \(\text{Pr}(X \mid n, \theta) = \prod_{k=1}^{K} \binom{n}{X_k}\, \theta^{X_k} (1 - \theta)^{\,n - X_k}\).
The second term is the log of the discrete uniform distribution that we defined earlier. For a given \(n \in \{ 5,6,7,8 \}\) this is \(\log(1/4)\) as we assume each option has equal probability.
Finally, the third term is the log of the standard uniform prior on \(\theta\). However, given that this term does not contain \(n\), we do not actually need to include it inside the marginalisation (although we do still include it in the model block as a standard uniform). Marginalising out \(n\) then leaves

\[
\log \text{Pr}(\theta \mid X) = \log \sum_{n=5}^{8} \exp\Big( \log \text{Pr}(X \mid n, \theta) + \log \text{Pr}(n) \Big) + \log \text{Pr}(\theta) + \text{const.},
\]

which is the `log_sum_exp` construction used in the Stan program below.
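To make the marginalisation concrete, here is a minimal R sketch (my own, not from the original discussion) that evaluates the marginal log density at a given \(\theta\), writing out `log_sum_exp` by hand:

```r
# unnormalised marginal log density of theta given X, with the discrete
# parameter n marginalised out over its permissible values 5, 6, 7, 8
marginal_log_density <- function(theta, X, n_values = 5:8) {
  # log-likelihood plus log prior, log(1/4), for each candidate n
  lq <- sapply(n_values, function(n) {
    sum(dbinom(X, size = n, prob = theta, log = TRUE)) + log(1 / length(n_values))
  })
  # log_sum_exp(lq), computed in a numerically stable way
  max(lq) + log(sum(exp(lq - max(lq))))
}

marginal_log_density(0.5, X = c(2, 4, 3, 3, 3, 3, 3, 3, 4, 4))
```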
The above allows us to estimate models with discrete parameters by marginalising them out of the joint density. But what if we want to do inference on the discrete parameters?
Answer: write down the unnormalised density of \(n\) conditional on \(X\) and estimate it using the MCMC samples:

\[
q(n \mid X) \approx \frac{1}{B} \sum_{i=1}^{B} \text{Pr}(X \mid n, \theta_i)\, \text{Pr}(n),
\]
where \(B\) is the number of MCMC samples and \(\theta_i\) are the posterior samples. Essentially, this is averaging over \(\theta_i\).
To get the normalised version, we need to form a simplex (elements are non-negative and sum to one) across the four possible values for \(n\), giving probabilities for \(n = 5\), \(n = 6\), \(n = 7\) and \(n = 8\). We can obtain this from:

\[
\text{Pr}(n \mid X) = \frac{q(n \mid X)}{\sum_{n' = 5}^{8} q(n' \mid X)}
\]
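As a rough R sketch of this averaging (`theta_draws` below is only a placeholder so the snippet runs; in practice it would hold the posterior samples of \(\theta\) from the fitted model):

```r
# placeholder for B posterior draws of theta; take these from the fit
# in practice rather than simulating them like this
theta_draws <- runif(4000, 0.3, 0.5)
n_values <- 5:8

# unnormalised q(n | X): average Pr(X | n, theta_i) * Pr(n) over the draws
q <- sapply(n_values, function(n) {
  mean(sapply(theta_draws, function(theta) {
    prod(dbinom(X, size = n, prob = theta))
  })) * (1 / length(n_values))
})

# normalise across the four candidate values to form the simplex Pr(n | X)
p_n_given_X <- q / sum(q)
p_n_given_X
```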
An implementation of the above discussion is shown below:
```stan
data {
  // num expt
  int<lower=0> K;
  array[K] int X;
}
transformed data {
  // these are the permissible values of n, i.e. 5, 6, 7, 8
  array[4] int n;
  for (i in 1:4) {
    n[i] = 4 + i;
  }
}
parameters {
  real<lower=0, upper=1> theta;
}
transformed parameters {
  // unnormalised density
  vector[4] lq;
  for (i in 1:4) {
    // record the unnormalised density for every possible value of n:
    // the log pmf for the array of values in X conditional on a given n[i]
    // and the parameter theta PLUS the prior on the given n, which is
    // a discrete uniform, i.e. for each n the prior is 0.25
    lq[i] = binomial_lpmf(X | n[i], theta) + log(0.25);
  }
}
model {
  target += uniform_lpdf(theta | 0, 1);
  // marginalise out the troublesome n
  target += log_sum_exp(lq);
}
generated quantities {
  // probability of n given X, i.e. the distribution of n | X
  vector[4] p_n_X;
  p_n_X = exp(lq - log_sum_exp(lq));
}
```
Running the model with the assumed data gives us our parameter estimates for both \(\theta\) and \(n\).
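For completeness, a minimal cmdstanr sketch of compiling and fitting (the file name `marginalised_n.stan` is an assumption; `X` and `K` are as defined above):

```r
library(cmdstanr)

# compile the Stan program above (file name is an assumption)
mod <- cmdstan_model("marginalised_n.stan")

fit <- mod$sample(
  data = list(K = K, X = X),
  seed = 1,
  chains = 4,
  parallel_chains = 4
)

# posterior summaries for theta and the per-draw probabilities of each n;
# the posterior means of p_n_X[1..4] estimate Pr(n = 5 | X), ..., Pr(n = 8 | X)
fit$summary(variables = c("theta", "p_n_X"))
```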
Obviously, the above is somewhat contrived. We generally do not know the bounds on the discrete parameters. For example, how did we know that the bounds of \(n\) were 5 and 8? How would we have modified the model to account for observing a 6 in the data?