Consider the standard one-factor factorial setup where we have a grand mean \(\mu\) (as in the mean of the factor level means) and deviations \(\alpha_i\) for the main effects:

\[
y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2)
\]
It might be tempting to set up a design matrix with a column for every parameter, but the OLS solution involves inverting \(X^\top X\), which is not possible here because the rank of \(X\) is less than the number of columns in \(X\):
```r
X <- cbind(1, diag(3))
# in OLS, the estimator involves inverting the design matrix
tryCatch({
  solve(t(X) %*% X)
}, error = function(z) { message(z, "\n") })
```
Error in solve.default(t(X) %*% X): system is computationally singular: reciprocal condition number = 1.38778e-17
However, if we use the intercept as the reference group (or introduce a sum-to-zero constraint) then the inversion goes through:
```r
X <- cbind(1, diag(3)[, 2:3])
# with the first level absorbed into the intercept, t(X) %*% X is invertible
solve(t(X) %*% X)
```
Contrasts define specific linear combinations of parameters and affect their interpretation. When treatment contrasts are adopted with a design matrix that includes main and interaction effects for \(a\) (3 levels) and \(b\) (2 levels), we have:
```r
a <- factor(1:3)
b <- factor(1:2)
# CJ (data.table) forms the cross join of the factor levels
d <- CJ(a, b)
X <- model.matrix(~ a * b, data = d)
X
```
For treatment contrasts, the intercept represents the mean of the reference group, taken to be when \(a\) and \(b\) are both at their first level. For deviation contrasts the intercept becomes the grand mean, equating to the mean of the group means, not the mean value of the outcome in the dataset.
With classic ANOVA, a sum-to-zero constraint allows us to overcome the rank deficiency of the design matrix.
When using the treatment contrasts, the second and third levels of \(a\) are explicit in the design matrix whereas for the deviation contrasts, the first and second levels appear. For example, with a single factor model:
```r
K_a <- 4
# means at levels of a
b_a <- c(-1, 3, 2, 1)
N <- 1e6
a <- sample(1:K_a, size = N, replace = TRUE, prob = c(0.3, 0.2, 0.4, 0.1))
d <- data.table(a = factor(a))
d[, mu := b_a[a]]
d[, y := rnorm(.N, mu, 1)]
# sum-to-zero/deviation contrasts as used for anova
contrasts(d$a) <- contr.sum(K_a)
f1 <- lm(y ~ a, data = d)
```
The coefficients include an intercept and deviations for the first three levels of \(a\).
For the deviation contrasts, the mean for the first level of \(a\) is computed as the grand mean plus the first parameter estimate, and similarly for the second and third. The mean for the last level of \(a\) is implied as the grand mean minus the sum of the first three estimates, hence sum to zero.
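As a quick sketch of that arithmetic, using the `f1` fit from above (the printed coefficient output is not reproduced here):

```r
cf <- coef(f1)
# first K_a - 1 level means: grand mean plus the respective deviation;
# the last level mean is the grand mean minus the sum of the deviations
mu_a <- c(cf[1] + cf[2:K_a], cf[1] - sum(cf[2:K_a]))
round(unname(mu_a), 3)
```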
```r
d_new <- CJ(a = factor(1:K_a))
contrasts(d_new$a) <- contr.sum(K_a)
X <- model.matrix(~ a, data = d_new)
X
```
and \(\sum_{i=1}^{K_a} \alpha_i = 0\), which we can verify empirically by calculating all \(K_a\) group means, subtracting the grand mean from each, and taking the sum:
```r
(mu_a_hat <- t(model.matrix(~ a, data = d_new) %*% coef(f1)))
```
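And the model-free version of the same check, a sketch using the simulated `d` from above: the deviations of the group means from the grand mean (the mean of the group means) sum to zero.

```r
# empirical group means at each level of a
mu_a_emp <- d[, mean(y), keyby = a]$V1
# deviations from the grand mean; compare with coef(f1)[2:K_a]
alpha_hat <- mu_a_emp - mean(mu_a_emp)
round(c(alpha_hat, sum = sum(alpha_hat)), 3)
```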
The above creates an inconvenience for the Bayesian who would like to (i) play frequentist in their insistence on differentiating fixed from random effects and (ii) place equal prior weight on each group of parameters. We could (and often do) respond by fixing the reference parameter at zero to overcome the identification issue, but that puts a different prior weight on the reference parameter relative to the other parameters within the group. An alternative to arbitrarily fixing one of the factor-level parameters to zero is to project the \(K_a\)-dimensional parameter space associated with the factor into a \(K_a - 1\) dimensional space in a way that enforces the same marginal prior on all \(K_a\) terms.
The sum-to-zero constraint is achieved by placing a prior that has a negative correlation across the effects:

\[
\alpha \sim N(0, \Sigma_a), \qquad \Sigma_a = \mathbb{I}_{K_a} - \tfrac{1}{K_a} J_{K_a}
\]
where \(J_{K_a}\) is the \(K_a \times K_a\) matrix of ones, and again we are just considering a single factor at the moment. As a test, we can generate draws from a multivariate normal distribution with this covariance matrix and observe that they sum to zero:
```r
# covariance matrix that induces the sum-to-zero constraint (here K_a = 3)
K_a <- 3
S_a <- diag(K_a) - 1 / K_a
u <- mvtnorm::rmvnorm(3, rep(0, K_a), S_a)
# final column is the sum of the draws:
round(cbind(u, rowSums(u)), 3)
```
\(\Sigma_a\) is not full rank, but it can be decomposed using its eigenvectors. Since \(\Sigma_a\) has rank \(K_a - 1 = 2\), there are only two non-zero eigenvalues, and both are equal to 1. Below, we construct \(\Sigma_a\) as \(Q_a \Lambda Q_a^\top\), where \(Q_a\) is the matrix formed from the \(K_a - 1\) eigenvectors that correspond to the non-zero eigenvalues of \(\Sigma_a\). In this setting, \(\Lambda = \mathbb{I}_{K_a - 1}\).
```r
# for symmetric PSD matrices, the SVD coincides with the eigendecomposition;
# singular values are sorted in decreasing order so we drop the last (zero) one
svd_S_a <- svd(S_a)
# eigenvectors for the non-zero eigenvalues
Q_a <- svd_S_a$v
Q_a <- Q_a[, 1:(K_a - 1)]
# reconstructs the original S_a
Q_a %*% diag(K_a - 1) %*% t(Q_a)
```
The columns of \(Q_a\) define orthonormal contrasts that allow us to identify the \(K_a - 1\) parameters.
Armed with the above, we can now take our original tempting setup for the design matrix (i.e. with a column for every parameter) and convert it to a new design matrix that maps \(\alpha^*\) into the observations:
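A minimal sketch of that conversion; the name `X_star` is assumed here (it is the matrix the back-transformation below relies on), and the saturated design is the identity, as in the implementation further down:

```r
# saturated design: one indicator column per level of a
X_a <- diag(K_a)
# reduced, full-rank design mapping alpha* into the observations
X_star <- X_a %*% Q_a
round(X_star, 3)
```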
To go from the parameters in their original space to the parameters in the reduced dimensional space and back again:
```r
# original parameters constrained to sum to zero
b_a <- c(-1, 0.75, 0.25)
# representation in the lower-dimensional space
b_a_star <- t(Q_a) %*% b_a
# transformed back to b_a
X_star %*% b_a_star
```
```
      [,1]
[1,] -1.00
[2,]  0.75
[3,]  0.25
```
An implementation for the single factor case follows:
```r
get_data <- function(N = 100,
                     K_a = 3,
                     mu = 1,
                     b_a = c(-1, 0.25, 0.75)) {

  a <- sample(1:K_a, size = N, replace = TRUE)
  d <- data.table(a)
  d[, eta := mu + b_a[a]]
  d[, y := rbinom(N, 1, plogis(eta))]

  # saturated design matrix
  X_a <- diag(K_a)

  # covariance matrix to enforce sum to zero
  J <- matrix(1, K_a, K_a)
  S_a <- diag(K_a) - J / K_a

  # decomposition: eigenvectors for the non-zero eigenvalues
  Q_a <- eigen(S_a)$vectors[, 1:(K_a - 1)]

  # full-rank design
  X_a_s <- X_a %*% Q_a

  list(
    d = d, K_a = K_a, mu = mu, b_a = b_a,
    X_a = X_a, X_a_s = X_a_s,
    S_a = S_a, Q_a = Q_a
  )
}
```
Logistic regression
```stan
// Model for:
// independent estimates of treatment arm by subgroup means
data {
  int N;
  array[N] int y;
  // arm index
  array[N] int x1trt;
  // dimension of design matrix
  int ncX1des;
  int nrX1des;
  matrix[nrX1des, ncX1des] X1des;
  vector[ncX1des] sx1;
  int prior_only;
}
transformed data {
  // build full design matrix
  matrix[N, ncX1des] X1 = X1des[x1trt];
}
parameters {
  real mu;
  vector[ncX1des] bx1;
}
transformed parameters {
  vector[N] eta = mu + X1 * bx1;
}
model {
  target += logistic_lpdf(mu | 0, 1);
  target += normal_lpdf(bx1 | 0, sx1);
  if (!prior_only) {
    target += bernoulli_logit_lpmf(y | eta);
  }
}
generated quantities {
}
```
```r
m1 <- cmdstanr::cmdstan_model("stan/note-09-logistic.stan")

l1 <- get_data(N = 1e4, K_a = 5, mu = 0.5, b_a = c(-2, -1, 0, 1, 2))

# would be more efficient to transform into a binomial likelihood
# but keeping bernoulli for now since it allows me to think about
# unit level data (not sure if that will be necessary though)
ld <- list(
  N = length(l1$d$y),
  y = l1$d$y,
  x1trt = l1$d$a,
  nrX1des = nrow(l1$X_a_s), ncX1des = ncol(l1$X_a_s),
  X1des = l1$X_a_s,
  sx1 = rep(1, ncol(l1$X_a_s)),
  prior_only = 1
)

f1 <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 3000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = FALSE,
  max_treedepth = 10
)
```
Running MCMC with 2 parallel chains...
Chain 1 finished in 7.5 seconds.
Chain 2 finished in 7.5 seconds.
Both chains finished successfully.
Mean chain execution time: 7.5 seconds.
Total execution time: 7.6 seconds.
To transform the parameter estimates back to the group means that we are interested in, we compute the product of the parameter draws and the unique rows of the design matrix.
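A sketch of that computation for the single-factor fit, mirroring the two-factor code further below:

```r
# draws of the K_a - 1 unconstrained parameters
m_post_s <- f1$draws(variables = "bx1", format = "matrix")
# map back to the K_a sum-to-zero group offsets
# (with prior_only = 1 these are draws from the implied prior)
post_b_a <- m_post_s %*% t(l1$X_a_s)
round(colMeans(post_b_a), 3)
```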
In theory, these should all have the same prior weight, which appears to be the case, as shown in the figure below:
Figure: Parameter estimates for group-level offsets
Two-factor setup
Here we have two main effects and their associated interaction terms. I have kept this first implementation as a record of my initial mistake regarding the column-wise vs row-wise specification of the interaction parameters. The implementation in the final section is the correct approach.
Revised data generation function
```r
N <- 100
K_a <- 3
K_b <- 2
b0 <- 1
# effects are sum to zero
b_a <- c(0, -1, 1)
b_b <- c(-0.6, 0.6)
b_ab <- c(-0.05, 0.15, -0.1, 0.05, -0.15, 0.1)
# b_ab <- c(-0.05, 0.05, 0.15, -0.15, 0.1, -0.1)

get_data <- function(N = 100,
                     K_a = 3,
                     K_b = 2,
                     b0 = 1,
                     # effects are sum to zero
                     b_a = c(0, -1, 1),
                     b_b = c(-0.6, 0.6),
                     b_ab = c(-0.05, 0.15, -0.1, 0.05, -0.15, 0.1)) {

  stopifnot(all.equal(0, sum(b_a), tolerance = sqrt(.Machine$double.eps)))
  stopifnot(all.equal(0, sum(b_b), tolerance = sqrt(.Machine$double.eps)))
  stopifnot(all.equal(0, sum(b_ab), tolerance = sqrt(.Machine$double.eps)))

  # cols and rows must sum to zero, otherwise the following will not work
  b_ab_matrix <- matrix(b_ab, nrow = K_a, ncol = K_b)
  stopifnot(all.equal(colSums(b_ab_matrix), rep(0, K_b),
                      tolerance = sqrt(.Machine$double.eps)))
  stopifnot(all.equal(rowSums(b_ab_matrix), rep(0, K_a),
                      tolerance = sqrt(.Machine$double.eps)))

  # (double-centering an arbitrary b_ab here is possible, but then the
  # original parameters would never be recovered, so we require valid inputs)

  a <- sample(1:K_a, size = N, replace = TRUE)
  b <- sample(1:K_b, size = N, replace = TRUE)
  d <- data.table(a, b)
  d[, b0 := b0]
  d[, b_a := b_a[a]]
  d[, b_b := b_b[b]]
  # column-wise (a varies fastest) index into the interaction cells
  d[, ix_b_ab := a + (K_a * (b - 1))]
  d[, b_ab := b_ab[ix_b_ab]]
  d[, eta := b0 + b_a + b_b + b_ab]
  # e.g. unique(d[order(b, a)]):
  #        a     b    b0   b_a   b_b  b_ab   eta
  # 1:     1     1     1     0  -0.6  -0.2   0.2
  # 2:     2     1     1    -1  -0.6   0.0  -0.6
  # 3:     3     1     1     1  -0.6   0.6   2.0
  # 4:     1     2     1     0   0.4  -0.4   1.0
  # 5:     2     2     1    -1   0.4   0.2   0.6
  # 6:     3     2     1     1   0.4  -0.2   2.2

  cols_a <- paste0("a", 1:K_a)
  cols_b <- paste0("b", 1:K_b)
  cols_ab <- unique(d[order(b, a), paste0("a", a, "b", b)])

  d[, y := rbinom(N, 1, plogis(eta))]

  # saturated design matrices
  X_a <- diag(K_a)
  colnames(X_a) <- cols_a
  X_b <- diag(K_b)
  colnames(X_b) <- cols_b
  X_ab <- diag(K_a * K_b)
  colnames(X_ab) <- cols_ab

  # covariance matrices to enforce sum to zero
  S_a <- diag(K_a) - (1 / K_a)
  S_b <- diag(K_b) - (1 / K_b)
  S_ab <- kronecker(S_a, S_b)  # rank (K_a - 1) * (K_b - 1), i.e. 2 here

  # decomposition: eigenvectors for the non-zero eigenvalues
  Q_a <- eigen(S_a)$vectors[, 1:(K_a - 1)]
  Q_b <- eigen(S_b)$vectors[, 1:(K_b - 1)]

  # NB: Q_ab <- kronecker(Q_a, Q_b) will not recover the original parameters
  # under the column-wise indexing above (see the final section). Instead,
  # build a matrix that sums the columns and the rows of the interaction
  # parameters, set that to zero and solve (i.e. compute the null space).
  C_ab <- construct_C_ab(K_a, K_b)
  # the null space of C_ab is the set of vectors v with C_ab %*% v = 0
  Q_ab <- pracma::nullspace(C_ab)

  # transformed pars
  b_a_s <- t(Q_a) %*% b_a
  b_b_s <- t(Q_b) %*% b_b
  b_ab_s <- t(Q_ab) %*% b_ab

  # full-rank designs
  X_a_s <- X_a %*% Q_a
  X_b_s <- X_b %*% Q_b
  X_ab_s <- X_ab %*% Q_ab
  X_full_s <- create_full_design(X_a_s, X_b_s, X_ab_s)

  # check that the original parameters are recovered
  stopifnot(all.equal(as.numeric(X_a_s %*% b_a_s), b_a,
                      tolerance = sqrt(.Machine$double.eps)))
  stopifnot(all.equal(as.numeric(X_b_s %*% b_b_s), b_b,
                      tolerance = sqrt(.Machine$double.eps)))
  stopifnot(all.equal(as.numeric(X_ab_s %*% b_ab_s), b_ab,
                      tolerance = sqrt(.Machine$double.eps)))

  list(
    d = d, b0 = b0, b_a = b_a, b_b = b_b, b_ab = b_ab,
    b_a_s = b_a_s, b_b_s = b_b_s, b_ab_s = b_ab_s,
    K_a = K_a, K_b = K_b,
    X_a = X_a, X_a_s = X_a_s,
    X_b = X_b, X_b_s = X_b_s,
    X_ab = X_ab, X_ab_s = X_ab_s,
    X_full_s = X_full_s,
    S_a = S_a, S_b = S_b, S_ab = S_ab,
    C_ab = C_ab, Q_a = Q_a, Q_b = Q_b, Q_ab = Q_ab
  )
}
```
We need some utility functions to stop this getting too messy.
Utility functions
```r
# Utility to create the constraint matrix: each row sums the interaction
# parameters over one column or one row of the K_a x K_b grid.
construct_C_ab <- function(K_a, K_b) {
  # total number of interaction parameters
  n_params <- K_a * K_b
  # number of constraints
  n_constraints <- K_a + K_b
  C_ab <- matrix(0, nrow = n_constraints, ncol = n_params)
  # constraints for factor b (columns)
  for (j in 1:K_b) {
    col_indices <- ((j - 1) * K_a + 1):(j * K_a)
    C_ab[j, col_indices] <- 1
  }
  # constraints for factor a (rows)
  for (i in 1:K_a) {
    row_indices <- seq(i, n_params, by = K_a)
    C_ab[K_b + i, row_indices] <- 1
  }
  return(C_ab)
}

# Utility to create the full design matrix from the constrained-space designs.
create_full_design <- function(X_a_s, X_b_s, X_ab_s) {
  K_a <- nrow(X_a_s)
  K_b <- nrow(X_b_s)
  # all combinations of factor levels
  design <- expand.grid(a = 1:K_a, b = 1:K_b)
  # intercept and main effects
  X_main <- cbind(1, X_a_s[design$a, ], X_b_s[design$b, ])
  # interaction effects (a varies fastest in the cell index)
  X_int <- X_ab_s[(design$b - 1) * K_a + design$a, , drop = FALSE]
  # (zeroing out interaction columns when a main effect is zero was
  # considered but is not needed here)
  X_full <- cbind(X_main, X_int)
  return(X_full)
}

# Utility to create random parameter values whose rows and columns sum to zero.
create_sum_zero_matrix <- function(n_rows, n_cols, sd = 1, seed = 1) {
  set.seed(seed)
  # step 1: generate random values
  mat <- matrix(rnorm(n_rows * n_cols, sd = sd), nrow = n_rows, ncol = n_cols)
  # step 2: center columns
  mat <- scale(mat, center = TRUE, scale = FALSE)
  # step 3: center rows
  mat <- t(scale(t(mat), center = TRUE, scale = FALSE))
  # step 4: final adjustment to ensure exact zero sums
  col_adj <- colSums(mat) / n_rows
  row_adj <- rowSums(mat) / n_cols
  for (i in 1:n_rows) {
    for (j in 1:n_cols) {
      mat[i, j] <- mat[i, j] - col_adj[j] - row_adj[i] + mean(col_adj)
    }
  }
  return(mat)
}

# Utility to create a random vector that sums to zero.
create_sum_zero_vec <- function(n = 3, sd = 1, seed = 1) {
  set.seed(seed)
  v <- scale(rnorm(n, 0, sd), center = TRUE, scale = FALSE)
  return(as.numeric(v))
}
```
Do some tests to see if the data generation process makes sense.
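A sketch of the kinds of checks one might run here (the original output is not shown):

```r
l1 <- get_data()
# interaction parameters on the K_a x K_b grid: rows and columns sum to zero
b_ab_m <- matrix(l1$b_ab, nrow = l1$K_a, ncol = l1$K_b)
round(c(rowSums(b_ab_m), colSums(b_ab_m)), 10)
# constrained-space design maps back to the sum-to-zero main effects
round(cbind(as.numeric(l1$X_a_s %*% l1$b_a_s), l1$b_a), 3)
# combined design over all cells
round(l1$X_full_s, 3)
```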
Not sure that these matrices are logical in terms of the products of the various factors…
The model now splits up the design into three separate matrices so that we can look to recover each grouping of parameters. Also, I convert to a binomial likelihood to speed things up.
Logistic regression
```stan
// Model for:
// independent estimates of treatment arm by subgroup means
data {
  int N;
  array[N] int y;
  array[N] int n;
  // arm index
  array[N, 3] int trt;
  // factor 1 design matrix
  int ncXades;
  int nrXades;
  matrix[nrXades, ncXades] Xades;
  vector[ncXades] sa;
  // factor 2 design matrix
  int ncXbdes;
  int nrXbdes;
  matrix[nrXbdes, ncXbdes] Xbdes;
  vector[ncXbdes] sb;
  // factor 1/2 interaction design matrix
  int ncXabdes;
  int nrXabdes;
  matrix[nrXabdes, ncXabdes] Xabdes;
  vector[ncXabdes] sab;
  int prior_only;
}
transformed data {
  // build full design matrices
  matrix[N, ncXades] Xa = Xades[trt[, 1]];
  matrix[N, ncXbdes] Xb = Xbdes[trt[, 2]];
  // indexes the relevant row in the interaction design matrix
  matrix[N, ncXabdes] Xab = Xabdes[trt[, 3]];
}
parameters {
  real mu;
  vector[ncXades] ba;
  vector[ncXbdes] bb;
  vector[ncXabdes] bab;
}
transformed parameters {
  vector[N] eta = mu + Xa * ba + Xb * bb + Xab * bab;
}
model {
  target += logistic_lpdf(mu | 0, 1);
  target += normal_lpdf(ba | 0, sa);
  target += normal_lpdf(bb | 0, sb);
  target += normal_lpdf(bab | 0, sab);
  if (!prior_only) {
    // target += bernoulli_logit_lpmf(y | eta);
    target += binomial_logit_lpmf(y | n, eta);
  }
}
generated quantities {
}
```
Two-factor (3 x 2)
Fit the model without the likelihood involved to examine the implied priors.
```r
m1 <- cmdstanr::cmdstan_model("stan/note-09-logistic-2.stan")

l1 <- get_data(
  N = 1e6, K_a = 3, K_b = 2, b0 = 1,
  b_a = c(0, -1, 1),
  b_b = c(-0.4, 0.4),
  # all rows and all cols have to sum to zero if you
  # want to target recovering the parameters
  b_ab = c(-0.05, 0.15, -0.1, 0.05, -0.15, 0.1)
)

d_smry <- l1$d[, .(y = sum(y), n = .N), keyby = .(b, a)]

ld <- list(
  N = nrow(d_smry),
  y = d_smry$y,
  n = d_smry$n,
  trt = cbind(d_smry$a, d_smry$b, d_smry$a + (l1$K_a * (d_smry$b - 1))),
  nrXades = nrow(l1$X_a_s), ncXades = ncol(l1$X_a_s),
  Xades = l1$X_a_s, sa = rep(1, ncol(l1$X_a_s)),
  nrXbdes = nrow(l1$X_b_s), ncXbdes = ncol(l1$X_b_s),
  Xbdes = l1$X_b_s, sb = rep(1, ncol(l1$X_b_s)),
  nrXabdes = nrow(l1$X_ab_s), ncXabdes = ncol(l1$X_ab_s),
  Xabdes = l1$X_ab_s, sab = rep(1, ncol(l1$X_ab_s)),
  prior_only = 1
)

f1 <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 3000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = FALSE,
  max_treedepth = 10
)
```
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Both chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.1 seconds.
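The parameter-recovery tables below imply a second run with the likelihood switched on, presumably along these lines (an assumption; the original code cell is not shown):

```r
# refit with the likelihood included
ld$prior_only <- 0
f1 <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 3000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = FALSE,
  max_treedepth = 10
)
```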
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Both chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.2 seconds.
```r
m_post_s <- f1$draws(variables = c("ba"), format = "matrix")
post_ba <- m_post_s %*% t(l1$X_a_s)
knitr::kable(
  rbind(b_a = l1$b_a, posterior_mean = colMeans(post_ba)),
  digits = 3,
  caption = "Compare true parameters with posterior means (a factor)"
)
```
|                |        |       |       |       |        |
|:---------------|-------:|------:|------:|------:|-------:|
| b_a            | -1.125 | 1.872 | 0.395 | -0.32 | -0.822 |
| posterior_mean | -1.120 | 1.880 | 0.383 | -0.32 | -0.822 |

: Compare true parameters with posterior means (a factor)
```r
m_post_s <- f1$draws(variables = c("bb"), format = "matrix")
post_bb <- m_post_s %*% t(l1$X_b_s)
knitr::kable(
  rbind(b_b = l1$b_b, posterior_mean = colMeans(post_bb)),
  digits = 3,
  caption = "Compare true parameters with posterior means (b factor)"
)
```
|                |       |        |
|:---------------|------:|-------:|
| b_b            | 0.314 | -0.314 |
| posterior_mean | 0.313 | -0.313 |

: Compare true parameters with posterior means (b factor)
```r
m_post_s <- f1$draws(variables = c("bab"), format = "matrix")
post_bab <- m_post_s %*% t(l1$X_ab_s)
knitr::kable(
  rbind(b_ab = l1$b_ab, posterior_mean = colMeans(post_bab)),
  digits = 3,
  caption = "Compare true parameters with posterior means (interaction)"
)
```
|                |        |        |       |       |       |       |       |        |        |        |
|:---------------|-------:|-------:|------:|------:|------:|------:|------:|-------:|-------:|-------:|
| b_ab           | -0.534 | -0.082 | 0.315 | 0.005 | 0.295 | 0.534 | 0.082 | -0.315 | -0.005 | -0.295 |
| posterior_mean | -0.532 | -0.083 | 0.312 | 0.004 | 0.300 | 0.532 | 0.083 | -0.312 | -0.004 | -0.300 |

: Compare true parameters with posterior means (interaction)
What is it with `kronecker(Q_a, Q_b)`?
The problem with `kronecker(Q_a, Q_b)` was that I was using a column-wise (\(a\) varies fastest) definition for the interaction parameters. However, `kronecker(Q_a, Q_b)` varies \(b\) fastest rather than \(a\), which is incompatible with that specification of the interaction parameters.
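As a check (a sketch, not from the original note, and assuming `construct_C_ab` from the utilities above): reversing the order to `kronecker(Q_b, Q_a)` varies \(a\) fastest, matching `ix_b_ab = a + K_a * (b - 1)`, and its columns satisfy all of the row and column sum-to-zero constraints:

```r
K_a <- 3
K_b <- 2
S_a <- diag(K_a) - 1 / K_a
S_b <- diag(K_b) - 1 / K_b
Q_a <- eigen(S_a)$vectors[, 1:(K_a - 1)]
Q_b <- eigen(S_b)$vectors[, 1:(K_b - 1)]

# reversed Kronecker order: a varies fastest, as in the parameter indexing
Q_ab_alt <- kronecker(Q_b, Q_a)

# every entry should be ~0: the columns lie in the null space of C_ab
C_ab <- construct_C_ab(K_a, K_b)
round(C_ab %*% Q_ab_alt, 10)
```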