which apply to both the response data and covariate data.
How you approach an analysis with missing data is strongly influenced by the assumptions you can make about the missing data process. Complete case analysis is often chastised, but it is sometimes quite a reasonable approach. The following attempts to clarify some of the concepts by way of a simple example.
The following encodes a two-arm parallel group setting where the outcome is treatment failure, indicated by \(y = 1\). The data include a treatment group indicator (\(x\)), a baseline covariate (\(a\)) and an indicator for the presence of an adverse event (\(ae\)).
We can configure the extent to which the outcome \(y\) is influenced by these variables, and also how adverse events arise. For example, an adverse event may make treatment failure more likely.
Data generation function
get_data <- function(
    N = 2000,
    par = NULL,
    # outcome model
    f_y = function(par, x, a, ae){
      eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a + par$b_out[4]*ae
      eta
    },
    # model for adverse events
    f_ae = function(par, x, a){
      eta <- par$b_ae[1] + par$b_ae[2]*x + par$b_ae[3]*a
      eta
    }
){
  # strata
  d <- data.table()
  # intervention - even allocation
  d[, x := rbinom(N, 1, par$pr_x)]
  # baseline covariate, independent of x
  d[, a := rbinom(N, 1, par$pr_a)]
  # adverse events depend on both x and a
  d[, eta_ae := f_ae(par, x, a)]
  d[, ae := rbinom(.N, 1, plogis(eta_ae))]
  # outcome model; y = 1 implies treatment failure
  d[, eta_y := f_y(par, x, a, ae)]
  d[, y := rbinom(.N, 1, plogis(eta_y))]
  d
}
Code
par <- list(
  # betas for outcome model (need to align with f_y)
  # intervention arm is detrimental (increases the chance of failure)
  b_out = c(0.4, 0.3, 0.2, 0.5),
  # betas for ae model b0, b_x, b_a (need to align with f_ae)
  # intervention arm increases likelihood of ae, as does baseline cov
  b_ae = c(-2, 0.5, 0.1),
  pr_x = 0.5,
  pr_a = 0.7
)

# sanity check/consistency
set.seed(2222222)
d <- get_data(1e6, par)

# align with the outcome model
X <- as.matrix(CJ(x = 0:1, a = 0:1, ae = 0:1))
d_X <- cbind(X, eta = as.numeric(cbind(1, X) %*% par$b_out))

# total effect of exposure (x) on y through all paths,
# i.e. possibly through any adverse events as well as the direct path
ate_obs <- coef(glm(y ~ x, data = d, family = binomial))[2]
Below is a quick sanity check to see if we can empirically validate the log-odds of the response in each covariate combination.
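A minimal sketch of that check, using `d` and `d_X` from the chunk above: the empirical log-odds of the response within each covariate combination should line up with the linear predictor `eta` implied by the outcome model.
Code
# empirical log-odds per covariate combination vs the model linear predictor
d_emp <- d[, .(eta_emp = qlogis(mean(y))), keyby = .(x, a, ae)]
cbind(d_emp, eta_mod = d_X[, "eta"])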
Missing completely at random
For MCAR, the probability of missingness depends on neither the observed nor the unobserved data. In the Figure 1 (a) DAG, the indicator of missingness (\(m\)) arises from an independent process; it has nothing to do with the observed or unobserved covariates or response, and so it sits off by itself as an independent node.
The diagram to the right shows what we might use for the statistical model. The circles denote random variables, the squares denote observables, and the arrows denote stochastic and logical dependence. The boxes encasing the variables indicate iteration over units.
On the left-hand side of the model is the representation of the outcome process; here \(y\) is just a function of \(x\) and \(a\). On the right is the missingness model.
(a) DAG MCAR
(b) Model MCAR
Figure 1: MCAR
Code
d <- get_data(1e5, par,
  f_y = function(par, x, a, ae){
    eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a
    eta
  })
# missingness is simply stochastic
d[, m := rbinom(.N, 1, 0.2)]
We can incorporate this missing data mechanism into the data simulation. We retain the full data but create an indicator m to tell us which of the rows would be missing.
We would not observe the outcome for those records that were missing (i.e. those with \(m = 1\)). However, under MCAR, the distribution of the outcome is, in expectation, the same in the observed and the missing data.
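As a quick numerical check (a sketch run on the simulated data above), the mean response is approximately equal across the levels of the missingness indicator:
Code
# under MCAR the mean response is approximately equal in observed and missing rows
d[, .(pr_y = mean(y), n = .N), keyby = m]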
Observed mean response stratified by missingness status
d_fig <- copy(d)
d_fig[, `:=`(
  x = factor(x, labels = c("ctl", "test")),
  a = factor(a, labels = c("<50", ">=50")),
  ae = factor(ae, labels = c("no-ae", "ae")),
  m = factor(m, labels = c("obs", "mis")))]
d_fig <- d_fig[, .(y = sum(y), n = .N), keyby = .(x, a, m)]
d_fig[, eta_obs := qlogis(y / n)]
ggplot(d_fig, aes(x = x, y = eta_obs)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_x_discrete("Exposure") +
  scale_y_continuous("log-odds response") +
  facet_grid(a ~ m)
Figure 2: Observed mean response
Given that the missingness is independent of the outcome, the complete case analysis is unbiased.
Code
d_m <- d[, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_ref <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_ref <- summary(f1_ref)$coef
# only select the observed cases
d_m <- d[m == 0, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_cc <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_cc <- summary(f1_cc)$coef
d_tbl <- rbind(
  data.table(desc = "full data", par = rownames(coef_f1_ref), coef_f1_ref),
  data.table(desc = "complete case", par = rownames(coef_f1_cc), coef_f1_cc))
d_tbl[, mu_se := sprintf("%.2f (%.3f)", Estimate, `Std. Error`)]
d_tbl[, par := factor(par, levels = c("(Intercept)", "x", "a"))]
knitr::kable(rbind(dcast(d_tbl, desc ~ par, value.var = "mu_se")))
| desc          | (Intercept)  | x            | a            |
|---------------|--------------|--------------|--------------|
| complete case | 0.41 (0.015) | 0.29 (0.015) | 0.20 (0.016) |
| full data     | 0.41 (0.014) | 0.30 (0.013) | 0.19 (0.015) |
We can use multiple imputation (which assumes MAR) for both the MAR and MCAR settings (the latter being a special case of the former). By default, the mice package predicts each missing column from all other variables in the data, in an attempt to make maximal use of the observed information. We can see that there isn't really any benefit to imputation when the missing data mechanism is MCAR.
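To see which predictors mice will use by default, we can inspect the default predictor matrix (a quick sketch on a toy data frame with the same columns; a 1 means the column variable is used when imputing the row variable):
Code
library(mice)
# rows are imputation targets, columns are predictors; the diagonal is zero
make.predictorMatrix(data.frame(x = 0, a = 0, y = NA))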
Code
m1 <- cmdstanr::cmdstan_model("stan/missing-ex-01.stan")
d <- get_data(1e5, par,
  f_y = function(par, x, a, ae){
    eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a
    eta
  })
# missingness is an independent process
d[, m := rbinom(.N, 1, 0.2)]
d_bin <- d[, .(y = sum(y), n = .N), keyby = .(x, a)]
# full data
ld <- list(
  N_obs = nrow(d_bin),
  y = d_bin[, y], n = d_bin[, n], P = 3,
  X_obs = model.matrix(~ x + a, data = d_bin),
  prior_only = 0)
f1_ref <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 1000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
  max_treedepth = 10)
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Both chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.2 seconds.
Code
d_post_ref <- data.table(f1_ref$draws(variables = "b", format = "matrix"))
d_post_ref[, desc := "full data"]

# complete case
d_bin <- d[m == 0, .(y = sum(y), n = .N), keyby = .(x, a)]
ld <- list(
  N_obs = nrow(d_bin),
  y = d_bin[, y], n = d_bin[, n], P = 3,
  X_obs = model.matrix(~ x + a, data = d_bin),
  prior_only = 0)
f1_cc <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 1000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
  max_treedepth = 10)
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Both chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.1 seconds.
Code
d_post_cc <- data.table(f1_cc$draws(variables = "b", format = "matrix"))
d_post_cc[, desc := "complete case"]

# imputation sets
d_imp <- copy(d[, .(x, a, y, m)])
d_imp[m == 1, y := NA]
d_imp[, m := NULL]
n_imp <- 50
# the mice logreg method requires factors
d_imp[, `:=`(x = factor(x), a = factor(a), y = factor(y))]
l_imp <- mice(d_imp, m = n_imp, method = "logreg", seed = 23109, printFlag = F)

d_post_imp <- rbindlist(mclapply(1:n_imp, function(i){
  # pick up the i-th imputed data set
  d_cur <- data.table(complete(l_imp, i))
  d_cur[, `:=`(
    x = as.numeric(as.character(x)),
    a = as.numeric(as.character(a)),
    y = as.numeric(as.character(y)))]
  d_bin <- d_cur[, .(y = sum(y), n = .N), keyby = .(x, a)]
  # full (imputed) data
  ld <- list(
    N_obs = nrow(d_bin),
    y = d_bin[, y], n = d_bin[, n], P = 3,
    X_obs = model.matrix(~ x + a, data = d_bin),
    prior_only = 0)
  snk <- capture.output(
    f1_imp <- m1$sample(
      ld, iter_warmup = 1000, iter_sampling = 1000,
      parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
      max_treedepth = 10))
  data.table(f1_imp$draws(variables = "b", format = "matrix"))
}), idcol = "id_imp")
d_post_imp[, desc := "mi"]
Posterior inference on parameters for full data, complete case and MI
Figure 3: Posterior inference on parameters for full data, complete case and MI
Missing at random
For MAR, the probability of missingness depends only on the observed data.
In the Figure 4 (a) DAG, the indicator of missingness arises as a function of the exposure (\(x\)). You could think of this as more missingness in the treatment arm relative to the control arm. Again, we would not observe the outcome for records with \(m = 1\). From the DAG you can see that \(y\) and \(m\) are conditionally independent given \(x\). Given that we include \(x\) in the model, the estimate of the treatment effect will be unbiased.
(a) DAG MAR
(b) Model MAR
Figure 4: MAR
Code
d <- get_data(1e5, par,
  f_y = function(par, x, a, ae){
    eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a
    eta
  })
# missingness is a function of observed covariates
f_mar <- function(par, x, a){
  eta <- -1 + 2*x
  eta
}
d[, eta_m := f_mar(par, x, a)]
d[, m := rbinom(.N, 1, plogis(eta_m))]
In the simulated data, cross tabulating \(y\) and \(m\) (the observed data and the indicator of missingness) indicates they are dependent.
Code
d_tbl <- table(d[, .(y, m)])
X <- chisq.test(d_tbl)
X
Pearson's Chi-squared test with Yates' continuity correction
data: d_tbl
X-squared = 115.95, df = 1, p-value < 2.2e-16
But when we stratify by the exposure and baseline covariate, we can no longer conclude any dependence, as the test below shows.
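A sketch of that stratified test, here conditioning on a single stratum (\(x = 1\), \(a = 1\)); the other strata behave similarly.
Code
# conditional on the covariates, y and m are independent under MAR
d_tbl <- table(d[x == 1 & a == 1, .(y, m)])
X <- chisq.test(d_tbl)
X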
Pearson's Chi-squared test with Yates' continuity correction
data: d_tbl
X-squared = 0.16968, df = 1, p-value = 0.6804
Conditional on the covariates, the probability of missingness is the same for all levels of the outcome across all strata.
Code
d_tbl <- CJ(x = 0:1, a = 0:1, y = 0:1)
d_tbl[, p_mis := plogis(f_mar(par, x, a))]
knitr::kable(
  rbind(dcast(d_tbl, x + a ~ y, value.var = "p_mis")),
  col.names = c("x", "a", "trt.success", "trt.failure"),
  digits = 2)
| x | a | trt.success | trt.failure |
|---|---|-------------|-------------|
| 0 | 0 | 0.27        | 0.27        |
| 0 | 1 | 0.27        | 0.27        |
| 1 | 0 | 0.73        | 0.73        |
| 1 | 1 | 0.73        | 0.73        |
This means that the distribution of the outcome will be approximately the same within each stratum; the missingness is independent of the outcome conditional on the covariates.
Observed mean response stratified by missingness status
d_fig <- copy(d)
d_fig[, `:=`(
  x = factor(x, labels = c("ctl", "test")),
  a = factor(a, labels = c("<50", ">=50")),
  ae = factor(ae, labels = c("no-ae", "ae")),
  m = factor(m, labels = c("obs", "mis")))]
d_fig <- d_fig[, .(y = sum(y), n = .N), keyby = .(x, a, m)]
d_fig[, eta_obs := qlogis(y / n)]
ggplot(d_fig, aes(x = x, y = eta_obs)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_x_discrete("Exposure") +
  scale_y_continuous("log-odds response") +
  facet_grid(a ~ m)
Figure 5: Observed mean response
So, as long as we condition on the predictor of missingness, the complete case estimates are unbiased, albeit inefficient.
Code
d_m <- d[, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_ref <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_ref <- summary(f1_ref)$coef
# only select the observed cases
d_m <- d[m == 0, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_cc <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_cc <- summary(f1_cc)$coef
d_tbl <- rbind(
  data.table(desc = "full data", par = rownames(coef_f1_ref), coef_f1_ref),
  data.table(desc = "complete case", par = rownames(coef_f1_cc), coef_f1_cc))
d_tbl[, mu_se := sprintf("%.2f (%.3f)", Estimate, `Std. Error`)]
d_tbl[, par := factor(par, levels = c("(Intercept)", "x", "a"))]
knitr::kable(rbind(dcast(d_tbl, desc ~ par, value.var = "mu_se")))
| desc          | (Intercept)  | x            | a            |
|---------------|--------------|--------------|--------------|
| complete case | 0.41 (0.018) | 0.30 (0.022) | 0.19 (0.020) |
| full data     | 0.39 (0.014) | 0.31 (0.013) | 0.21 (0.015) |
Similarly, if the missingness arises because of some baseline covariate, then the estimate for the exposure remains unbiased so long as we adjust for the relevant covariate.
Code
d <- get_data(1e5, par,
  f_y = function(par, x, a, ae){
    eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a
    eta
  })
# missingness is a function of an observed baseline covariate
f_mar <- function(par, y, x, a){
  eta <- -1 + 2*a
  eta
}
d[, eta_m := f_mar(par, y, x, a)]
d[, m := rbinom(.N, 1, plogis(eta_m))]
Code
d_m <- d[, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_ref <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_ref <- summary(f1_ref)$coef
# only select the observed cases
d_m <- d[m == 0, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_cc <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_cc <- summary(f1_cc)$coef
d_tbl <- rbind(
  data.table(desc = "full data", par = rownames(coef_f1_ref), coef_f1_ref),
  data.table(desc = "complete case", par = rownames(coef_f1_cc), coef_f1_cc))
d_tbl[, mu_se := sprintf("%.2f (%.3f)", Estimate, `Std. Error`)]
d_tbl[, par := factor(par, levels = c("(Intercept)", "x", "a"))]
knitr::kable(rbind(dcast(d_tbl, desc ~ par, value.var = "mu_se")))
| desc          | (Intercept)  | x            | a            |
|---------------|--------------|--------------|--------------|
| complete case | 0.40 (0.017) | 0.29 (0.021) | 0.21 (0.021) |
| full data     | 0.39 (0.014) | 0.30 (0.013) | 0.21 (0.014) |
For MCAR and MAR, the missing data mechanism is referred to as ignorable because we do not need to specify a model for the missingness in order to make valid inference. We can again use multiple imputation, but because we have conditioned the model correctly, there still isn't any real benefit to the multiple imputation approach.
Code
m1 <- cmdstanr::cmdstan_model("stan/missing-ex-01.stan")
d <- get_data(1e5, par,
  f_y = function(par, x, a, ae){
    eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a
    eta
  })
# missingness is mar - a function of observed covariates
f_mar <- function(par, x, a){
  eta <- -1 + 2*x
  eta
}
d[, eta_m := f_mar(par, x, a)]
d[, m := rbinom(.N, 1, plogis(eta_m))]
d_bin <- d[, .(y = sum(y), n = .N), keyby = .(x, a)]
# full data
ld <- list(
  N_obs = nrow(d_bin),
  y = d_bin[, y], n = d_bin[, n], P = 3,
  X_obs = model.matrix(~ x + a, data = d_bin),
  prior_only = 0)
f1_ref <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 1000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
  max_treedepth = 10)
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Both chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.1 seconds.
Code
d_post_ref <- data.table(f1_ref$draws(variables = "b", format = "matrix"))
d_post_ref[, desc := "full data"]

# complete case
d_bin <- d[m == 0, .(y = sum(y), n = .N), keyby = .(x, a)]
ld <- list(
  N_obs = nrow(d_bin),
  y = d_bin[, y], n = d_bin[, n], P = 3,
  X_obs = model.matrix(~ x + a, data = d_bin),
  prior_only = 0)
f1_cc <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 1000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
  max_treedepth = 10)
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Both chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.1 seconds.
Code
d_post_cc <- data.table(f1_cc$draws(variables = "b", format = "matrix"))
d_post_cc[, desc := "complete case"]

# imputation sets
d_imp <- copy(d[, .(x, a, y, m)])
d_imp[m == 1, y := NA]
d_imp[, m := NULL]
n_imp <- 50
# the mice logreg method requires factors
d_imp[, `:=`(x = factor(x), a = factor(a), y = factor(y))]
l_imp <- mice(d_imp, m = n_imp, method = "logreg", seed = 23109, printFlag = F)

d_post_imp <- rbindlist(mclapply(1:n_imp, function(i){
  # pick up the i-th imputed data set
  d_cur <- data.table(complete(l_imp, i))
  d_cur[, `:=`(
    x = as.numeric(as.character(x)),
    a = as.numeric(as.character(a)),
    y = as.numeric(as.character(y)))]
  d_bin <- d_cur[, .(y = sum(y), n = .N), keyby = .(x, a)]
  # full (imputed) data
  ld <- list(
    N_obs = nrow(d_bin),
    y = d_bin[, y], n = d_bin[, n], P = 3,
    X_obs = model.matrix(~ x + a, data = d_bin),
    prior_only = 0)
  snk <- capture.output(
    f1_imp <- m1$sample(
      ld, iter_warmup = 1000, iter_sampling = 1000,
      parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
      max_treedepth = 10))
  data.table(f1_imp$draws(variables = "b", format = "matrix"))
}), idcol = "id_imp")
d_post_imp[, desc := "mi"]
Posterior inference on parameters for full data, complete case and MI
Figure 6: Posterior inference on parameters for full data, complete case and MI
Missing not at random
If neither MCAR nor MAR holds, then the data are MNAR. Under MNAR, a missing value tells us something about what the value would have been; the missingness is informative. A classic example is salary reporting: people who are paid more tend not to disclose their salary.
(a) DAG MNAR
(b) Model MNAR
Figure 7: MNAR
Code
d <- get_data(1e5, par,
  f_y = function(par, x, a, ae){
    eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a
    eta
  })
# missingness is a function of observed covariates and the outcome
f_mnar <- function(par, y, x, a){
  eta <- -1 + 0.5*x + 0.5*y
  eta
}
d[, eta_m := f_mnar(par, y, x, a)]
d[, m := rbinom(.N, 1, plogis(eta_m))]
The probability of the outcome being missing now depends on the value that would have been observed as well as on the exposure level, with a higher chance of missingness for people who would have experienced treatment failure (\(y = 1\)).
Code
d_tbl <- CJ(y = 0:1, x = 0:1, a = 0:1)
d_tbl[, p_mis := plogis(f_mnar(par, y, x, a))]
knitr::kable(
  rbind(dcast(d_tbl, x + a ~ y, value.var = "p_mis")),
  col.names = c("x", "a", "trt.success", "trt.failure"),
  digits = 2)
| x | a | trt.success | trt.failure |
|---|---|-------------|-------------|
| 0 | 0 | 0.27        | 0.38        |
| 0 | 1 | 0.27        | 0.38        |
| 1 | 0 | 0.38        | 0.50        |
| 1 | 1 | 0.38        | 0.50        |
As a consequence, the distribution of the outcome is no longer the same in the observed and the missing data.
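A quick numerical check (a sketch run on the simulated data above) shows the shift in the mean response between observed and missing rows:
Code
# under MNAR the mean response differs between observed and missing rows
d[, .(pr_y = mean(y), n = .N), keyby = m]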
Observed mean response stratified by missingness status
d_fig <- copy(d)
d_fig[, `:=`(
  x = factor(x, labels = c("ctl", "test")),
  a = factor(a, labels = c("<50", ">=50")),
  ae = factor(ae, labels = c("no-ae", "ae")),
  m = factor(m, labels = c("obs", "mis")))]
d_fig <- d_fig[, .(y = sum(y), n = .N), keyby = .(x, a, m)]
d_fig[, eta_obs := qlogis(y / n)]
ggplot(d_fig, aes(x = x, y = eta_obs)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_x_discrete("Exposure") +
  scale_y_continuous("log-odds response") +
  facet_grid(a ~ m)
Figure 8: Observed mean response
Now, even after we have conditioned on \(x\), a dependency exists between \(y\) and \(m\). For example, people who have early signs of treatment failure (or success) might choose to leave the study. Under this setting, the estimate of the treatment effect will generally be biased.
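Mirroring the earlier MAR check (a sketch conditioning on the single stratum \(x = 1\), \(a = 1\)), the stratified test now indicates dependence between \(y\) and \(m\):
Code
# conditional on the covariates, y and m remain dependent under MNAR
d_tbl <- table(d[x == 1 & a == 1, .(y, m)])
chisq.test(d_tbl)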
Code
d_m <- d[, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_ref <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_ref <- summary(f1_ref)$coef
# only select the observed cases
d_m <- d[m == 0, .(y = sum(y), n = .N), keyby = .(x, a)]
f1_cc <- glm(cbind(y, n-y) ~ x + a, data = d_m, family = binomial)
coef_f1_cc <- summary(f1_cc)$coef
d_tbl <- rbind(
  data.table(desc = "full data", par = rownames(coef_f1_ref), coef_f1_ref),
  data.table(desc = "complete case", par = rownames(coef_f1_cc), coef_f1_cc))
d_tbl[, mu_se := sprintf("%.2f (%.3f)", Estimate, `Std. Error`)]
d_tbl[, par := factor(par, levels = c("(Intercept)", "x", "a"))]
knitr::kable(rbind(dcast(d_tbl, desc ~ par, value.var = "mu_se")))
| desc          | (Intercept)  | x            | a            |
|---------------|--------------|--------------|--------------|
| complete case | 0.24 (0.017) | 0.24 (0.017) | 0.17 (0.018) |
| full data     | 0.40 (0.014) | 0.32 (0.013) | 0.18 (0.014) |
Again, we can try multiple imputation, but because MI assumes MAR, things go wrong without some additional assumptions.
Code
m1 <- cmdstanr::cmdstan_model("stan/missing-ex-01.stan")
d <- get_data(1e5, par,
  f_y = function(par, x, a, ae){
    eta <- par$b_out[1] + par$b_out[2]*x + par$b_out[3]*a
    eta
  })
# missingness is a function of observed covariates and the outcome
f_mnar <- function(par, y, x, a){
  eta <- -1 + 0.5*x + 0.5*y
  eta
}
d[, eta_m := f_mnar(par, y, x, a)]
d[, m := rbinom(.N, 1, plogis(eta_m))]
d_bin <- d[, .(y = sum(y), n = .N), keyby = .(x, a)]
# full data
ld <- list(
  N_obs = nrow(d_bin),
  y = d_bin[, y], n = d_bin[, n], P = 3,
  X_obs = model.matrix(~ x + a, data = d_bin),
  prior_only = 0)
f1_ref <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 1000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
  max_treedepth = 10)
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Both chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.1 seconds.
Code
d_post_ref <- data.table(f1_ref$draws(variables = "b", format = "matrix"))
d_post_ref[, desc := "full data"]

# complete case
d_bin <- d[m == 0, .(y = sum(y), n = .N), keyby = .(x, a)]
ld <- list(
  N_obs = nrow(d_bin),
  y = d_bin[, y], n = d_bin[, n], P = 3,
  X_obs = model.matrix(~ x + a, data = d_bin),
  prior_only = 0)
f1_cc <- m1$sample(
  ld, iter_warmup = 1000, iter_sampling = 1000,
  parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
  max_treedepth = 10)
Running MCMC with 2 parallel chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Both chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.1 seconds.
Code
d_post_cc <- data.table(f1_cc$draws(variables = "b", format = "matrix"))
d_post_cc[, desc := "complete case"]

# imputation sets
d_imp <- copy(d[, .(x, a, y, m)])
d_imp[m == 1, y := NA]
d_imp[, m := NULL]
n_imp <- 50
# the mice logreg method requires factors
d_imp[, `:=`(x = factor(x), a = factor(a), y = factor(y))]
l_imp <- mice(d_imp, m = n_imp, method = "logreg", seed = 23109, printFlag = F)

d_post_imp <- rbindlist(mclapply(1:n_imp, function(i){
  # pick up the i-th imputed data set
  d_cur <- data.table(complete(l_imp, i))
  d_cur[, `:=`(
    x = as.numeric(as.character(x)),
    a = as.numeric(as.character(a)),
    y = as.numeric(as.character(y)))]
  d_bin <- d_cur[, .(y = sum(y), n = .N), keyby = .(x, a)]
  # full (imputed) data
  ld <- list(
    N_obs = nrow(d_bin),
    y = d_bin[, y], n = d_bin[, n], P = 3,
    X_obs = model.matrix(~ x + a, data = d_bin),
    prior_only = 0)
  snk <- capture.output(
    f1_imp <- m1$sample(
      ld, iter_warmup = 1000, iter_sampling = 1000,
      parallel_chains = 2, chains = 2, refresh = 0, show_exceptions = F,
      max_treedepth = 10))
  data.table(f1_imp$draws(variables = "b", format = "matrix"))
}), idcol = "id_imp")
d_post_imp[, desc := "mi"]
Posterior inference on parameters for full data, complete case and MI
Figure 9: Posterior inference on parameters for full data, complete case and MI
A complete case analysis will be unbiased for the treatment effect of interest when the missingness does not depend on that term. However, the parameter estimates for the baseline log-odds (the intercept) and for any term on which the missingness does depend will be biased.
These results are specific to logistic regression (or, more accurately, to odds ratios, which are relative measures). They contrast with linear regression, where the treatment effect from a complete case analysis will be biased whenever the missingness depends on the outcome, regardless of any adjustment: bias arises if the missingness is due to (1) the outcome alone, (2) the outcome and the treatment exposure, or (3) the outcome and baseline factors. The linear regression estimate is unbiased only under MCAR or when the missingness depends only on the exposure.
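A minimal sketch of the linear regression case (not part of the analysis above; the coefficients and missingness model are illustrative only), where missingness depends on the outcome alone and the complete case slope is attenuated:
Code
set.seed(1)
N <- 1e5
x <- rbinom(N, 1, 0.5)
y <- 1 + 0.5*x + rnorm(N)
# larger outcomes are more likely to be missing
m <- rbinom(N, 1, plogis(-1 + 0.5*y))
coef(lm(y ~ x))                  # full data - slope approx 0.5
coef(lm(y ~ x, subset = m == 0)) # complete case - slope biased towards zero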