Reminders

Published

June 20, 2025

Modified

May 6, 2025

Warning: package 'cmdstanr' was built under R version 4.4.3

Just some brief notes/examples to act as basics stats reminders.

Example - LOTP/LOTE

Assume the following true joint distribution for some outcome \(Y\) and its distribution conditional on sex, smoking and vegetarianism.

Code

set.seed(1)
d_tru <- CJ(sex = 0:1, smk = 0:1, veg = 0:1)
g <- function(sex, smk, veg){
  -1 + 2 * sex - 1 * smk + 3 * veg +
    -1 * sex * smk - 3 * sex * veg + 1 * smk * veg +
    0.5 * sex * smk * veg
}
# probability of group membership
q <- rnorm(nrow(d_tru))
d_tru[, p_grp := exp(q)/sum(exp(q))]
# probability of outcome
d_tru[, p_y := plogis(g(sex,smk,veg))]
d_tru

Key: <sex, smk, veg>
     sex   smk   veg      p_grp       p_y
   <int> <int> <int>      <num>     <num>
1:     0     0     0 0.04225019 0.2689414
2:     0     0     1 0.09498377 0.8807971
3:     0     1     0 0.03427561 0.1192029
4:     0     1     1 0.38968687 0.8807971
5:     1     0     0 0.10989996 0.7310586
6:     1     0     1 0.03479920 0.7310586
7:     1     1     0 0.12870098 0.2689414
8:     1     1     1 0.16540341 0.6224593

Take a random sample from the population:

Code

n <- 1e7
i <- sample(1:nrow(d_tru), n, T, prob = d_tru$p_grp)
d <- d_tru[i]
d[, y := rbinom(.N, 1, p_y)]

Recover the distribution of the covariates (the p_grp)

Code

# recover distribution of covariates
d[, .(.N/nrow(d)), keyby = .(sex, smk, veg)]

Key: <sex, smk, veg>
     sex   smk   veg        V1
   <int> <int> <int>     <num>
1:     0     0     0 0.0421805
2:     0     0     1 0.0950815
3:     0     1     0 0.0343139
4:     0     1     1 0.3898065
5:     1     0     0 0.1096864
6:     1     0     1 0.0348148
7:     1     1     0 0.1288739
8:     1     1     1 0.1652425

Recover the joint distribution of the outcome (the p_y)

Code

# recover distribution of outcome
d[, .(mu_y = mean(y)), keyby = .(sex, smk, veg)]

Key: <sex, smk, veg>
     sex   smk   veg      mu_y
   <int> <int> <int>     <num>
1:     0     0     0 0.2693899
2:     0     0     1 0.8812230
3:     0     1     0 0.1189314
4:     0     1     1 0.8805292
5:     1     0     0 0.7309010
6:     1     0     1 0.7311000
7:     1     1     0 0.2692407
8:     1     1     1 0.6217244

Estimate the probability of outcome by sex

Code

d[, .(mu_y = mean(y)), keyby = sex]

Key: <sex>
     sex      mu_y
   <int>     <num>
1:     0 0.7881758
2:     1 0.5541419

We are interested in \(\mathbb{E}[y | sex]\). Can we get to this via iterated expectation/LOTE?

Think - is all we need LOTP (with extra conditioning), e.g.

\[ \begin{aligned} Pr(y | sex = s) &= Pr(y | sex = s, smk = 0, veg = 0) Pr(smk = 0, veg = 0|sex = s) + \\ & \quad Pr(y | sex = s, smk = 0, veg = 1) Pr(smk = 0, veg = 1|sex = s) + \\ & \quad Pr(y | sex = s, smk = 1, veg = 0) Pr(smk = 1, veg = 0|sex = s) + \\ & \quad Pr(y | sex = s, smk = 1, veg = 1) Pr(smk = 1, veg = 1|sex = s) \end{aligned} \tag{1}\]

Operationalised, based on the known distributions, we get:

Code

c(
  "Pr(Y | sex = 0)" = sum(d_tru[sex == 0, p_y] * d_tru[sex == 0, p_grp / sum(p_grp)] ),
  "Pr(Y | sex = 1)" = sum(d_tru[sex == 1, p_y] * d_tru[sex == 1, p_grp / sum(p_grp)] )
)

Pr(Y | sex = 0) Pr(Y | sex = 1) 
      0.7882179       0.5545841

And which can be estimated from the simulated data:

Code

c(
  "Pr(Y | sex = 0)" = 
    sum(d[sex == 0, mean(y)] * 
          d[sex == 0, .(p_grp = .N/nrow(d[sex == 0])), keyby = .(smk, veg)]$p_grp ),
  "Pr(Y | sex = 1)" = 
    sum(d[sex == 1, mean(y)] * 
          d[sex == 1, .(p_grp = .N/nrow(d[sex == 1])), keyby = .(smk, veg)]$p_grp)
)

Pr(Y | sex = 0) Pr(Y | sex = 1) 
      0.7881758       0.5541419

Take Equation 1, multiply both sides by \(y\) and then sum over all \(y\)

\[ \begin{aligned} \sum_y y Pr(y | sex = s) &= \sum_y y Pr(y | sex = s, smk = 0, veg = 0) Pr(smk = 0, veg = 0|sex = s) + \\ & \quad \sum_y y Pr(y | sex = s, smk = 0, veg = 1) Pr(smk = 0, veg = 1|sex = s) + \\ & \quad \sum_y y Pr(y | sex = s, smk = 1, veg = 0) Pr(smk = 1, veg = 0|sex = s) + \\ & \quad \sum_y y Pr(y | sex = s, smk = 1, veg = 1) Pr(smk = 1, veg = 1|sex = s) \end{aligned} \tag{2}\]

which (I think) can be re-stated as

\[ \begin{aligned} \mathbb{E}(Y|sex=s) &= \mathbb{E}(Y|sex = s, smk = 0, veg = 0) Pr(smk = 0, veg = 0|sex = s) + \\ & \quad \mathbb{E}(Y|sex = s, smk = 0, veg = 1) Pr(smk = 0, veg = 1|sex = s) + \\ & \quad \mathbb{E}(Y|sex = s, smk = 1, veg = 0) Pr(smk = 1, veg = 0|sex = s) + \\ & \quad \mathbb{E}(Y|sex = s, smk = 1, veg = 1) Pr(smk = 1, veg = 1|sex = s) \\ &= \sum_{smk,veg} \mathbb{E}(Y|sex = s, smk, veg) Pr(smk, veg|sex = s) \end{aligned} \tag{3}\]

also note¹ (just in case one is easier to determine than another):

\[ Pr(smk, veg|sex) = Pr(smk|veg, sex) Pr(veg | sex) \]

so (perhaps) alternatively:

\[ \begin{aligned} \mathbb{E}(Y|sex=s) &= \sum_{smk,veg} \mathbb{E}(Y|sex = s, smk, veg) Pr(smk|veg, sex = s) Pr(veg | sex = s) \end{aligned} \tag{4}\]

Footnotes

For the proof, consider \(P(Y,X|Z) = \frac{Pr(X,Y,Z)}{Pr(Z)}\), \(P(Y|X,Z) = \frac{Pr(X,Y,Z)}{Pr(X,Z)}\) and \(Pr(X|Z) = \frac{Pr(X,Z)}{Pr(Z)}\).↩︎