Just some brief notes/examples to act as basics stats reminders.
Example - LOTP/LOTE
Assume the following true joint distribution for some outcome \(Y\) and its distribution conditional on sex, smoking and vegetarianism.
Code
set.seed (1 )
d_tru <- CJ (sex = 0 : 1 , smk = 0 : 1 , veg = 0 : 1 )
g <- function (sex, smk, veg){
- 1 + 2 * sex - 1 * smk + 3 * veg +
- 1 * sex * smk - 3 * sex * veg + 1 * smk * veg +
0.5 * sex * smk * veg
}
# probability of group membership
q <- rnorm (nrow (d_tru))
d_tru[, p_grp : = exp (q)/ sum (exp (q))]
# probability of outcome
d_tru[, p_y : = plogis (g (sex,smk,veg))]
d_tru
Key: <sex, smk, veg>
sex smk veg p_grp p_y
<int> <int> <int> <num> <num>
1: 0 0 0 0.04225019 0.2689414
2: 0 0 1 0.09498377 0.8807971
3: 0 1 0 0.03427561 0.1192029
4: 0 1 1 0.38968687 0.8807971
5: 1 0 0 0.10989996 0.7310586
6: 1 0 1 0.03479920 0.7310586
7: 1 1 0 0.12870098 0.2689414
8: 1 1 1 0.16540341 0.6224593
Take a random sample from the population:
Code
n <- 1e7
i <- sample (1 : nrow (d_tru), n, T, prob = d_tru$ p_grp)
d <- d_tru[i]
d[, y : = rbinom (.N, 1 , p_y)]
Recover the distribution of the covariates (the p_grp
)
Code
# recover distribution of covariates
d[, .(.N/ nrow (d)), keyby = .(sex, smk, veg)]
Key: <sex, smk, veg>
sex smk veg V1
<int> <int> <int> <num>
1: 0 0 0 0.0421805
2: 0 0 1 0.0950815
3: 0 1 0 0.0343139
4: 0 1 1 0.3898065
5: 1 0 0 0.1096864
6: 1 0 1 0.0348148
7: 1 1 0 0.1288739
8: 1 1 1 0.1652425
Recover the joint distribution of the outcome (the p_y
)
Code
# recover distribution of outcome
d[, .(mu_y = mean (y)), keyby = .(sex, smk, veg)]
Key: <sex, smk, veg>
sex smk veg mu_y
<int> <int> <int> <num>
1: 0 0 0 0.2693899
2: 0 0 1 0.8812230
3: 0 1 0 0.1189314
4: 0 1 1 0.8805292
5: 1 0 0 0.7309010
6: 1 0 1 0.7311000
7: 1 1 0 0.2692407
8: 1 1 1 0.6217244
Estimate the probability of outcome by sex
Code
d[, .(mu_y = mean (y)), keyby = sex]
Key: <sex>
sex mu_y
<int> <num>
1: 0 0.7881758
2: 1 0.5541419
We are interested in \(\mathbb{E}[y | sex]\) . Can we get to this via iterated expectation/LOTE?
Think - is all we need LOTP (with extra conditioning), e.g.
\[
\begin{aligned}
Pr(y | sex = s) &= Pr(y | sex = s, smk = 0, veg = 0) Pr(smk = 0, veg = 0|sex = s) + \\
& \quad Pr(y | sex = s, smk = 0, veg = 1) Pr(smk = 0, veg = 1|sex = s) + \\
& \quad Pr(y | sex = s, smk = 1, veg = 0) Pr(smk = 1, veg = 0|sex = s) + \\
& \quad Pr(y | sex = s, smk = 1, veg = 1) Pr(smk = 1, veg = 1|sex = s)
\end{aligned}
\tag{1}\]
Operationalised, based on the known distributions, we get:
Code
c (
"Pr(Y | sex = 0)" = sum (d_tru[sex == 0 , p_y] * d_tru[sex == 0 , p_grp / sum (p_grp)] ),
"Pr(Y | sex = 1)" = sum (d_tru[sex == 1 , p_y] * d_tru[sex == 1 , p_grp / sum (p_grp)] )
)
Pr(Y | sex = 0) Pr(Y | sex = 1)
0.7882179 0.5545841
And which can be estimated from the simulated data:
Code
c (
"Pr(Y | sex = 0)" =
sum (d[sex == 0 , mean (y)] *
d[sex == 0 , .(p_grp = .N/ nrow (d[sex == 0 ])), keyby = .(smk, veg)]$ p_grp ),
"Pr(Y | sex = 1)" =
sum (d[sex == 1 , mean (y)] *
d[sex == 1 , .(p_grp = .N/ nrow (d[sex == 1 ])), keyby = .(smk, veg)]$ p_grp)
)
Pr(Y | sex = 0) Pr(Y | sex = 1)
0.7881758 0.5541419
Take Equation 1 , multiply both sides by \(y\) and then sum over all \(y\)
\[
\begin{aligned}
\sum_y y Pr(y | sex = s) &= \sum_y y Pr(y | sex = s, smk = 0, veg = 0) Pr(smk = 0, veg = 0|sex = s) + \\
& \quad \sum_y y Pr(y | sex = s, smk = 0, veg = 1) Pr(smk = 0, veg = 1|sex = s) + \\
& \quad \sum_y y Pr(y | sex = s, smk = 1, veg = 0) Pr(smk = 1, veg = 0|sex = s) + \\
& \quad \sum_y y Pr(y | sex = s, smk = 1, veg = 1) Pr(smk = 1, veg = 1|sex = s)
\end{aligned}
\tag{2}\]
which (I think) can be re-stated as
\[
\begin{aligned}
\mathbb{E}(Y|sex=s) &= \mathbb{E}(Y|sex = s, smk = 0, veg = 0) Pr(smk = 0, veg = 0|sex = s) + \\
& \quad \mathbb{E}(Y|sex = s, smk = 0, veg = 1) Pr(smk = 0, veg = 1|sex = s) + \\
& \quad \mathbb{E}(Y|sex = s, smk = 1, veg = 0) Pr(smk = 1, veg = 0|sex = s) + \\
& \quad \mathbb{E}(Y|sex = s, smk = 1, veg = 1) Pr(smk = 1, veg = 1|sex = s) \\
&= \sum_{smk,veg} \mathbb{E}(Y|sex = s, smk, veg) Pr(smk, veg|sex = s)
\end{aligned}
\tag{3}\]
also note (just in case one is easier to determine than another):
\[
Pr(smk, veg|sex) = Pr(smk|veg, sex) Pr(veg | sex)
\]
so (perhaps) alternatively:
\[
\begin{aligned}
\mathbb{E}(Y|sex=s) &= \sum_{smk,veg} \mathbb{E}(Y|sex = s, smk, veg) Pr(smk|veg, sex = s) Pr(veg | sex = s)
\end{aligned}
\tag{4}\]