Statistical Computing for Data Analysis
So far, we have mostly discussed permutation and bootstrap for iid rows.
But many datasets have additional structure:
The hard part is usually not writing the loop.
The hard part is deciding what the resampling unit should be.
To choose a valid permutation or bootstrap method, we need to ask:
If we ignore the design of the study or the dependence structure in the data, a resampling method can answer the wrong question.
Ordinary resampling works well when observations can be treated as interchangeable individual rows.
But in many settings, the data have additional structure that must be preserved.
Sometimes the resampling scheme should reflect how the data were collected or how treatment was assigned.
Sometimes the resampling scheme should reflect which observations are dependent.
The resampling unit should match the unit of exchangeability or independence.
If treatment was randomized within each block, then under the null hypothesis the treatment labels are exchangeable within block.
This preserves the original blocking structure.
set.seed(4279)
block_dat <- data.frame(
block = rep(1:8, each = 2),
trt = rep(c("A", "B"), times = 8),
y = c(12.4, 11.8,
10.1, 9.6,
13.0, 12.2,
9.5, 8.9,
11.7, 11.1,
14.2, 13.4,
10.8, 10.0,
12.9, 12.0)
)
# observed statistic: average within-block treated-minus-control difference
block_diff <- function(dat) {
num_blocks <- length(unique(dat$block))
diffs <- numeric(num_blocks) ## pre-allocate vector
blocks <- sort(unique(dat$block))
for (j in seq_along(blocks)) {
dsub <- dat[dat$block == blocks[j], ]
diffs[j] <- dsub$y[dsub$trt == "A"] - dsub$y[dsub$trt == "B"]
}
mean(diffs)
}
T_obs <- block_diff(block_dat)
T_obs## [1] 0.7
### Now let's get the permutation distribution (under the H0, there is no difference between treatment and controls)
B <- 3000
T_perm <- numeric(B)
blocks <- unique(block_dat$block)
for (b in 1:B) {
dat_perm <- block_dat
for (g in blocks) {
idx <- which(dat_perm$block == g)
dat_perm$trt[idx] <- sample(dat_perm$trt[idx])
}
T_perm[b] <- block_diff(dat_perm)
}
hist(T_perm,
main = "Block Permutation Distribution",
xlab = "Permuted statistic",
col = "lightgray",
border = "white")
abline(v = T_obs, lwd = 2, lty = 2)## [1] 0.007
aggregate(): a very useful base R toolMany structured resampling methods need us to collapse row-level data to a higher level such as:
Base R has a built-in function for this:
aggregate().
## trt y
## 1 A 11.825
## 2 B 11.125
## block trt y
## 1 1 A 12.4
## 2 2 A 10.1
## 3 3 A 13.0
## 4 4 A 9.5
## 5 5 A 11.7
## 6 6 A 14.2
## 7 7 A 10.8
## 8 8 A 12.9
## 9 1 B 11.8
## 10 2 B 9.6
## 11 3 B 12.2
## 12 4 B 8.9
## 13 5 B 11.1
## 14 6 B 13.4
## 15 7 B 10.0
## 16 8 B 12.0
y ~ trt means: summarize y separately for
each treatment groupy ~ block + trt means: summarize y for
each block-treatment combinationYou will often use aggregate() before permutation or
bootstrap when the observational unit is not the row of the data
frame.
Now suppose treatment was assigned at the cluster or block level.
Examples:
Then the permutation unit is the cluster.
If we permute individual rows instead, we destroy the original randomization design.
## school trt score
## 1 1 control 70.24189
## 2 1 control 71.03897
## 3 1 control 69.86349
## 4 1 control 70.93158
## 5 1 control 68.51149
## 6 1 control 69.96802
## 'data.frame': 72 obs. of 3 variables:
## $ school: int 1 1 1 1 1 1 1 1 1 1 ...
## $ trt : chr "control" "control" "control" "control" ...
## $ score : num 70.2 71 69.9 70.9 68.5 ...
## collapse to school means
school_means <- aggregate(score ~ school + trt, data = school_dat, FUN = mean)
prog_scores <- school_means$score[school_means$trt == "program"]
ctrl_scores <- school_means$score[school_means$trt == "control"]
T_obs <- mean(prog_scores) - mean(ctrl_scores)
# exact cluster permutations
all_prog_sets <- combn(1:nrow(school_means), 3)
T_perm_exact <- numeric(ncol(all_prog_sets))
for (b in 1:ncol(all_prog_sets)) {
prog_idx <- all_prog_sets[, b]
ctrl_idx <- setdiff(1:nrow(school_means), prog_idx)
T_perm_exact[b] <- mean(school_means$score[prog_idx]) -
mean(school_means$score[ctrl_idx])
}
sort(T_perm_exact)## [1] -3.8177558 -2.1355350 -1.9421496 -1.7005620 -1.3379599 -1.1445745
## [7] -0.9726192 -0.9029870 -0.7792338 -0.5376463 0.5376463 0.7792338
## [13] 0.9029870 0.9726192 1.1445745 1.3379599 1.7005620 1.9421496
## [19] 2.1355350 3.8177558
hist(T_perm_exact,
main = "Exact Cluster Permutation Distribution",
xlab = "Permuted difference in cluster means",
col = "lightgray",
border = "white")
abline(v = 0, lwd = 2, col = "blue")
abline(v = T_obs, lwd = 2, lty = 2)## [1] 0.1
When observations within a cluster are dependent, the bootstrap unit should usually be the cluster.
This mimics repeated sampling of independent clusters.
set.seed(321)
# 1. Pre-aggregate to cluster means
cl_means_df <- aggregate(score ~ school + trt, data = school_dat, FUN = mean)
ctrl_means <- cl_means_df$score[cl_means_df$trt == "control"]
prog_means <- cl_means_df$score[cl_means_df$trt == "program"]
# 2. Initialize storage
B <- 2000
T_boot <- numeric(B)
# 3. Stratified Bootstrap Loop
set.seed(321) # for reproducibility
for (b in 1:B) {
# Sample with replacement within each group
resample_ctrl <- sample(ctrl_means, size = length(ctrl_means), replace = TRUE)
resample_prog <- sample(prog_means, size = length(prog_means), replace = TRUE)
# Calculate and store the difference in means
T_boot[b] <- mean(resample_prog) - mean(resample_ctrl)
}
hist(T_boot,
main = "Cluster Bootstrap Distribution",
xlab = "Bootstrap difference in cluster means",
col = "lightgray",
border = "white")
abline(v = T_obs, lwd = 2, lty = 2)## [1] 0.4504037
## 2.5% 97.5%
## 2.83751 4.61670
A structured resampling method can fail if we choose the wrong unit.
Always ask: what was the independent unit in the design or sampling process?
Permutation and bootstrap still follow the same logic as before.
The difference is that the resampling unit must match the data structure.