Structured Resampling: Blocks and Clusters

Statistical Computing for Data Analysis

Structured resampling

So far, we have mostly discussed permutation and bootstrap for iid rows.

But many datasets have additional structure:

Main idea

The hard part is usually not writing the loop.

The hard part is deciding what the resampling unit should be.

To choose a valid permutation or bootstrap method, we need to ask:

If we ignore the design of the study or the dependence structure in the data, a resampling method can answer the wrong question.

When ordinary resampling fails

Ordinary resampling works well when observations can be treated as interchangeable individual rows.

But in many settings, the data have additional structure that must be preserved.

Design structure

Sometimes the resampling scheme should reflect how the data were collected or how treatment was assigned.

Dependence structure

Sometimes the resampling scheme should reflect which observations are dependent.

Takeaway

The resampling unit should match the unit of exchangeability or independence.

Block permutation

If treatment was randomized within each block, then under the null hypothesis the treatment labels are exchangeable within block.

This preserves the original blocking structure.

Block permutation: example in R

set.seed(4279)

block_dat <- data.frame(
  block = rep(1:8, each = 2),
  trt = rep(c("A", "B"), times = 8),
  y = c(12.4, 11.8,
        10.1,  9.6,
        13.0, 12.2,
         9.5,  8.9,
        11.7, 11.1,
        14.2, 13.4,
        10.8, 10.0,
        12.9, 12.0)
)

# observed statistic: average within-block treated-minus-control difference
block_diff <- function(dat) {
  num_blocks <- length(unique(dat$block))
  diffs <- numeric(num_blocks) ## pre-allocate vector 
  blocks <- sort(unique(dat$block))

  for (j in seq_along(blocks)) {
    dsub <- dat[dat$block == blocks[j], ]
    diffs[j] <- dsub$y[dsub$trt == "A"] - dsub$y[dsub$trt == "B"]
  }

  mean(diffs)
}

T_obs <- block_diff(block_dat)
T_obs
## [1] 0.7

Permutation Null Distribution

### Now let's get the permutation distribution (under the H0, there is no difference between treatment and controls)
B <- 3000
T_perm <- numeric(B)
blocks <- unique(block_dat$block)

for (b in 1:B) {
  dat_perm <- block_dat

  for (g in blocks) {
    idx <- which(dat_perm$block == g)
    dat_perm$trt[idx] <- sample(dat_perm$trt[idx])
  }

  T_perm[b] <- block_diff(dat_perm)
}

hist(T_perm,
     main = "Block Permutation Distribution",
     xlab = "Permuted statistic",
     col = "lightgray",
     border = "white")
abline(v = T_obs, lwd = 2, lty = 2)

mean(abs(T_perm) >= abs(T_obs))
## [1] 0.007

aggregate(): a very useful base R tool

Many structured resampling methods need us to collapse row-level data to a higher level such as:

Base R has a built-in function for this: aggregate().

Common forms

# one grouping variable
aggregate(y ~ trt, data = block_dat, FUN = mean)
##   trt      y
## 1   A 11.825
## 2   B 11.125
# more than one grouping variable
aggregate(y ~ block + trt, data = block_dat, FUN = mean)
##    block trt    y
## 1      1   A 12.4
## 2      2   A 10.1
## 3      3   A 13.0
## 4      4   A  9.5
## 5      5   A 11.7
## 6      6   A 14.2
## 7      7   A 10.8
## 8      8   A 12.9
## 9      1   B 11.8
## 10     2   B  9.6
## 11     3   B 12.2
## 12     4   B  8.9
## 13     5   B 11.1
## 14     6   B 13.4
## 15     7   B 10.0
## 16     8   B 12.0

Interpretation

You will often use aggregate() before permutation or bootstrap when the observational unit is not the row of the data frame.

Cluster permutation: idea

Now suppose treatment was assigned at the cluster or block level.

Examples:

Then the permutation unit is the cluster.

Cluster permutation

If we permute individual rows instead, we destroy the original randomization design.

Cluster permutation: example in R

head(school_dat)
##   school     trt    score
## 1      1 control 70.24189
## 2      1 control 71.03897
## 3      1 control 69.86349
## 4      1 control 70.93158
## 5      1 control 68.51149
## 6      1 control 69.96802
str(school_dat)
## 'data.frame':    72 obs. of  3 variables:
##  $ school: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ trt   : chr  "control" "control" "control" "control" ...
##  $ score : num  70.2 71 69.9 70.9 68.5 ...
## collapse to school means
school_means <- aggregate(score ~ school + trt, data = school_dat, FUN = mean)

prog_scores <- school_means$score[school_means$trt == "program"]
ctrl_scores <- school_means$score[school_means$trt == "control"]

T_obs <- mean(prog_scores) - mean(ctrl_scores)

# exact cluster permutations
all_prog_sets <- combn(1:nrow(school_means), 3)

T_perm_exact <- numeric(ncol(all_prog_sets))

for (b in 1:ncol(all_prog_sets)) {
  prog_idx <- all_prog_sets[, b]
  ctrl_idx <- setdiff(1:nrow(school_means), prog_idx)

  T_perm_exact[b] <- mean(school_means$score[prog_idx]) -
                     mean(school_means$score[ctrl_idx])
}

sort(T_perm_exact)
##  [1] -3.8177558 -2.1355350 -1.9421496 -1.7005620 -1.3379599 -1.1445745
##  [7] -0.9726192 -0.9029870 -0.7792338 -0.5376463  0.5376463  0.7792338
## [13]  0.9029870  0.9726192  1.1445745  1.3379599  1.7005620  1.9421496
## [19]  2.1355350  3.8177558
hist(T_perm_exact,
     main = "Exact Cluster Permutation Distribution",
     xlab = "Permuted difference in cluster means",
     col = "lightgray",
     border = "white")
abline(v = 0, lwd = 2, col = "blue")
abline(v = T_obs, lwd = 2, lty = 2)

mean(abs(T_perm_exact) >= abs(T_obs))
## [1] 0.1

Cluster bootstrap: idea

When observations within a cluster are dependent, the bootstrap unit should usually be the cluster.

Cluster bootstrap

This mimics repeated sampling of independent clusters.

Typical use cases

Cluster bootstrap: example in R

set.seed(321)

# 1. Pre-aggregate to cluster means
cl_means_df <- aggregate(score ~ school + trt, data = school_dat, FUN = mean)
ctrl_means  <- cl_means_df$score[cl_means_df$trt == "control"]
prog_means  <- cl_means_df$score[cl_means_df$trt == "program"]

# 2. Initialize storage
B <- 2000
T_boot <- numeric(B)

# 3. Stratified Bootstrap Loop
set.seed(321) # for reproducibility
for (b in 1:B) {
  # Sample with replacement within each group
  resample_ctrl <- sample(ctrl_means, size = length(ctrl_means), replace = TRUE)
  resample_prog <- sample(prog_means, size = length(prog_means), replace = TRUE)
  
  # Calculate and store the difference in means
  T_boot[b] <- mean(resample_prog) - mean(resample_ctrl)
}

hist(T_boot,
     main = "Cluster Bootstrap Distribution",
     xlab = "Bootstrap difference in cluster means",
     col = "lightgray",
     border = "white")
abline(v = T_obs, lwd = 2, lty = 2)

sd(T_boot)
## [1] 0.4504037
quantile(T_boot, c(0.025, 0.975))
##    2.5%   97.5% 
## 2.83751 4.61670

Common mistakes

A structured resampling method can fail if we choose the wrong unit.

Examples

Takeaway

Always ask: what was the independent unit in the design or sampling process?

Big picture

Permutation and bootstrap still follow the same logic as before.

The difference is that the resampling unit must match the data structure.