Statistical Computing for Data Analysis
So far, permutation has been introduced as a way to approximate a null distribution when we do not know, or do not want to assume, the data's distribution under \(H_0\).
Next, we look at settings where the test statistic is more complex than a difference in means, such as a sample correlation or a difference in medians.
The logic is still the same: under \(H_0\), some aspect of the data is exchangeable, so we shuffle it, recompute the statistic, and repeat many times to build a null distribution.
Suppose we want to test whether two variables are associated.
A natural test statistic is the sample correlation
\[ T = r(x,y). \]
Under the null hypothesis of no association, the observed pairing between \(x\) and \(y\) should not matter.
So we can shuffle one variable (say \(y\)), recompute the correlation, and repeat many times.
The resulting distribution is a permutation distribution for the correlation under \(H_0\).
set.seed(123)
x <- c(2.1, 3.4, 4.0, 4.8, 5.2, 6.1, 6.5, 7.3, 8.0, 8.7,
       9.1, 9.8, 10.4, 11.2, 11.8)
y <- c(3.2, 4.1, 4.5, 4.9, 5.7, 6.0, 6.8, 7.4, 7.9, 8.8,
       8.9, 9.5, 10.2, 10.8, 11.6)

T_obs <- cor(x, y)
B <- 3000
T_perm <- numeric(B)
for (b in 1:B) {
  idx <- sample(length(y))     # a random permutation of the indices
  y_perm <- y[idx]             # y reordered, breaking the pairing with x
  T_perm[b] <- cor(x, y_perm)
}

hist(T_perm,
     main = "Permutation Distribution of the Correlation",
     xlab = "Permuted correlation",
     col = "lightgray",
     border = "white")
abline(v = T_obs, lwd = 2, lty = 2)

# two-sided permutation p-value
mean(abs(T_perm) >= abs(T_obs))
## [1] 0
Permutation is also useful when the statistic is not one of the usual textbook choices.
For example, if we want to compare two groups but care about a difference in medians, we can use
\[ T = \text{median}(X_1) - \text{median}(X_2). \]
Do we know how to derive the null distribution of a difference in medians from statistical theory? In general, no.
But we can still use permutation: under the null hypothesis of no group difference, the group labels are exchangeable.
So we can permute the labels exactly as before, but now recompute a difference in medians instead of a difference in means.
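Following the same pattern as the correlation example, this test can be sketched as below. The two samples here are made up purely for illustration; only the statistic changes relative to the earlier code.

```r
set.seed(123)

# hypothetical two-group data (invented for this sketch)
x1 <- c(4.8, 5.2, 5.9, 6.3, 6.8, 7.1, 7.5, 8.2)
x2 <- c(3.9, 4.4, 4.7, 5.1, 5.3, 5.8, 6.0, 6.4)
T_obs <- median(x1) - median(x2)

combined <- c(x1, x2)
n1 <- length(x1)
B <- 3000
T_perm <- numeric(B)
for (b in 1:B) {
  shuffled <- sample(combined)            # permute the group labels
  T_perm[b] <- median(shuffled[1:n1]) -   # first n1 values play the role of group 1
               median(shuffled[-(1:n1)])  # the rest play the role of group 2
}

# two-sided permutation p-value
mean(abs(T_perm) >= abs(T_obs))
```

Swapping in any other statistic, for example a trimmed mean or a ratio of variances, requires changing only the two lines that compute it.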
Permutation is flexible because it does not require the test statistic to come from a named classical test.
So far, we have used simulation and permutation to approximate null distributions for hypothesis testing.
But in statistics, we also need distributions for other inferential tasks, such as estimating the standard error of an estimator or constructing a confidence interval.
That leads us to another major resampling method: the bootstrap.
Permutation is mainly used to approximate a null distribution.
Bootstrap is mainly used to approximate the sampling distribution of an estimator.
Treat the observed sample as a stand-in for the population, then repeatedly sample with replacement from the observed data.
Suppose our statistic of interest is \(T\).
The bootstrap procedure is: draw \(B\) resamples of size \(n\) with replacement from the observed data, compute \(T\) on each resample, and collect the \(B\) resulting values.
The resulting values approximate the sampling distribution of the statistic.
The sample median is often harder to study analytically than the sample mean.
Suppose our statistic is
\[ T = \text{median}(X). \]
Suppose we want to estimate its standard error. How could we do that?
We can use the bootstrap to approximate the sampling distribution of the sample median and estimate its standard error.
set.seed(123)
x <- c(5.1, 6.2, 4.9, 5.7, 100.0, 6.0, 5.4, 5.8, 6.1, 5.6)

T_obs <- median(x)
B <- 3000
T_boot <- numeric(B)
n <- length(x)
for (b in 1:B) {
  x_boot <- sample(x, size = n, replace = TRUE)  # resample with replacement
  T_boot[b] <- median(x_boot)
}

hist(T_boot,
     main = "Bootstrap Distribution of the Sample Median",
     xlab = "Bootstrap median",
     col = "lightgray",
     border = "white")
abline(v = T_obs, lwd = 2, lty = 2)

# bootstrap estimate of the standard error
sd(T_boot)
## [1] 1.504813

# 95% percentile interval
quantile(T_boot, c(0.025, 0.975))
##  2.5% 97.5%
##   5.4   6.1
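As a sanity check on this recipe, we can apply it to a statistic whose standard error theory does give us: for the sample mean, \(\widehat{\text{SE}} = s/\sqrt{n}\), and the bootstrap estimate should land close to that formula. A minimal sketch with simulated data (not the data above):

```r
set.seed(123)
x <- rnorm(50, mean = 10, sd = 2)  # simulated data where theory applies

B <- 3000
T_boot <- numeric(B)
n <- length(x)
for (b in 1:B) {
  x_boot <- sample(x, size = n, replace = TRUE)
  T_boot[b] <- mean(x_boot)
}

sd(T_boot)       # bootstrap standard error of the mean
sd(x) / sqrt(n)  # theoretical formula; the two should be close
```

The same loop, with `mean` replaced by `median`, is exactly the example above; the point of the bootstrap is that it works even when no \(s/\sqrt{n}\)-style formula exists.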
We can also bootstrap statistics like the sample correlation.
If our statistic is
\[ T = r(x,y), \]
we can resample the observed \((x_i, y_i)\) pairs with replacement, recompute the correlation on each resample, and repeat.
Note that for the bootstrap we resample the observed units (here, pairs), not the variables separately; resampling \(x\) and \(y\) independently would destroy the very dependence we are trying to study.
set.seed(123)
x <- c(2.1, 3.4, 4.0, 4.8, 5.2, 6.1, 6.5, 7.3, 8.0, 8.7,
       9.1, 9.8, 10.4, 11.2, 11.8)
y <- c(3.2, 4.1, 4.5, 4.9, 5.7, 6.0, 6.8, 7.4, 7.9, 8.8,
       8.9, 9.5, 10.2, 10.8, 11.6)
dat <- data.frame(x = x, y = y)

T_obs <- cor(dat$x, dat$y)
B <- 3000
T_boot <- numeric(B)
n <- nrow(dat)
for (b in 1:B) {
  idx <- sample(1:n, size = n, replace = TRUE)  # resample rows, i.e. pairs
  dat_boot <- dat[idx, ]
  T_boot[b] <- cor(dat_boot$x, dat_boot$y)
}

hist(T_boot,
     main = "Bootstrap Distribution of the Correlation",
     xlab = "Bootstrap correlation",
     col = "lightgray",
     border = "white")
abline(v = T_obs, lwd = 2, lty = 2)

# bootstrap standard error of the correlation
sd(T_boot)
## [1] 0.001172787

# 95% percentile interval
quantile(T_boot, c(0.025, 0.975))
##      2.5%     97.5%
## 0.9947673 0.9991965
Suppose we fit the model
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
and want uncertainty for \(\hat{\beta}_1\).
We can resample rows of the data with replacement, refit the model on each resample, and record \(\hat{\beta}_1\) each time.
This gives a bootstrap approximation to the sampling distribution of \(\hat{\beta}_1\).
library(ISLR2)  # provides the Boston housing data
data(Boston)

fit_obs <- lm(medv ~ lstat, data = Boston)
T_obs <- coef(fit_obs)["lstat"]

B <- 2000
T_boot <- numeric(B)
n <- nrow(Boston)
for (b in 1:B) {
  idx <- sample(1:n, size = n, replace = TRUE)  # resample rows
  boot_dat <- Boston[idx, ]
  fit_boot <- lm(medv ~ lstat, data = boot_dat)
  T_boot[b] <- coef(fit_boot)["lstat"]
}

hist(T_boot,
     main = "Bootstrap Distribution of the Regression Slope",
     xlab = "Bootstrap slope estimate",
     col = "lightgray",
     border = "white")
abline(v = T_obs, lwd = 2, lty = 2)

# bootstrap standard error of the slope
sd(T_boot)
## [1] 0.0494498

# 95% percentile interval
quantile(T_boot, c(0.025, 0.975))
##       2.5%      97.5%
## -1.0521315 -0.8568269
Although permutation and the bootstrap both involve repeated resampling, they answer different questions: each approximates a different distribution used in inference, the null distribution for permutation and the sampling distribution of an estimator for the bootstrap.