Statistical Computing for Data Analysis
So far, we have mostly relied on built-in samplers: rnorm(), rpois(), runif(), … But in practice, you often face one (or more) of these problems: the distribution you need has no built-in sampler, or the event you care about is so rare that naive simulation almost never observes it.
Today: two practical tools for these situations.
Goal: Generate independent draws from a distribution
Goal: Estimate \(E[h(X)]\) or \(P(X \in A)\)
When to use it
You want to simulate from a target density \(f(x)\), but there is no built-in sampler for it (and perhaps you only know \(f\) up to a normalizing constant).
Key idea
Sample from an easier proposal distribution \(g(x)\), then accept a draw with some probability so that accepted values follow the target.
Assume: you can evaluate an unnormalized target \(\tilde f(x)\), you can sample from a proposal \(g(x)\), and you know a constant \(M\) with \(\tilde f(x) \le M\,g(x)\) for all \(x\).
Algorithm: repeat until enough draws have been accepted: (1) draw \(Y \sim g\); (2) draw \(U \sim \mathrm{Unif}(0,1)\); (3) accept \(Y\) if \(U \le \tilde f(Y) / (M\,g(Y))\), otherwise reject and try again.
Practical note: if \(f\) is normalized, the acceptance rate is exactly \(1/M\); with an unnormalized \(\tilde f\) it is \(\int \tilde f(x)\,dx \,/\, M\). Either way, a smaller \(M\) (a tighter envelope) means a higher acceptance rate, which is better.
Target density on \([-2,2]\): \[ f(x) \propto \exp(-x^4), \quad -2 \le x \le 2. \]
Because \(\tilde f(x) = \exp(-x^4) \le 1\) and \(g(x)=1/4\), we can use \(M=4\).
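Before running the sampler, we can numerically sanity-check the envelope condition \(\tilde f(x) \le M\,g(x)\) and predict the acceptance rate (a quick sketch; the grid check is illustrative, not a proof):

```r
f_tilde <- function(x) exp(-x^4)                    # unnormalized target
g_dens  <- function(x) dunif(x, min = -2, max = 2)  # proposal density = 1/4
M <- 4

# Check f_tilde(x) <= M * g(x) on a fine grid over the support
xs <- seq(-2, 2, length.out = 1000)
all(f_tilde(xs) <= M * g_dens(xs))                  # TRUE: envelope holds

# Predicted acceptance rate = (integral of f_tilde) / M, about 0.45 here
integrate(f_tilde, -2, 2)$value / M
```

This prediction matches the empirical acceptance rate of roughly 0.46 reported below.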
set.seed(4279)
f_tilde <- function(x) exp(-x^4) # unnormalized target
g_samp <- function(n) runif(n, min=-2, max=2) # proposal sampler
g_dens <- function(x) dunif(x, min=-2, max=2) # proposal density
rej_samp <- function(n, M = 4) {
out <- numeric(0)
n_try <- 0
while (length(out) < n) {
y <- g_samp(1)
u <- runif(1)
n_try <- n_try + 1
# acceptance probability = f_tilde(y) / (M * g(y))
# Here it simplifies to exp(-y^4).
if (u <= f_tilde(y) / (M * g_dens(y))) out <- c(out, y)
}
list(draws = out, acceptance_rate = n / n_try)
}
sim <- rej_samp(5000)
sim$acceptance_rate
## [1] 0.4593477
x <- seq(-2, 2, length.out = 400)
hist(sim$draws, breaks = 40, freq = FALSE, col = "lightgray",
main = "Rejection Sampling: draws from f(x) ∝ exp(-x^4)",
xlab = "x")
# Overlay a *scaled* version of the unnormalized density (for shape only)
curve(f_tilde(x) / integrate(f_tilde, -2, 2)$value, add = TRUE, lwd = 2, col = "blue")
Choosing the proposal \(g\)
Red flags in practice: a very low acceptance rate (your \(M\) is much larger than necessary), or no finite \(M\) exists at all because the proposal \(g\) has lighter tails than the target.
Takeaway: Rejection sampling is great for generating exact independent draws, but can become inefficient in higher dimensions or with poor proposals.
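To see how a loose envelope hurts, compare empirical acceptance rates for the minimal constant \(M = 4\) and an unnecessarily large \(M = 16\) (a small sketch reusing the example's target; the helper `rej_rate` is introduced here just for illustration):

```r
set.seed(1)
f_tilde <- function(x) exp(-x^4)

# Fraction of proposals accepted when the envelope constant is M
rej_rate <- function(n_try, M) {
  y <- runif(n_try, min = -2, max = 2)   # proposals from Unif(-2, 2)
  u <- runif(n_try)
  mean(u <= f_tilde(y) / (M * 0.25))     # g(y) = 1/4 on [-2, 2]
}

rej_rate(1e5, M = 4)    # about 0.45
rej_rate(1e5, M = 16)   # about 0.11: four times the wasted draws
```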
When to use it
You want to estimate an expectation or probability under a target distribution, e.g. a tail probability such as \(P(X > t)\), and naive simulation is inefficient, especially if the event of interest is rare, so that most simulated draws contribute nothing to the estimate.
Key idea
Simulate from a different distribution that visits “important” regions more often, and then reweight.
Suppose we care about a very rare event, like a system failure that only happens under unusual, high-risk conditions.
If we simulate normally (from the usual model), almost every simulated run is a non-failure.
That means nearly all of the simulation budget is spent on runs that carry no information about the event we care about.
Instead of sampling under “typical conditions,” we:
Intentionally over-sample high-risk conditions where failures are more likely
Down-weight those high-risk samples so we don’t exaggerate the overall risk
So we get many more informative samples per simulation run. Think of looking for needles in a haystack: rather than searching the whole stack, we search a smaller pile where needles are concentrated. But after we collect needles from the smaller pile, we correct for the fact that we looked in a biased place by applying weights.
A weight is a correction factor that says:
“How much should this sample count when estimating the original probability?”
You can think of weights as undoing the bias introduced by “oversampling risk.”
Our Goal: estimate an average or probability under the original model \(f(x)\): \[ \mu = \mathbb{E}_{f}[h(X)]. \]
We simulate from a different model \(g(x)\) and correct with weights: \[ \mu = \int h(x)\,f(x)\,dx = \int h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \mathbb{E}_{g}\left[h(X)\,\underbrace{\frac{f(X)}{g(X)}}_{\text{weight}}\right]. \]
That’s all: sample from \(g\), then reweight to estimate under \(f\).
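A minimal sketch of this identity, with target, proposal, and \(h\) chosen arbitrarily for illustration: estimate \(\mathbb{E}_f[X^2] = 1\) for \(f = N(0,1)\) using draws from the wider proposal \(g = N(1, 2^2)\):

```r
set.seed(1)
n <- 1e5

x <- rnorm(n, mean = 1, sd = 2)          # sample from g, not from f
w <- dnorm(x, 0, 1) / dnorm(x, 1, 2)     # weights f(x) / g(x)

mean(x^2 * w)                            # close to E_f[X^2] = 1
```

Note that the unweighted `mean(x^2)` would estimate the second moment under \(g\) (which is \(1^2 + 2^2 = 5\)), not under \(f\); the weights undo the change of distribution.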
Your “high-risk” sampling distribution should cover every region the original model can reach, while putting extra probability mass on the region of interest.
Rule of thumb for tail events: shift the proposal toward the tail, e.g., center it near the threshold that defines the event.
Bad choice: your high-risk sampler avoids some region where the original model can go.
Result: the estimator is biased, because outcomes that are possible under \(f\) are never sampled, and nearby the weights \(f/g\) can blow up.
In practice, you should always check the distribution of the weights: if a handful of huge weights dominate the sum, the estimate is unreliable.
Target: \(Z \sim N(0,1)\). We want \[ p = P(Z > 4). \]
This is rare (about 3e-5), so with naive sampling you might see zero events unless \(n\) is huge.
Proposal: \(Z \sim N(\mu, 1)\) with \(\mu=4\), so we generate more tail values near 4.
set.seed(4279)
n <- 50000
threshold <- 4
# True value (for checking only)
p_true <- 1 - pnorm(threshold)
# 1) Naive MC: Z ~ N(0,1)
z_naive <- rnorm(n)
p_hat_naive <- mean(z_naive > threshold)
# 2) Importance sampling: sample from N(mu,1)
mu <- 4
z_is <- rnorm(n, mean = mu, sd = 1)
# weights: f/g = dnorm(z;0,1) / dnorm(z;mu,1)
w <- dnorm(z_is, mean = 0, sd = 1) / dnorm(z_is, mean = mu, sd = 1)
p_hat_is <- mean((z_is > threshold) * w)
c(p_true = p_true, naive = p_hat_naive, importance = p_hat_is)
##       p_true        naive   importance
## 3.167124e-05 2.000000e-05 3.158224e-05
Importance sampling can fail if weights are extremely variable.
A simple diagnostic is the effective sample size: \[ \text{ESS} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}. \]
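A sketch of the ESS computation for the tail example above (the naive draws are regenerated first only so the RNG stream matches the earlier code; exact values depend on the seed):

```r
set.seed(4279)
n <- 50000

z_naive <- rnorm(n)                       # consumed first in the example
mu <- 4
z_is <- rnorm(n, mean = mu, sd = 1)
w <- dnorm(z_is, 0, 1) / dnorm(z_is, mu, 1)

ess <- sum(w)^2 / sum(w^2)
c(n = n, ESS = ess, ESS_fraction = ess / n)
```

An ESS this far below \(n\) warns that a few large weights dominate the estimate; the usual remedy is a better-matched proposal, not simply more samples.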
## n ESS ESS_fraction
## 5.000000e+04 6.514954e+00 1.302991e-04