Getting Started: Simulation and Resampling Methods

Statistical Computing for Data Analysis

Outline

Probability in Statistics

Intuition: How likely is an event to happen?

Formalize this: We study a random variable \(X\) that represents the outcome of a random phenomenon

Example: \(X\) = corn yield

We want to

  1. be able to understand the distribution of our random variable

  2. simulation random draws from this distribution.

Common Distributions

You’ll use these often for simulation:

The Prefix System: d, p, q, r

R uses a consistent naming convention for probability distributions. You append the prefix to the core name of the distribution (e.g., norm, t, pois, binom).

Each function requires the specific parameters that define that distribution’s shape.

Brief Stats Review: Density Functions

Probability Density Function (PDF): \(f(x)\)

# Plotting the Standard Normal Density
x <- seq(-4, 4, length.out = 100)
plot(x, dnorm(x), type = "l", lwd = 2, col = "blue",
     main = "Density Function: dnorm()", ylab = "f(x)")
polygon(c(-2, seq(-2, 0, 0.01), 0), c(0, dnorm(seq(-2, 0, 0.01)), 0), col = "lightblue")

What is the pdf \(f(x)\) evaluated at a specific value?

dnorm(2,0,1) ## f(x=2) assuming X ~ N(0,1)
## [1] 0.05399097

Brief Stats Review: Distribution Functions

Cumulative Distribution Function (CDF): \(F(x)\)

# Plotting the Standard Normal CDF
plot(x, pnorm(x), type = "l", lwd = 2, col = "darkgreen",
     main = "Distribution Function: pnorm()", ylab = "F(x)")
segments(x0 = 0, y0 = 0, x1 = 0, y1 = pnorm(0), lty = 2, col = "red")
segments(x0 = -4, y0 = pnorm(0), x1 = 0, y1 = pnorm(0), lty = 2, col = "red")

What is the CDF \(F(x)\) evaluated at a specific value?

pnorm(2,0,1) ## P(X <= 2) = F(2) assuming X ~ N(0,1)
## [1] 0.9772499

Brief Stats Review: Quantiles

Quantile Function: \(F^{-1}(p)\)

# Finding the 95th quantile
p <- 0.95
q_val <- qnorm(p)
q_val
## [1] 1.644854
# Verifying the inverse relationship: the area to the left is 0.95
pnorm(q_val)
## [1] 0.95

Generating Random Data

Random Number GenerationTo simulate data from a theoretical distribution, we use the r prefix.

In R: rnorm(n, mean = 0, sd = 1) generates a vector of length \(n\) containing pseudo-random draws from the specified normal distribution.

# Simulate 5 standard normal random variables
rnorm(5)
## [1]  0.2057122  1.4940780 -1.7490528 -1.1078900  1.5682049

Reproducibility: set.seed()

Why use set.seed()?

set.seed(4279) 
rnorm(3)
## [1]  1.602368  1.161447 -2.093846
set.seed(4279) # Resetting the seed to the exact same starting point
rnorm(3)       # Yields the exact same draws
## [1]  1.602368  1.161447 -2.093846

Example: Student’s t-Distribution

Continuous, symmetric, but with heavier tails than the normal distribution.

# Generate 1000 random draws from a t-distribution with 5 df
t_draws <- rt(n = 1000, df = 5)

# Compare the density of our sample to the theoretical density
hist(t_draws, probability = TRUE, breaks = 30, col = "lightgray", main = "t-Distribution (df=5)")
curve(dt(x, df = 5), add = TRUE, col = "darkblue", lwd = 2)

Example: Poisson Distribution

Discrete distribution modeling the number of events occurring in a fixed interval.

# Probability of observing exactly 3 events when the average rate is 2
dpois(x = 3, lambda = 2)
## [1] 0.180447
# Generate 10 random counts with rate lambda = 2
rpois(n = 10, lambda = 2)
##  [1] 1 1 2 2 4 2 0 4 1 2

Example: Binomial Distribution

Discrete distribution modeling the number of successes in \(n\) independent trials..

# Probability of getting exactly 7 heads in 10 coin flips (fair coin)
dbinom(x = 7, size = 10, prob = 0.5)
## [1] 0.1171875
# Simulate 5 different experiments of 10 coin flips each
rbinom(n = 5, size = 10, prob = 0.5)
## [1] 6 6 5 5 6

From Theoretical to Empirical Distributions

The Empirical Cumulative Distribution Function (ECDF): \(\hat{F}_n(x)\)

Computing and Plotting the ECDF in R

The ecdf() function takes a numeric vector of data and returns a function that evaluates the step distribution. We can visualize how well our empirical data matches the theoretical distribution we drew it from.

set.seed(4279)
# 1. Generate a small sample from a standard normal distribution
obs_data <- rnorm(30)

# 2. Compute the ECDF
my_ecdf <- ecdf(obs_data)

# 3. Plot the empirical step function
plot(my_ecdf, main = "Empirical vs. Theoretical CDF", xlab = "x", ylab = "F(x)", lwd = 2)

# 4. Overlay the theoretical true CDF (pnorm) for comparison
curve(pnorm(x), add = TRUE, col = "red", lwd = 2, lty = 2)

Random Sampling from Finite Sets: sample()

Moving beyond theoretical probability distributions

Syntax: sample(x, size, replace = FALSE, prob = NULL)

Sampling Without Replacement

Default Behavior (replace = FALSE)

set.seed(4279)
# Draw 5 unique students from a roster of 100
sample(x = 1:100, size = 5, replace = FALSE)
## [1] 10 49 21 85 37
# Permutation: If size equals the length of x, it shuffles the vector
sample(x = c("A", "B", "C"), size = 3, replace = FALSE)
## [1] "C" "A" "B"

Sampling With Replacement

Resampling (replace = TRUE)

set.seed(4279)
# Simulating 10 rolls of a fair 6-sided die
sample(x = 1:6, size = 10, replace = TRUE)
##  [1] 2 1 5 5 5 4 4 5 5 5
# Bootstrap sample: Sampling from our previously generated obs_data
bootstrap_sample <- sample(x = obs_data, size = length(obs_data), replace = TRUE)
head(bootstrap_sample)
## [1]  0.4923485  1.0335527 -0.4363482  1.6023681  0.2630763  3.0152879