Statistical Computing for Data Analysis
Intuition: How likely is an event to happen?
Formalize this: We study a random variable \(X\) that represents the outcome of a random phenomenon
Example: \(X\) = corn yield
We want to
be able to understand the distribution of our random variable
simulation random draws from this distribution.
You’ll use these often for simulation:
R uses a consistent naming convention for probability distributions.
You append the prefix to the core name of the distribution (e.g.,
norm, t, pois,
binom).
d: Density function (or probability mass functionfor
discrete distributions)p: Probability (Cumulative Distribution Function)q: Quantile (Inverse CDF)r: Random generationEach function requires the specific parameters that define that distribution’s shape.
Probability Density Function (PDF): \(f(x)\)
dnorm(x, mean = 0, sd = 1)
evaluates the density \(f(x)\) at \(x\).# Plotting the Standard Normal Density
x <- seq(-4, 4, length.out = 100)
plot(x, dnorm(x), type = "l", lwd = 2, col = "blue",
main = "Density Function: dnorm()", ylab = "f(x)")
polygon(c(-2, seq(-2, 0, 0.01), 0), c(0, dnorm(seq(-2, 0, 0.01)), 0), col = "lightblue")What is the pdf \(f(x)\) evaluated at a specific value?
## [1] 0.05399097
Cumulative Distribution Function (CDF): \(F(x)\)
pnorm(q, mean = 0, sd = 1)
calculates this left-tail probability for a given value \(q\).# Plotting the Standard Normal CDF
plot(x, pnorm(x), type = "l", lwd = 2, col = "darkgreen",
main = "Distribution Function: pnorm()", ylab = "F(x)")
segments(x0 = 0, y0 = 0, x1 = 0, y1 = pnorm(0), lty = 2, col = "red")
segments(x0 = -4, y0 = pnorm(0), x1 = 0, y1 = pnorm(0), lty = 2, col = "red")What is the CDF \(F(x)\) evaluated at a specific value?
## [1] 0.9772499
Quantile Function: \(F^{-1}(p)\)
qnorm(p, mean = 0, sd = 1) finds
the value corresponding to the \(p\)-th
probability.## [1] 1.644854
## [1] 0.95
Random Number GenerationTo simulate data from a theoretical
distribution, we use the r prefix.
In R: rnorm(n, mean = 0, sd = 1)
generates a vector of length \(n\)
containing pseudo-random draws from the specified normal
distribution.
## [1] 0.2057122 1.4940780 -1.7490528 -1.1078900 1.5682049
Why use set.seed()?
## [1] 1.602368 1.161447 -2.093846
set.seed(4279) # Resetting the seed to the exact same starting point
rnorm(3) # Yields the exact same draws## [1] 1.602368 1.161447 -2.093846
Continuous, symmetric, but with heavier tails than the normal distribution.
tdf (degrees of freedom)# Generate 1000 random draws from a t-distribution with 5 df
t_draws <- rt(n = 1000, df = 5)
# Compare the density of our sample to the theoretical density
hist(t_draws, probability = TRUE, breaks = 30, col = "lightgray", main = "t-Distribution (df=5)")
curve(dt(x, df = 5), add = TRUE, col = "darkblue", lwd = 2)Discrete distribution modeling the number of events occurring in a fixed interval.
poislambda (\(\lambda\), the expected rate of
occurrence)dpois() calculates the
exact Probability Mass Function (PMF), \(P(X =
x)\).## [1] 0.180447
## [1] 1 1 2 2 4 2 0 4 1 2
Discrete distribution modeling the number of successes in \(n\) independent trials..
binomsize (number of trials, \(n\)) and prob (probability of
success on each trial, \(p\)# Probability of getting exactly 7 heads in 10 coin flips (fair coin)
dbinom(x = 7, size = 10, prob = 0.5)## [1] 0.1171875
## [1] 6 6 5 5 6
The Empirical Cumulative Distribution Function (ECDF): \(\hat{F}_n(x)\)
pnorm() to evaluate
the theoretical CDF, \(F(x)\),
of a known probability distribution.The ecdf() function takes a numeric vector of data and
returns a function that evaluates the step distribution. We can
visualize how well our empirical data matches the theoretical
distribution we drew it from.
set.seed(4279)
# 1. Generate a small sample from a standard normal distribution
obs_data <- rnorm(30)
# 2. Compute the ECDF
my_ecdf <- ecdf(obs_data)
# 3. Plot the empirical step function
plot(my_ecdf, main = "Empirical vs. Theoretical CDF", xlab = "x", ylab = "F(x)", lwd = 2)
# 4. Overlay the theoretical true CDF (pnorm) for comparison
curve(pnorm(x), add = TRUE, col = "red", lwd = 2, lty = 2)Moving beyond theoretical probability distributions
The r-prefix functions (rnorm, rpois)
generate data by sampling from theoretical probability models.
When we want to sample directly from an existing, finite vector of data (like our obs_data vector, or a specific set of categories), we use base R’s sample() function.
Syntax:
sample(x, size, replace = FALSE, prob = NULL)
x: A vector of elements to choose from.size: The number of items to choose.replace: Should sampling be with or without
replacement?Default Behavior (replace = FALSE)
x.set.seed(4279)
# Draw 5 unique students from a roster of 100
sample(x = 1:100, size = 5, replace = FALSE)## [1] 10 49 21 85 37
# Permutation: If size equals the length of x, it shuffles the vector
sample(x = c("A", "B", "C"), size = 3, replace = FALSE)## [1] "C" "A" "B"
Resampling (replace = TRUE)
size argument can be arbitrarily large.set.seed(4279)
# Simulating 10 rolls of a fair 6-sided die
sample(x = 1:6, size = 10, replace = TRUE)## [1] 2 1 5 5 5 4 4 5 5 5
# Bootstrap sample: Sampling from our previously generated obs_data
bootstrap_sample <- sample(x = obs_data, size = length(obs_data), replace = TRUE)
head(bootstrap_sample)## [1] 0.4923485 1.0335527 -0.4363482 1.6023681 0.2630763 3.0152879