Statistical Computing for Data Analysis
As your code grows, copy/paste becomes a problem:
A function lets you:
name <- function(arg1, arg2) {
  # compute something
  result <- ...
  return(result) # optional (R returns the last expression)
}
Key pieces:
## [1] 32
## [1] 32 50 68
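The two outputs above are consistent with a temperature conversion applied first to a single value and then to a vector. A minimal sketch, assuming a hypothetical function named `c_to_f` (the name and body are not taken from the original slide):

```r
# Hypothetical example: convert Celsius to Fahrenheit.
c_to_f <- function(celsius) {
  # The last expression evaluated is the return value.
  celsius * 9 / 5 + 32
}

c_to_f(0)             # a single value
c_to_f(c(0, 10, 20))  # arithmetic is vectorized, so a vector works too
```

Because R arithmetic works element by element, the same function handles both a scalar and a vector without any loop.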
Notes:
No return() is needed here because the last line is the result.
You can give arguments default values.
## [1] 0.0 0.0 0.3 1.0
## [1] -1.0 -0.1 0.3 1.2
Defaults make functions easier to use while still flexible.
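One way to see defaults in action is a clamping function whose bounds default to 0 and 1. This is only a sketch: the name `clamp` and the arguments `lower` and `upper` are assumptions chosen for illustration.

```r
# Hypothetical example: restrict values to the interval [lower, upper].
clamp <- function(x, lower = 0, upper = 1) {
  # pmax() raises values below `lower`; pmin() caps values above `upper`.
  pmin(pmax(x, lower), upper)
}

x <- c(-1, -0.1, 0.3, 1.2)
clamp(x)                         # uses the defaults: 0.0 0.0 0.3 1.0
clamp(x, lower = -1, upper = 2)  # wider bounds leave x unchanged
```

The caller only supplies `lower` and `upper` when the defaults do not fit.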
## [1] 0.0 0.2 1.0
## [1] 0.0 0.2 1.0
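Two identical outputs like those above are what you would expect from calling one function twice, once matching arguments by position and once by name. A sketch under that assumption, using a hypothetical `clamp` function:

```r
# Hypothetical example: restrict values to the interval [lower, upper].
clamp <- function(x, lower = 0, upper = 1) {
  pmin(pmax(x, lower), upper)
}

x <- c(-0.5, 0.2, 1.5)

by_position <- clamp(x, 0, 1)                  # matched in order: x, lower, upper
by_name     <- clamp(upper = 1, x = x, lower = 0)  # names allow any order

by_position
by_name
identical(by_position, by_name)  # both calls give the same result
```

Named arguments cost a few extra characters but make the call self-documenting.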
Recommendation for readability:
Return a list if you want more than one thing.
summarize_vec <- function(x) {
  list(
    n = length(x),
    mean = mean(x),
    sd = sd(x),
    min = min(x),
    max = max(x)
  )
}
out <- summarize_vec(c(1, 2, 3, 10))
out
## $n
## [1] 4
##
## $mean
## [1] 4
##
## $sd
## [1] 4.082483
##
## $min
## [1] 1
##
## $max
## [1] 10
## [1] 4
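The single value printed last is consistent with pulling one element out of the returned list. A short sketch of the usual ways to do that, repeating `summarize_vec` so the block runs on its own:

```r
summarize_vec <- function(x) {
  list(
    n = length(x),
    mean = mean(x),
    sd = sd(x),
    min = min(x),
    max = max(x)
  )
}

out <- summarize_vec(c(1, 2, 3, 10))

out$mean      # extract by name with $
out[["n"]]    # extract by name with [[ ]]
names(out)    # list the available elements
```

Naming the list elements is what makes `$` access possible, so callers do not have to remember positions.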
Variables created inside a function typically stay inside.
## [1] 9
This is a feature: functions help avoid accidental name collisions.
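A small sketch of local scope, assuming a hypothetical function `square` with an internal variable `squared` (neither name comes from the original slide):

```r
# Hypothetical example: `squared` exists only while square() is running.
square <- function(x) {
  squared <- x^2  # local variable, created inside the function
  squared
}

square(3)          # returns 9
exists("squared")  # FALSE in a fresh session: the local variable did not leak out
```

The value comes back through the return value, not through a variable left behind in the calling environment.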
A function should fail early with a useful message.
safe_log <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  if (any(x <= 0)) stop("x must be positive")
  log(x)
}
safe_log(c(1, 2, 10))
## [1] 0.0000000 0.6931472 2.3025851
# The next call raises an error. That is fine interactively, but when knitting,
# an error stops the entire document build unless the chunk allows errors
# (for example, with the knitr chunk option error = TRUE).
safe_log(c(1, -2, 3))
## Error in `safe_log()`:
## ! x must be positive
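To inspect an error message without halting a script, one option is to wrap the call in `tryCatch()`. This is a sketch of that idea, not something the original slide shows; `safe_log` is repeated so the block runs on its own, and `msg_type` and `msg_sign` are illustrative names.

```r
safe_log <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  if (any(x <= 0)) stop("x must be positive")
  log(x)
}

# Capture the error message instead of stopping execution.
msg_type <- tryCatch(
  safe_log("a"),
  error = function(e) conditionMessage(e)
)
msg_sign <- tryCatch(
  safe_log(c(1, -2, 3)),
  error = function(e) conditionMessage(e)
)

msg_type  # "x must be numeric"
msg_sign  # "x must be positive"
```

Each check fires before `log()` is reached, so the caller sees a message about the input, not a downstream warning about NaNs.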
stop(), warning(), and message()
stop("..."): error; function exits immediately
warning("..."): continues, but alerts the user
message("..."): informational (often for progress)
Goal: convert to z-scores: \(z_i = (x_i - \bar{x})/s\)
zscore <- function(x, na.rm = TRUE) {
  if (!is.numeric(x)) stop("x must be numeric")
  mu <- mean(x, na.rm = na.rm)
  s <- sd(x, na.rm = na.rm)
  # sd() is NA for fewer than two values, or when x has NAs and na.rm = FALSE
  if (is.na(s) || s == 0) stop("sd is NA or 0; cannot standardize")
  (x - mu) / s
}
zscore(c(10, 12, 14))
## [1] -1 0 1
## [1] -0.7071068 NA 0.7071068
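An output with an `NA` in the middle, like the one above, is what `zscore()` gives when the input contains a missing value and `na.rm = TRUE`. A sketch assuming the input `c(10, NA, 12)`, which is one vector that reproduces it (the original call is not shown); the `is.na(s)` guard is a defensive addition assumed here:

```r
zscore <- function(x, na.rm = TRUE) {
  if (!is.numeric(x)) stop("x must be numeric")
  mu <- mean(x, na.rm = na.rm)
  s <- sd(x, na.rm = na.rm)
  # Assumed guard: sd() can be NA (too few values, or NAs with na.rm = FALSE).
  if (is.na(s) || s == 0) stop("sd is NA or 0; cannot standardize")
  (x - mu) / s
}

z <- zscore(c(10, NA, 12))  # assumed input
z                           # the NA stays in place; the other values are standardized

# Sanity check: z-scores have mean 0 and standard deviation 1.
mean(z, na.rm = TRUE)
sd(z, na.rm = TRUE)
```

The mean and standard deviation are computed from the non-missing values only, so the missing entry propagates to the output without affecting the others.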
Sometimes you need explicit loops (e.g., custom algorithms).
cumprod_loop <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  out <- numeric(length(x))
  prod_so_far <- 1
  for (i in seq_along(x)) {
    prod_so_far <- prod_so_far * x[i]
    out[i] <- prod_so_far
  }
  out
}
cumprod_loop(c(2, 3, 4))
## [1] 2 6 24
This is slow:
Do this instead (pre-allocate):
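The contrast between the two patterns can be sketched as follows. The variable names and the squaring task are assumptions chosen for illustration, not the original slide's code.

```r
n <- 1000

# Slow pattern: the vector is regrown (and copied) on every iteration.
squares_grow <- c()
for (i in seq_len(n)) {
  squares_grow <- c(squares_grow, i^2)
}

# Pre-allocated pattern: create the full-length vector once, then fill it in.
squares_prealloc <- numeric(n)
for (i in seq_len(n)) {
  squares_prealloc[i] <- i^2
}

identical(squares_grow, squares_prealloc)  # same result either way
```

Both loops produce the same values; the difference is that growing with `c()` repeatedly copies the whole vector, which becomes noticeable as `n` gets large.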
A good function is: