Statistical Computing for Data Analysis
R offers a family of apply functions, which let you apply a function
across different chunks of data. They offer an alternative to explicit
iteration with a for() loop and can be simpler and faster, though not
always, and they can be harder to read. Summary of functions:
apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply() and vapply(): same as the above, but simplify the output (if possible)
tapply(): apply a function to levels of a factor vector

apply(): rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:
apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x

set.seed(123)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)

# Explicit loop: mean of each row
out <- numeric(nrow(X))
for (i in seq_len(nrow(X))) {
  out[i] <- mean(X[i, ])
}

# Same computation with apply()
out <- apply(X, 1, mean)

apply() is still a loop. You just don’t see the counter.
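The output that follows appears to come from the built-in state.x77 matrix; the calls that produced it are not shown. Below is a sketch of calls that would give output of this shape, assuming the built-in dataset and column-wise (MARGIN = 2) summaries. The variable names are made up for illustration.

```r
# state.x77 ships with base R (datasets package)
class(state.x77)
head(state.x77)

# Column-wise summaries: MARGIN = 2 means "apply over columns"
col_min   <- apply(state.x77, MARGIN = 2, FUN = min)
col_max   <- apply(state.x77, MARGIN = 2, FUN = max)
col_which <- apply(state.x77, MARGIN = 2, FUN = which.max)  # row index of each maximum

col_min
col_max
col_which
```

which.max() returns a row index, so the 5 under Population points at the fifth state in the matrix (California).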
## [1] "matrix" "array"

##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost
##     365.00    3098.00       0.50      67.96       1.40      37.80       0.00
##       Area
##    1049.00

## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost
##    21198.0     6315.0        2.8       73.6       15.1       67.3      188.0
##       Area
##   566432.0

## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost
##          5          2         18         11          1         44         28
##       Area
##          2
Don’t overuse the apply paradigm! There are many special-purpose,
optimized functions that are both simpler and faster than apply(). E.g.,
rowSums(), colSums(): for computing row, column sums of a matrix
rowMeans(), colMeans(): for computing row, column means of a matrix
max.col(): for finding the position of the maximum in each row of a matrix

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how do you count the number of positives in each row of a matrix?
x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })
## [1] 1 1 0

## [1] 1 1 0
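The second line of output above is presumably the result of the vectorized alternative, whose code is not shown. Here is a minimal sketch of what that alternative could look like, using rowSums() on a logical matrix. The seed and the names slow and fast are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- matrix(rnorm(9), 3, 3)

# apply() version: one R-level function call per row
slow <- apply(x, MARGIN = 1, function(v) sum(v > 0))

# Vectorized version: x > 0 is a logical matrix, and rowSums() counts the TRUEs per row
fast <- rowSums(x > 0)

all(slow == fast)
```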
lapply(): elements of a list or vector

lapply() applies a function to each element of a list
(or vector, which is treated as a list of length-1 elements).

Suppose we have a function called my_fun(). Don’t worry
about this syntax too much yet; we’ll talk about how to define custom
functions in the next set of slides.
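The definitions of my_list and my_fun() are not shown here. The sketch below is an assumption reconstructed from the printed results further down: 4.581139 for 1:5 and 8.52765 for 1:10 are what mean(x) + sd(x) gives. The seed is made up, so the random element c will not match the printed values.

```r
set.seed(42)  # assumed seed; the original one is unknown

# A named list with three numeric vectors of different lengths
my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))

# Assumed definition: mean plus standard deviation of a numeric vector
my_fun <- function(x) {
  mean(x) + sd(x)
}

my_list
```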
## $a
## [1] 1 2 3 4 5
##
## $b
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $c
## [1] 1.25381492 0.42646422 -0.29507148 0.89512566 0.87813349 0.82158108
## [7] 0.68864025 0.55391765 -0.06191171 -0.30596266 -0.38047100 -0.69470698
## [13] -0.20791728 -1.26539635 2.16895597 1.20796200 -1.12310858 -0.40288484
## [19] -0.46665535 0.77996512
out <- vector("list", length(my_list))
for (i in seq_along(my_list)) {
  out[[i]] <- my_fun(my_list[[i]])
}
out
## [[1]]
## [1] 4.581139
##
## [[2]]
## [1] 8.52765
##
## [[3]]
## [1] 1.101703
lapply() version:
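Below is a sketch of the call that likely produced the output that follows. my_list and my_fun() are defined here under assumptions (mean plus standard deviation, a made-up seed) so that the block runs on its own.

```r
set.seed(42)  # assumed seed
my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))
my_fun <- function(x) mean(x) + sd(x)  # assumed definition

# One call replaces the preallocate-and-fill loop
out <- lapply(my_list, my_fun)
out
```

Unlike the loop, which fills an unnamed list, lapply() carries over the element names, which is why the output is labelled $a, $b, $c rather than [[1]], [[2]], [[3]].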
## $a
## [1] 4.581139
##
## $b
## [1] 8.52765
##
## $c
## [1] 1.101703
lapply() is still a loop. The counter is hidden.
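The two outputs below look like lapply() called with mean and then with length. A sketch under that assumption, with my_list as an assumed stand-in for the list used in the slides:

```r
set.seed(42)  # assumed seed
my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))

means <- lapply(my_list, mean)    # one mean per element
sizes <- lapply(my_list, length)  # one length per element

means
sizes
```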
## $a
## [1] 3
##
## $b
## [1] 5.5
##
## $c
## [1] 0.2235237
## $a
## [1] 5
##
## $b
## [1] 10
##
## $c
## [1] 20
Note: the output is always a list.
sapply(): elements of a list or vector

The sapply() function works just like lapply(), but tries to simplify
the return value whenever possible. The most common case is the
conversion from a list to a vector.
If every result has length one: returns a vector
If every result has the same length greater than one: returns a matrix
Otherwise: returns a list
Observe the difference in structure.
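A small sketch of the comparison, assuming a list like the one used earlier (the seed and the names as_list and as_vector are made up for illustration):

```r
set.seed(42)  # assumed seed
my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))

as_list   <- lapply(my_list, mean)  # list of three length-1 results
as_vector <- sapply(my_list, mean)  # simplified to a named numeric vector

str(as_list)
str(as_vector)
```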
## $a
## [1] 3
##
## $b
## [1] 5.5
##
## $c
## [1] 0.06571215

##          a          b          c
## 3.00000000 5.50000000 0.06571215
Why be cautious with sapply()?
Because the output type depends on the result. This can cause subtle bugs if structure changes.
For example:
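The outputs below look like lapply() and sapply() called with summary(). The sketch reproduces that comparison and then adds a hypothetical variation (the with_na list is not from the slides) in which the structure silently changes.

```r
set.seed(42)  # assumed seed
my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))

lapply(my_list, summary)       # always a list
m <- sapply(my_list, summary)  # every summary has 6 values, so the result is a 6 x 3 matrix
m

# Hypothetical variation: one element contains an NA.
# summary() then adds an "NA's" entry for that element only,
# the result lengths differ, and sapply() falls back to a list.
with_na <- list(a = 1:5, b = c(1, NA, 3))
l <- sapply(with_na, summary)
class(l)
```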
## $a
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1       2       3       3       4       5
##
## $b
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.25    5.50    5.50    7.75   10.00
##
## $c
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -1.54875 -0.25263  0.08843  0.06571  0.39678  1.51647

##         a     b           c
## Min.    1  1.00 -1.54875280
## 1st Qu. 2  3.25 -0.25263009
## Median  3  5.50  0.08842924
## Mean    3  5.50  0.06571215
## 3rd Qu. 4  7.75  0.39678206
## Max.    5 10.00  1.51647060
The output may be a matrix — or not — depending on what
summary() returns.
vapply(): Type-safe version of sapply()

vapply() is a safer version of sapply().
You must explicitly specify the expected output type.
Form: vapply(X, FUN, FUN.VALUE = numeric(1))
FUN.VALUE tells R what each return value should look like.
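A sketch of the form above in use, assuming a list like the one from the earlier slides (the seed and the names v and res are made up). It also shows what happens when a result does not match FUN.VALUE.

```r
set.seed(42)  # assumed seed
my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))

# Each call to mean() must return exactly one numeric value
v <- vapply(my_list, mean, FUN.VALUE = numeric(1))
v

# summary() returns six values, not one, so vapply() stops with an error
# instead of silently changing the output structure
res <- tryCatch(
  vapply(my_list, summary, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
res

# On empty input the result type is still known
vapply(list(), length, FUN.VALUE = integer(1))  # integer(0)
sapply(list(), length)                          # list()
```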
## a b c
## 3.00000000 5.50000000 0.06571215
This guarantees that every result has the declared type and length; a mismatch is an error.
More robust inside larger functions.
tapply(): levels of a factor vector

tapply() applies a function to subsets of a vector defined by a factor.
Form: tapply(X, INDEX = group, FUN = my.fun)
# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)
##     Northeast         South North Central          West
##      132.7778       64.6250      138.8333      102.1538

##     Northeast         South North Central          West
##      30.89408      31.30682      23.89307      68.87652

## [1] "factor"

## state.region
##     Northeast         South North Central          West
##             9            16            12            13
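Only the mean call is shown above; the remaining outputs presumably come from the sd version plus a quick look at the grouping factor. A sketch under that assumption (the names frost, frost_sd and counts are made up for illustration):

```r
frost <- state.x77[, "Frost"]

# Same grouping, different summary function
frost_sd <- tapply(frost, INDEX = state.region, FUN = sd)
frost_sd

# The grouping variable is a factor; table() counts the states in each level
class(state.region)
counts <- table(state.region)
counts
```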
Equivalent loop logic:
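The loop itself is not shown here. One way it could be written, as a sketch rather than the original code, is to visit each level of the factor and summarize the matching subset:

```r
frost  <- state.x77[, "Frost"]
groups <- levels(state.region)

# Preallocate one slot per level, then fill it group by group
out <- numeric(length(groups))
names(out) <- groups
for (g in groups) {
  out[g] <- mean(frost[state.region == g])
}
out

# Compare with tapply(), ignoring names and array attributes
all.equal(unname(out), as.numeric(tapply(frost, INDEX = state.region, FUN = mean)))
```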
split(): split by levels of a factor

The function split() splits up the rows of a data frame
by levels of a factor, as in: split(x, f=my.index) to split
a data frame x according to levels of
my.index
# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list
## [1] "list"

## [1] "Northeast"     "South"         "North Central" "West"

## [1] "data.frame"
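The names and class outputs above, and the per-region preview below, appear without their code. Here is a sketch of calls that would produce output of this shape. The use of head() with 3 rows is inferred from the preview and is an assumption, as is the name preview.

```r
state.by.reg <- split(data.frame(state.x77), f = state.region)

names(state.by.reg)       # one list element per region
class(state.by.reg[[1]])  # each element is itself a data frame

# First three rows of every region; arguments after FUN are passed on to head()
preview <- lapply(state.by.reg, FUN = head, 3)
preview
```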
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139 4862
## Maine 1058 3694 0.7 70.39 2.7 54.7 161 30920
## Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103 7826
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Illinois 11197 5107 0.9 70.14 10.3 52.6 127 55748
## Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097
## Iowa 2861 4628 0.5 72.56 2.3 59.0 140 55941
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) {
  return(apply(df, MARGIN=2, mean))
})
## $Northeast
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 5495.111111 4570.222222 1.000000 71.264444 4.722222 53.966667
## Frost Area
## 132.777778 18141.000000
##
## $South
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4208.12500 4011.93750 1.73750 69.70625 10.58125 44.34375
## Frost Area
## 64.62500 54605.12500
##
## $`North Central`
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 4803.00000 4611.08333 0.70000 71.76667 5.27500 54.51667
## Frost Area
## 138.83333 62652.00000
##
## $West
## Population Income Illiteracy Life.Exp Murder HS.Grad
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00 6.200000e+01
## Frost Area
## 1.021538e+02 1.344630e+05
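Following the earlier advice about special-purpose functions, the inner apply() could be swapped for colMeans(). This is an alternative sketch, not the version used above, and the names via_apply, via_colmeans and means_mat are made up for illustration.

```r
state.by.reg <- split(data.frame(state.x77), f = state.region)

# Version from the text: apply() inside lapply()
via_apply <- lapply(state.by.reg, FUN = function(df) apply(df, MARGIN = 2, mean))

# Alternative: colMeans() does the column averaging directly
via_colmeans <- lapply(state.by.reg, FUN = colMeans)

# With sapply(), the four length-8 results simplify to an 8 x 4 matrix
means_mat <- sapply(state.by.reg, colMeans)
round(means_mat, 2)
```

The matrix form puts regions side by side, which avoids the mixed fixed and scientific notation seen in the list output.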
These functions are structured abstractions over loops:
for() → explicit iteration
apply() → loop over rows/columns
lapply() → loop over list elements
sapply() → loop + auto-simplify
vapply() → loop + enforced type
tapply() → loop over groups
All of them are still iteration.
The only thing removed is the visible counter.
apply() vs for()?

Use for() when:
Use the apply() family when: