Control Flow: apply() family

Statistical Computing for Data Analysis

The apply family

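R offers a family of apply functions, which let you apply a function across different chunks of data. They offer an alternative to explicit iteration with a for() loop; the result can be simpler and faster, though not always, and it can be less readable. Summary of functions: apply() works on the rows or columns of a matrix or data frame; lapply() and sapply() work on the elements of a list or vector; vapply() is a type-safe version of sapply(); tapply() works on the levels of a factor vector.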

apply(): rows or columns of a matrix or data frame

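The apply() function takes inputs of the following form: apply(X, MARGIN, FUN), where X is a matrix or data frame, MARGIN = 1 applies FUN to each row, and MARGIN = 2 applies FUN to each column. For example, to compute the mean of each row, first with a for() loop and then with apply():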

set.seed(123)

X <- matrix(rnorm(20), nrow = 5, ncol = 4)

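# Loop version: compute the mean of each row
out <- numeric(nrow(X))

for (i in seq_len(nrow(X))) {
  out[i] <- mean(X[i, ])
}

# apply() version: MARGIN = 1 applies mean() to each row
out <- apply(X, 1, mean)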

apply() is still a loop. You just don’t see the counter.
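
As a quick check that the two versions agree, the sketch below recomputes the row means both ways and compares them with all.equal(). The names out_loop and out_apply are introduced here for illustration only.

```r
set.seed(123)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)

# Loop version: one mean per row, with an explicit counter
out_loop <- numeric(nrow(X))
for (i in seq_len(nrow(X))) {
  out_loop[i] <- mean(X[i, ])
}

# apply() version: MARGIN = 1 means "over rows"
out_apply <- apply(X, 1, mean)

all.equal(out_loop, out_apply)   # TRUE: same values either way

# MARGIN = 2 gives one value per column instead
col_means <- apply(X, 2, mean)
length(col_means)                # 4
```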

class(state.x77) # Built-in matrix of states data, 50 states x 8 variables
## [1] "matrix" "array"
head(state.x77) 
##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
apply(state.x77, MARGIN=2, FUN=min) # Minimum entry in each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##     365.00    3098.00       0.50      67.96       1.40      37.80       0.00 
##       Area 
##    1049.00
apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##    21198.0     6315.0        2.8       73.6       15.1       67.3      188.0 
##       Area 
##   566432.0
apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column
## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##          5          2         18         11          1         44         28 
##       Area 
##          2
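
The examples above all use a FUN that returns one value per column. As a minimal sketch on a small made-up matrix (m is hypothetical data, not from the slides), the code below shows two other behaviors of apply(): when FUN returns several values the result is a matrix, and extra arguments placed after FUN are passed on to FUN.

```r
# Small example matrix (hypothetical data)
m <- matrix(c(1, 5, 3,
              4, 2, 6), nrow = 2, byrow = TRUE)

# range() returns two values per column, so the result is a 2 x 3 matrix:
# row 1 holds the column minimums, row 2 the column maximums
col_ranges <- apply(m, MARGIN = 2, FUN = range)
col_ranges

# Arguments after FUN are forwarded to FUN; here probs = 0.5 asks
# quantile() for the median of each row
row_medians <- apply(m, MARGIN = 1, FUN = quantile, probs = 0.5)
row_medians
```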

Optimized functions for special tasks

Don’t overuse the apply paradigm! There are lots of special functions that are optimized and will be both simpler and faster than using apply(). E.g., rowSums() and colSums() compute the row and column sums of a matrix, and rowMeans() and colMeans() compute the row and column means.

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how do you count the number of positive entries in each row of a matrix?

x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })
## [1] 1 1 0
# Do this instead (much faster, simpler)
rowSums(x > 0)
## [1] 1 1 0
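
To check that the optimized functions really are drop-in replacements, the sketch below compares them with the apply() versions on a larger simulated matrix. The matrix big and its dimensions are arbitrary choices made for illustration; wrapping either call in system.time() would show the speed difference on your own machine.

```r
set.seed(1)
big <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)

# Count positives per row: apply() version versus rowSums()
via_apply   <- apply(big, MARGIN = 1, function(v) sum(v > 0))
via_rowsums <- rowSums(big > 0)
all.equal(as.numeric(via_apply), via_rowsums)

# Column means: apply() version versus colMeans()
all.equal(apply(big, MARGIN = 2, mean), colMeans(big))
```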

lapply(): elements of a list or vector

lapply() applies a function to each element of a list (or vector, which is treated as a list of length-1 elements).

Suppose we have a function called my_fun(). Don’t worry about this syntax too much yet; we’ll talk about how to define custom functions in the next set of slides.

my_fun <- function(x) {
  mean(x) + sd(x)
}

my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))
my_list
## $a
## [1] 1 2 3 4 5
## 
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $c
##  [1]  1.25381492  0.42646422 -0.29507148  0.89512566  0.87813349  0.82158108
##  [7]  0.68864025  0.55391765 -0.06191171 -0.30596266 -0.38047100 -0.69470698
## [13] -0.20791728 -1.26539635  2.16895597  1.20796200 -1.12310858 -0.40288484
## [19] -0.46665535  0.77996512
out <- vector("list", length(my_list))

for (i in seq_along(my_list)) {
  out[[i]] <- my_fun(my_list[[i]])
}

out
## [[1]]
## [1] 4.581139
## 
## [[2]]
## [1] 8.52765
## 
## [[3]]
## [1] 1.101703

lapply() version:

out <- lapply(my_list, my_fun)
out
## $a
## [1] 4.581139
## 
## $b
## [1] 8.52765
## 
## $c
## [1] 1.101703

lapply() is still a loop. The counter is hidden.

lapply(my_list, mean)
## $a
## [1] 3
## 
## $b
## [1] 5.5
## 
## $c
## [1] 0.2235237
lapply(my_list, length)
## $a
## [1] 5
## 
## $b
## [1] 10
## 
## $c
## [1] 20

Note: the output is always a list.
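
The same holds when the input is a plain vector rather than a list. A minimal sketch (the name squares is introduced here for illustration): each element of the vector is passed to the function in turn, and the results still come back as a list, which unlist() can flatten.

```r
# lapply() over an atomic vector: each element is passed to FUN in turn
squares <- lapply(1:3, function(i) i^2)

class(squares)    # "list", even though every result is a single number
unlist(squares)   # flatten to a plain numeric vector: 1 4 9
```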

sapply(): elements of a list or vector

The sapply() function works just like lapply(), but tries to simplify the return value whenever possible. The most common simplification is from a list to a vector.

Observe the difference in structure.

my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))

lapply(my_list, FUN=mean)
## $a
## [1] 3
## 
## $b
## [1] 5.5
## 
## $c
## [1] 0.06571215
sapply(my_list, FUN=mean) # Simplifies the result, now a vector
##          a          b          c 
## 3.00000000 5.50000000 0.06571215

Why be cautious with sapply()?

Because the type of the output depends on what FUN returns for each element. This can cause subtle bugs if the structure of those results changes.

For example:

lapply(my_list, FUN=summary) 
## $a
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5 
## 
## $b
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00 
## 
## $c
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.54875 -0.25263  0.08843  0.06571  0.39678  1.51647
sapply(my_list, FUN=summary) ## returns a matrix
##         a     b           c
## Min.    1  1.00 -1.54875280
## 1st Qu. 2  3.25 -0.25263009
## Median  3  5.50  0.08842924
## Mean    3  5.50  0.06571215
## 3rd Qu. 4  7.75  0.39678206
## Max.    5 10.00  1.51647060

The output is a matrix here only because summary() returns a vector of the same length for every element. If the lengths differed, sapply() would return a list instead.
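
To make the dependence on the result concrete, here is a small sketch with made-up inputs (same_len, r1, r2 and r3 are names introduced for illustration). The same sapply() call pattern returns a vector, a matrix, or a list depending only on what FUN gives back.

```r
# Hypothetical inputs chosen to show the three possible shapes
same_len <- list(a = 1:5, b = 1:10)

r1 <- sapply(same_len, mean)                   # length-1 results: named vector
r2 <- sapply(same_len, range)                  # length-2 results: 2 x 2 matrix
r3 <- sapply(same_len, function(v) v[v > 3])   # unequal lengths: list

is.atomic(r1)   # TRUE
is.matrix(r2)   # TRUE
is.list(r3)     # TRUE
```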

vapply(): Type-safe version of sapply()

vapply() is a safer version of sapply().

You must explicitly specify the expected output type.

Form: vapply(X, FUN, FUN.VALUE = numeric(1))

FUN.VALUE tells R what each return value should look like.

vapply(my_list, mean, FUN.VALUE = numeric(1))
##          a          b          c 
## 3.00000000 5.50000000 0.06571215

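This guarantees that every result has the declared type and length; if any result does not match, vapply() stops with an error. That makes it more robust inside larger functions.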
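
A minimal sketch of what the check looks like in practice, using a small made-up list (num_list is hypothetical and is not the my_list from the slides): a matching FUN.VALUE succeeds, while a mismatched one fails immediately rather than changing the output structure.

```r
num_list <- list(a = c(1, 2, 3), b = c(4, 6), c = c(2.5, 3.5))

# Each result is a single number, so the declared shape matches
means <- vapply(num_list, mean, FUN.VALUE = numeric(1))

# range() returns two numbers, which violates numeric(1); vapply() stops
# with an error instead of silently returning a different structure
res <- tryCatch(
  vapply(num_list, range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
res   # "error"

# Declaring the right shape works and yields a 2 x 3 matrix
ranges <- vapply(num_list, range, FUN.VALUE = numeric(2))
dim(ranges)
```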

tapply(): levels of a factor vector

tapply() applies a function to subsets of a vector defined by a factor.

Form: tapply(X, INDEX = group, FUN = my.fun)

# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)
##     Northeast         South North Central          West 
##      132.7778       64.6250      138.8333      102.1538
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=sd)
##     Northeast         South North Central          West 
##      30.89408      31.30682      23.89307      68.87652
class(state.region)
## [1] "factor"
table(state.region)
## state.region
##     Northeast         South North Central          West 
##             9            16            12            13

Equivalent loop logic:

  1. Split vector by group
  2. Apply function within each group
  3. Collect results
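
The three steps above can be sketched by hand with split() and sapply(), then compared against tapply(). This is only an illustration of the logic, not a description of how tapply() is implemented internally; the names frost, frost_by_region, means_manual and means_tapply are introduced here.

```r
frost <- state.x77[, "Frost"]

# 1. Split vector by group (a list with one numeric vector per region)
frost_by_region <- split(frost, state.region)

# 2. Apply function within each group, and 3. collect results
means_manual <- sapply(frost_by_region, mean)

# Compare with tapply()
means_tapply <- tapply(frost, INDEX = state.region, FUN = mean)
all.equal(as.vector(means_manual), as.vector(means_tapply))
```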

split(): split by levels of a factor

The function split() splits up the rows of a data frame by levels of a factor, as in split(x, f=my.index), which splits a data frame x according to the levels of my.index.

# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list
## [1] "list"
names(state.by.reg) # This has 4 elements for the 4 regions
## [1] "Northeast"     "South"         "North Central" "West"
class(state.by.reg[[1]]) # Each element is a data frame
## [1] "data.frame"
# For each region, display the first 3 rows of the data frame
lapply(state.by.reg, FUN=head, 3) 
## $Northeast
##               Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Connecticut         3100   5348        1.1    72.48    3.1    56.0   139  4862
## Maine               1058   3694        0.7    70.39    2.7    54.7   161 30920
## Massachusetts       5814   4755        1.1    71.83    3.3    58.5   103  7826
## 
## $South
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20 50708
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65 51945
## Delaware        579   4809        0.9    70.06    6.2    54.6   103  1982
## 
## $`North Central`
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Illinois      11197   5107        0.9    70.14   10.3    52.6   127 55748
## Indiana        5313   4458        0.7    70.88    7.1    52.9   122 36097
## Iowa           2861   4628        0.5    72.56    2.3    59.0   140 55941
## 
## $West
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) { 
  return(apply(df, MARGIN=2, mean)) 
})
## $Northeast
##   Population       Income   Illiteracy     Life.Exp       Murder      HS.Grad 
##  5495.111111  4570.222222     1.000000    71.264444     4.722222    53.966667 
##        Frost         Area 
##   132.777778 18141.000000 
## 
## $South
##  Population      Income  Illiteracy    Life.Exp      Murder     HS.Grad 
##  4208.12500  4011.93750     1.73750    69.70625    10.58125    44.34375 
##       Frost        Area 
##    64.62500 54605.12500 
## 
## $`North Central`
##  Population      Income  Illiteracy    Life.Exp      Murder     HS.Grad 
##  4803.00000  4611.08333     0.70000    71.76667     5.27500    54.51667 
##       Frost        Area 
##   138.83333 62652.00000 
## 
## $West
##   Population       Income   Illiteracy     Life.Exp       Murder      HS.Grad 
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00 6.200000e+01 
##        Frost         Area 
## 1.021538e+02 1.344630e+05
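
As a possible variation on the last example, colMeans() can stand in for apply(df, MARGIN=2, mean), and sapply() can collect the per-region results into a single matrix instead of a list. This is a sketch of an alternative, not the slides' own approach; region_means is a name introduced here.

```r
state.by.reg <- split(data.frame(state.x77), f = state.region)

# colMeans() is the optimized replacement for apply(df, MARGIN = 2, mean);
# sapply() simplifies the four length-8 results into one matrix
region_means <- sapply(state.by.reg, colMeans)

dim(region_means)                   # 8 variables x 4 regions
round(region_means["Frost", ], 2)   # Frost averages by region
```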

Conceptual hierarchy

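These functions are structured abstractions over loops: apply() iterates over the rows or columns of a matrix, lapply() over the elements of a list or vector, sapply() and vapply() do the same and then simplify the result, and tapply() iterates over the groups defined by a factor.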

All of them are still iteration.

The only thing removed is the visible counter.

When should you use apply() vs for()?

Use for() when:

Use apply() family when:
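
One common rule of thumb, offered here as an assumption rather than as the slides' own criteria: a for() loop is a natural fit when each iteration depends on the previous one, and the apply family is a natural fit when each element is processed independently. The sketch below illustrates both cases with simulated data (steps, walk and samples are hypothetical names).

```r
# Case 1: each step needs the previous result, so a for() loop is natural
set.seed(42)
steps <- rnorm(10)
walk <- numeric(length(steps))
walk[1] <- steps[1]
for (i in 2:length(steps)) {
  walk[i] <- walk[i - 1] + steps[i]   # depends on walk[i - 1]
}
all.equal(walk, cumsum(steps))        # cumsum() is the vectorized equivalent

# Case 2: each element is handled independently, so the apply family fits
samples <- list(a = rnorm(5), b = rnorm(10), c = rnorm(20))
sds <- vapply(samples, sd, FUN.VALUE = numeric(1))
sds
```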