The apply family

R offers a family of apply functions, which let you apply a function across different chunks of data. They offer an alternative to explicit iteration with a for() loop. They can be simpler and faster, though not always, and they can be harder to read. Summary of functions:

apply(): apply a function to rows or columns of a matrix or data frame
lapply(): apply a function to elements of a list or vector
sapply() and vapply(): same as lapply(), but simplify the output (sapply() if possible, vapply() to a type you declare)
tapply(): apply a function to levels of a factor vector

`apply()`: rows or columns of a matrix or data frame

The apply() function takes inputs of the following form:

apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x
apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x

set.seed(123)

X <- matrix(rnorm(20), nrow = 5, ncol = 4)

# Loop version: compute the mean of each row
out <- numeric(nrow(X))

for (i in seq_len(nrow(X))) {
  out[i] <- mean(X[i, ])
}

# apply() version: same result in one line
out <- apply(X, 1, mean)

apply() is still a loop. You just don’t see the counter.
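To make the MARGIN argument concrete, here is a small sketch on a 2 x 3 matrix whose entries are easy to check by hand. The matrix `m` and the result names are made up for illustration; they are not part of the data used elsewhere in these notes.

```r
# A tiny matrix (hypothetical example data) so results can be checked by hand
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, MARGIN = 1, FUN = sum)   # one value per row: 9 12
col_sums <- apply(m, MARGIN = 2, FUN = sum)   # one value per column: 3 7 11

# When FUN returns a vector, apply() stores each result as a COLUMN,
# so applying range() over the 2 rows gives a 2 x 2 matrix, one column per row
row_ranges <- apply(m, MARGIN = 1, FUN = range)
row_ranges
```

The last result is worth remembering: results for rows come back as columns, so a t() may be needed to get one row of output per row of input.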

class(state.x77) # Built-in matrix of states data, 50 states x 8 variables

## [1] "matrix" "array"

head(state.x77)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

apply(state.x77, MARGIN=2, FUN=min) # Minimum entry in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##     365.00    3098.00       0.50      67.96       1.40      37.80       0.00 
##       Area 
##    1049.00

apply(state.x77, MARGIN=2, FUN=max) # Maximum entry in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##    21198.0     6315.0        2.8       73.6       15.1       67.3      188.0 
##       Area 
##   566432.0

apply(state.x77, MARGIN=2, FUN=which.max) # Index of the max in each column

## Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost 
##          5          2         18         11          1         44         28 
##       Area 
##          2
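The indices returned by which.max are more useful once they are mapped back to state names. The sketch below does that, and also shows two things the examples above do not: a FUN that returns more than one value, and extra arguments passed through to FUN. The variable names (`idx`, `top_states`, `col_ranges`, `col_quartiles`) are chosen here for illustration.

```r
# state.x77 ships with R in the datasets package

# Look up which state holds the maximum in each column
idx <- apply(state.x77, MARGIN = 2, FUN = which.max)
top_states <- rownames(state.x77)[idx]
names(top_states) <- colnames(state.x77)
top_states

# FUN may return more than one value: range() gives the min and max per
# column, collected into a 2 x 8 matrix
col_ranges <- apply(state.x77, MARGIN = 2, FUN = range)

# Arguments after FUN are passed on to it (here, probs goes to quantile())
col_quartiles <- apply(state.x77, MARGIN = 2, FUN = quantile,
                       probs = c(0.25, 0.75))
```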

Optimized functions for special tasks

Don’t overuse the apply paradigm! There are lots of special-purpose functions that are optimized and will be both simpler and faster than using apply(). E.g.,

rowSums(), colSums(): for computing row, column sums of a matrix
rowMeans(), colMeans(): for computing row, column means of a matrix
max.col(): for finding the maximum position in each row of a matrix

Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot. E.g., how to count the number of positives in each row of a matrix?

x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })

## [1] 1 1 0

# Do this instead (much faster, simpler)
rowSums(x > 0)

## [1] 1 1 0
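A rough way to check the speed claim yourself is sketched below on a larger matrix. The matrix size and the names `big`, `via_apply`, and `via_rowsums` are arbitrary choices for this illustration, and the timings will differ from machine to machine.

```r
set.seed(1)
big <- matrix(rnorm(1000 * 200), nrow = 1000)   # hypothetical larger matrix

via_apply   <- apply(big, MARGIN = 1, function(v) sum(v > 0))
via_rowsums <- rowSums(big > 0)

# Same counts. all.equal() is used rather than identical() because apply()
# returns integers here while rowSums() returns doubles
isTRUE(all.equal(via_apply, via_rowsums))

# Rough timing comparison; numbers vary by machine
system.time(apply(big, MARGIN = 1, function(v) sum(v > 0)))
system.time(rowSums(big > 0))

# Row means follow the same pattern
isTRUE(all.equal(apply(big, 1, mean), rowMeans(big)))
```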

`lapply()`: elements of a list or vector

lapply() applies a function to each element of a list (or vector, which is treated as a list of length-1 elements).

Suppose we have a function called my_fun(). Don’t worry about this syntax too much yet; we’ll talk about how to define custom functions in the next set of slides.

my_fun <- function(x) {
  mean(x) + sd(x)
}

my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))
my_list

## $a
## [1] 1 2 3 4 5
## 
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $c
##  [1]  1.25381492  0.42646422 -0.29507148  0.89512566  0.87813349  0.82158108
##  [7]  0.68864025  0.55391765 -0.06191171 -0.30596266 -0.38047100 -0.69470698
## [13] -0.20791728 -1.26539635  2.16895597  1.20796200 -1.12310858 -0.40288484
## [19] -0.46665535  0.77996512

out <- vector("list", length(my_list))

for (i in seq_along(my_list)) {
  out[[i]] <- my_fun(my_list[[i]])
}

out

## [[1]]
## [1] 4.581139
## 
## [[2]]
## [1] 8.52765
## 
## [[3]]
## [1] 1.101703

lapply() version:

out <- lapply(my_list, my_fun)
out

## $a
## [1] 4.581139
## 
## $b
## [1] 8.52765
## 
## $c
## [1] 1.101703

lapply() is still a loop. The counter is hidden.

lapply(my_list, mean)

## $a
## [1] 3
## 
## $b
## [1] 5.5
## 
## $c
## [1] 0.2235237

lapply(my_list, length)

## $a
## [1] 5
## 
## $b
## [1] 10
## 
## $c
## [1] 20

Note: the output is always a list.
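The "always a list" rule also holds when the input is a plain vector. A minimal sketch, with made-up inputs and names (`squares`, `lens`):

```r
# lapply() over an atomic vector: each element is passed to the function in turn
squares <- lapply(1:3, function(i) i^2)
is.list(squares)     # TRUE, even though the input was a vector

# unlist() flattens the list when a plain vector is wanted
unlist(squares)      # 1 4 9

# Names on the input carry over to the output
lens <- lapply(list(a = 1:5, b = letters), length)
names(lens)          # "a" "b"
```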

`sapply()`: elements of a list or vector

The sapply() function works just like lapply(), but tries to simplify the return value whenever possible. The most common case is converting a list to a vector:

If every result has length one: returns a vector
If every result has the same length greater than one: returns a matrix
Otherwise: returns a list

Observe the difference in structure.

my_list <- list(a = 1:5, b = 1:10, c = rnorm(20))

lapply(my_list, FUN=mean)

## $a
## [1] 3
## 
## $b
## [1] 5.5
## 
## $c
## [1] 0.06571215

sapply(my_list, FUN=mean) # Simplifies the result, now a vector

##          a          b          c 
## 3.00000000 5.50000000 0.06571215

Why be cautious with sapply()?

Because the output type depends on the result. This can cause subtle bugs if structure changes.

For example:

lapply(my_list, FUN=summary)

## $a
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       3       3       4       5 
## 
## $b
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00 
## 
## $c
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.54875 -0.25263  0.08843  0.06571  0.39678  1.51647

sapply(my_list, FUN=summary) ## returns a matrix

##         a     b           c
## Min.    1  1.00 -1.54875280
## 1st Qu. 2  3.25 -0.25263009
## Median  3  5.50  0.08842924
## Mean    3  5.50  0.06571215
## 3rd Qu. 4  7.75  0.39678206
## Max.    5 10.00  1.51647060

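The output is a matrix here because every summary() result has the same length. If the lengths differed (for example, summary() adds an NA's entry when a vector contains missing values), sapply() would return a list instead.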
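The three possible output shapes can be seen side by side in the sketch below. The list `ex` and the result names are hypothetical, chosen so that each case is triggered.

```r
ex <- list(a = 1:5, b = 1:10, c = c(2, 4, 6))   # hypothetical example list

lens <- sapply(ex, length)                  # length-1 results  -> named vector
rng  <- sapply(ex, range)                   # length-2 results  -> 2 x 3 matrix
big  <- sapply(ex, function(v) v[v > 3])    # differing lengths -> list

# summary() adds an "NA's" entry for vectors with missing values, so the
# results no longer share a length and sapply() falls back to a list
with_na <- sapply(list(a = 1:5, b = c(1, NA, 3)), summary)

str(lens)
str(rng)
str(big)
```

Nothing in the sapply() call itself changed between these cases; only the data or the function's return value did. Code that later indexes the result as a matrix would break in the last two cases.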

`vapply()`: Type-safe version of `sapply()`

vapply() is a safer version of sapply().

You must explicitly specify the expected output type.

Form: vapply(X, FUN, FUN.VALUE = numeric(1))

FUN.VALUE tells R what each return value should look like.

vapply(my_list, mean, FUN.VALUE = numeric(1))

##          a          b          c 
## 3.00000000 5.50000000 0.06571215

This guarantees:

Each element returns a numeric scalar
Output will be a numeric vector
If a function returns the wrong type or length, R throws an error.

More robust inside larger functions.
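A sketch of what these guarantees look like in practice follows. The list `ex` is a made-up example, and tryCatch() is used here only so the deliberate error can be inspected without stopping the script.

```r
ex <- list(a = 1:5, b = 1:10, c = c(2, 4, 6))   # hypothetical example list

# Declared shape matches: one number per element
means <- vapply(ex, mean, FUN.VALUE = numeric(1))

# Length-2 results need a length-2 template; the result is a 2 x 3 matrix
rng <- vapply(ex, range, FUN.VALUE = numeric(2))

# A mismatch is an error rather than a silently different structure
bad <- tryCatch(
  vapply(ex, range, FUN.VALUE = numeric(1)),
  error = function(e) e
)
inherits(bad, "error")   # TRUE

# Zero-length input: sapply() returns an empty list, while vapply() returns
# an empty vector of the declared type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))
```

The zero-length case is one reason vapply() tends to behave better inside larger functions: downstream code gets the same type whether or not there was any input.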

`tapply()`: levels of a factor vector

tapply() applies a function to subsets of a vector defined by a factor.

Form: tapply(X, INDEX = group, FUN = my.fun)

# Compute the mean and sd of the Frost variable, within each region
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)

##     Northeast         South North Central          West 
##      132.7778       64.6250      138.8333      102.1538

tapply(state.x77[,"Frost"], INDEX=state.region, FUN=sd)

##     Northeast         South North Central          West 
##      30.89408      31.30682      23.89307      68.87652

class(state.region)

## [1] "factor"

table(state.region)

## state.region
##     Northeast         South North Central          West 
##             9            16            12            13

Equivalent loop logic:

Split vector by group
Apply function within each group
Collect results
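The three steps above can be sketched with split() and sapply(), then compared against tapply(). This is only an illustration of the logic, not a description of how tapply() is implemented internally; the names `frost`, `groups`, `by_split`, and `by_tapply` are chosen here.

```r
frost <- state.x77[, "Frost"]

# 1. Split the vector by group
groups <- split(frost, state.region)

# 2. Apply the function within each group, 3. collect the results
by_split <- sapply(groups, mean)

# Built-in one-step version for comparison
by_tapply <- tapply(frost, INDEX = state.region, FUN = mean)

# Same values and group names (tapply() returns a 1-d array, sapply() a vector)
all.equal(as.vector(by_tapply), unname(by_split))
```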

`split()`: split by levels of a factor

The function split() splits up the rows of a data frame by levels of a factor. For example, split(x, f=my.index) splits a data frame x according to the levels of my.index.

# Split up the state.x77 matrix according to region
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list

## [1] "list"

names(state.by.reg) # This has 4 elements for the 4 regions

## [1] "Northeast"     "South"         "North Central" "West"

class(state.by.reg[[1]]) # Each element is a data frame

## [1] "data.frame"

# For each region, display the first 3 rows of the data frame
lapply(state.by.reg, FUN=head, 3)

## $Northeast
##               Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Connecticut         3100   5348        1.1    72.48    3.1    56.0   139  4862
## Maine               1058   3694        0.7    70.39    2.7    54.7   161 30920
## Massachusetts       5814   4755        1.1    71.83    3.3    58.5   103  7826
## 
## $South
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Alabama        3615   3624        2.1    69.05   15.1    41.3    20 50708
## Arkansas       2110   3378        1.9    70.66   10.1    39.9    65 51945
## Delaware        579   4809        0.9    70.06    6.2    54.6   103  1982
## 
## $`North Central`
##          Population Income Illiteracy Life.Exp Murder HS.Grad Frost  Area
## Illinois      11197   5107        0.9    70.14   10.3    52.6   127 55748
## Indiana        5313   4458        0.7    70.88    7.1    52.9   122 36097
## Iowa           2861   4628        0.5    72.56    2.3    59.0   140 55941
## 
## $West
##            Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361

# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) { 
  return(apply(df, MARGIN=2, mean)) 
})

## $Northeast
##   Population       Income   Illiteracy     Life.Exp       Murder      HS.Grad 
##  5495.111111  4570.222222     1.000000    71.264444     4.722222    53.966667 
##        Frost         Area 
##   132.777778 18141.000000 
## 
## $South
##  Population      Income  Illiteracy    Life.Exp      Murder     HS.Grad 
##  4208.12500  4011.93750     1.73750    69.70625    10.58125    44.34375 
##       Frost        Area 
##    64.62500 54605.12500 
## 
## $`North Central`
##  Population      Income  Illiteracy    Life.Exp      Murder     HS.Grad 
##  4803.00000  4611.08333     0.70000    71.76667     5.27500    54.51667 
##       Frost        Area 
##   138.83333 62652.00000 
## 
## $West
##   Population       Income   Illiteracy     Life.Exp       Murder      HS.Grad 
## 2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00 6.200000e+01 
##        Frost         Area 
## 1.021538e+02 1.344630e+05
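In line with the earlier advice about optimized functions, the inner apply() call can be swapped for colMeans(), and sapply() can collect the per-region results into a single matrix. The following is a sketch of that variant; `region_means` is a name introduced here.

```r
# Recreate the split so this block runs on its own
state.by.reg <- split(data.frame(state.x77), f = state.region)

# colMeans() replaces apply(df, MARGIN = 2, mean); sapply() then binds the
# four length-8 results into an 8 x 4 matrix (variables x regions)
region_means <- sapply(state.by.reg, colMeans)

round(region_means, 2)
```

A matrix makes it easier to compare regions at a glance and avoids the mixed fixed and scientific notation seen in the list output.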

Control Flow: apply() family

The apply family

`apply()`: rows or columns of a matrix or data frame

Optimized functions for special tasks

`lapply()`: elements of a list or vector

`sapply()`: elements of a list or vector

`vapply()`: Type-safe version of `sapply()`

`tapply()`: levels of a factor vector

`split()`: split by levels of a factor

Conceptual hierarchy

When should you use `apply()` vs `for()`?
