Data Structures

Data types are the fundamental building blocks. Data structures organize these data types.

Common data structures in R

Vector
A one-dimensional sequence of values, all of the same type
Matrix
A two-dimensional array of values, all of the same type
List
A collection of objects that can be of different types and structures
Data frame
A table-like structure where each column is a vector; different columns may have different types
Factor
A special object used to represent categorical data
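
As a rough illustration of the five structures listed above, the sketch below builds one small object of each kind. The object names (v, m, lst, df, f) and their contents are made up for this example.

```r
# Hypothetical objects, one per structure listed above
v   <- c(1.5, 2.0, 3.5)                     # vector: all values numeric
m   <- matrix(1:6, nrow = 2)                # matrix: 2 x 3, all values integer
lst <- list(1:2, "a", TRUE)                 # list: entries of different types
df  <- data.frame(id = 1:3,                 # data frame: columns may differ in
                  name = c("a", "b", "c"))  #   type but share one length
f   <- factor(c("low", "high", "low"))      # factor: categorical data

class(v)           # "numeric"
dim(m)             # 2 3
length(lst)        # 3
sapply(df, class)  # one class per column
levels(f)          # "high" "low" (levels are sorted alphabetically)
```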

Lists

A list is the most general form of vector in R.

List entries can be of any type, and a single list can mix types:

l <- list(1:2, c("hat","mat","dat"))
l

## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] "hat" "mat" "dat"

List entries can be named:

lNamed <- list(foo = 1:2, bar = c("hat","mat","dat"))

lNamed

## $foo
## [1] 1 2
## 
## $bar
## [1] "hat" "mat" "dat"

Most of what you can do with vectors you can also do with lists

Accessing pieces of lists

Use [ ] as with vectors; the result is always a list.
Or use [[ ]], but only with a single index; the result is the element itself.
[[ ]] drops names and structures, [ ] does not.

l[2]

## [[1]]
## [1] "hat" "mat" "dat"

l[[2]]

## [1] "hat" "mat" "dat"

l[[2]][1]

## [1] "hat"
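
One way to see the difference between the two operators is to check the class of what each returns. A small sketch, re-creating l and lNamed from earlier so that it runs on its own:

```r
l <- list(1:2, c("hat", "mat", "dat"))

class(l[2])     # "list": [ ] returns a sub-list
class(l[[2]])   # "character": [[ ]] returns the element itself

# [ ] accepts several indices and still returns a list
length(l[1:2])  # 2

# With names, [ ] keeps the name and [[ ]] drops it
lNamed <- list(foo = 1:2, bar = c("hat", "mat", "dat"))
names(lNamed["bar"])    # "bar"
names(lNamed[["bar"]])  # NULL
```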

Expanding and contracting lists

Add to lists with c() (also works with vectors):

l1 <- c(list(TRUE), l)
l1

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] "hat" "mat" "dat"

str(l1)

## List of 3
##  $ : logi TRUE
##  $ : int [1:2] 1 2
##  $ : chr [1:3] "hat" "mat" "dat"

append(x, values, after) works similarly:

l2 <- append(l, list(TRUE), after = 0)
# after is the subscript after which the values are inserted; 0 puts them first
l2

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] "hat" "mat" "dat"

str(l2)

## List of 3
##  $ : logi TRUE
##  $ : int [1:2] 1 2
##  $ : chr [1:3] "hat" "mat" "dat"

Set a list entry to NULL to remove it:

l1[2:3] <- NULL
str(l1)

## List of 1
##  $ : logi TRUE
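
Two related idioms may be worth knowing alongside NULL assignment. The sketch below re-creates l1 and uses a hypothetical named list d; neither l1_short nor d appears elsewhere in these notes.

```r
l1 <- list(TRUE, 1:2, c("hat", "mat", "dat"))

# Negative indices return a copy without the listed positions
l1_short <- l1[-(2:3)]
length(l1_short)   # 1

# Named entries can be removed by name
d <- list(family = "exponential", mean = 7)
d$mean <- NULL
names(d)           # "family"

# To store an actual NULL as an entry, wrap it in list()
d["mean"] <- list(NULL)
length(d)          # 2
```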

Flattening lists

unlist() flattens a list into a vector. If the list contains mixed types, the entries are automatically coerced to a common type.

l3 <- c(list(1), l)
l3

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] "hat" "mat" "dat"

unlist(l3)

## [1] "1"   "1"   "2"   "hat" "mat" "dat"
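
The coercion follows the usual ordering for atomic vectors (logical, then numeric, then character). A brief sketch with made-up values:

```r
# Logical and numeric entries are coerced to numeric
unlist(list(TRUE, 2.5, 3L))   # 1.0 2.5 3.0

# Adding a character entry coerces everything to character
u <- unlist(list(TRUE, 2.5, "a"))
class(u)                      # "character"

# Names carry over, with a numeric suffix for entries longer than 1
names(unlist(list(foo = 1:2, bar = "hat")))   # "foo1" "foo2" "bar"
```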

Names in lists

We can name some or all of the elements of a list:

my.dist = list("exponential", 7, FALSE)
my.dist

## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE

names(my.dist) = c("family","mean","is.symmetric")
my.dist

## $family
## [1] "exponential"
## 
## $mean
## [1] 7
## 
## $is.symmetric
## [1] FALSE

my.dist[["family"]]

## [1] "exponential"

my.dist["family"]

## $family
## [1] "exponential"


In addition to indexing, lists have a special shortcut way of using names, with $:

my.dist[["family"]]

## [1] "exponential"

my.dist$family

## [1] "exponential"

Adding named elements:

my.dist$was.estimated = FALSE
my.dist[["last.updated"]] = "2026-01-01"
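
To check the effect of these two assignments, one can inspect the names afterwards. The sketch below rebuilds my.dist directly with names so that it runs on its own:

```r
my.dist <- list(family = "exponential", mean = 7, is.symmetric = FALSE)

# Both forms append a new named entry at the end of the list
my.dist$was.estimated <- FALSE
my.dist[["last.updated"]] <- "2026-01-01"

names(my.dist)    # original three names followed by the two new ones
length(my.dist)   # 5
```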

Key-value pairs

Lists give us a natural way to store and look up data by name, rather than by position.
This is a really useful programming concept that goes by many names: key-value pairs, dictionaries, or associative arrays.
If all our distributions have a component named family, we can look it up by name without caring where it sits (in what position) in the list.

# Defining a list with named components (key-value pairs)
normal_dist <- list(
  family = "Gaussian",
  mean = 0,
  sd = 1,
  link = "identity"
)

# Retrieval by name (position-agnostic)
normal_dist$family

## [1] "Gaussian"


# Alternative syntax using character strings
normal_dist[["family"]]

## [1] "Gaussian"
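
To illustrate position-agnostic lookup across several lists, the sketch below stores two hypothetical distributions whose components appear in different orders. The name dists and the exponential entry are assumptions made for this example.

```r
# Two hypothetical distributions; family sits in a different position in each
dists <- list(
  list(family = "Gaussian", mean = 0, sd = 1),
  list(rate = 2, family = "exponential")
)

# Look up the same key in every list, regardless of position
sapply(dists, function(d) d$family)   # "Gaussian" "exponential"

# A key that is absent returns NULL rather than an error
is.null(dists[[2]]$sd)                # TRUE
```

Note that $ allows partial matching of names on lists, while [[ ]] requires an exact match by default, so [[ ]] is the safer choice inside functions.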


Data frames

The Canonical Data Structure: Represents the standard $n \times p$ data matrix, where $n$ rows correspond to observations (cases) and $p$ columns represent variables (features).
Statistical Standard: Most modeling and visualization tools in R (e.g., lm() and the ggplot2 package) are designed to take a data frame as their primary input.
Heterogeneous Storage: Unlike a matrix, which requires all elements to be of the same data type, a data frame allows each column to have a distinct type (e.g., numeric, factor, or character), as long as all columns have the same length. Each column behaves like a separate vector.
Dual Behavior: Data frames inherit behaviors from both lists and matrices, so matrix-like operations (e.g., rowSums(), dim()) and summary tools (e.g., summary(), str()) both work and make it easy to inspect your data.
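
The points above can be tried out on a small data frame. The sketch below uses made-up data; the name obs and its columns are hypothetical.

```r
# Hypothetical data: 3 observations (rows) of 3 variables (columns)
obs <- data.frame(
  height = c(1.62, 1.75, 1.80),
  weight = c(55, 70, 82),
  smoker = c(FALSE, TRUE, FALSE)
)

dim(obs)            # 3 3: matrix-like dimensions (n rows, p columns)
is.list(obs)        # TRUE: a data frame is also a list of columns
sapply(obs, class)  # one type per column
str(obs)            # compact overview of the structure
summary(obs)        # per-column summaries
rowSums(obs[, c("height", "weight")])  # row-wise sums over the numeric columns
```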

Matrix vs. Data frame

a.mat = matrix(c(35,8,10,4), nrow=2)
colnames(a.mat) = c("v1","v2")
a.mat

##      v1 v2
## [1,] 35 10
## [2,]  8  4

a.mat[,"v1"]

## [1] 35  8

# Try a.mat$v1 and see what happens
a.mat$v1

## Error in a.mat$v1 : $ operator is invalid for atomic vectors

a.df = data.frame(a.mat,logicals=c(TRUE,FALSE))
a.df

##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE

a.df$v1

## [1] 35  8

a.df[,"v1"]

## [1] 35  8

a.df[1,]

##   v1 v2 logicals
## 1 35 10     TRUE

colMeans(a.df)

##       v1       v2 logicals 
##     21.5      7.0      0.5
