1 What are matrices and dataframes?

We now know about scalar and vector objects. However, neither object type is appropriate for storing lots of data, such as the results of a survey or experiment.

R has two object types that represent large data structures much better: matrices and dataframes.

Matrices and dataframes are very similar to spreadsheets in Excel. Every matrix or dataframe contains rows (let’s call their number m) and columns (n). Thus, while a vector has one dimension (its length), matrices and dataframes have two dimensions: their height (i.e., the number of rows) and their width (i.e., the number of columns).

While matrices and dataframes look very similar, they are not exactly the same.

While every element of a matrix must be of the same type (all numeric or all character), a dataframe can contain numeric and character columns side by side.

Because dataframes are more flexible, most real-world datasets, such as surveys containing both numeric (e.g., age) and character (e.g., favorite movie) data, are stored as dataframes in R.
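The difference can be seen in a minimal sketch. The vectors `age` and `movie` are made-up illustration data, and cbind() and data.frame() are the functions introduced in the next section. `stringsAsFactors = FALSE` is set explicitly so that the result does not depend on the R version.

```r
# Hypothetical illustration data: one numeric and one character vector
age <- c(24, 31, 45)
movie <- c("Alien", "Heat", "Up")

# cbind() builds a matrix, so the numbers are converted to character strings
m <- cbind(age, movie)
class(m[, "age"])
## [1] "character"

# data.frame() keeps each column's own type
df <- data.frame(age = age, movie = movie, stringsAsFactors = FALSE)
class(df$age)
## [1] "numeric"
class(df$movie)
## [1] "character"
```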

If dataframes are more flexible than matrices, why do we use matrices at all? Because matrices are simpler and take up less memory than dataframes. Additionally, some functions require matrices as inputs to ensure that they work correctly.

2 Creating matrices and dataframes

There are a number of ways to create matrix and dataframe objects in R. Because matrices and dataframes are just combinations of vectors, each function takes one or more vectors as inputs, and returns a matrix or a dataframe. Some of the common functions are:

2.1 cbind() and rbind()

cbind() and rbind() both create matrices by combining several vectors of the same length.

cbind() combines vectors as columns, while rbind() combines them as rows.

Let’s use these functions to create a matrix with the numbers 1 through 15.

# First, we will create three vectors of length 5
x <- 1:5
y <- 6:10
z <- 11:15
# Then we will combine them into one matrix 
# Create a matrix where x, y and z are columns
cbind(x, y, z)
##      x  y  z
## [1,] 1  6 11
## [2,] 2  7 12
## [3,] 3  8 13
## [4,] 4  9 14
## [5,] 5 10 15
# Create a matrix where x, y and z are rows
rbind(x, y, z)
##    [,1] [,2] [,3] [,4] [,5]
## x    1    2    3    4    5
## y    6    7    8    9   10
## z   11   12   13   14   15
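As a quick check, dim() (covered later in this tutorial) confirms the orientation of each result. This is only a sketch, and the names `by_col` and `by_row` are hypothetical:

```r
x <- 1:5
y <- 6:10
z <- 11:15

by_col <- cbind(x, y, z)  # vectors become columns
by_row <- rbind(x, y, z)  # vectors become rows

dim(by_col)
## [1] 5 3
dim(by_row)
## [1] 3 5
```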

2.2 matrix()

The matrix() function creates a matrix from a single vector of data.

The function has 4 main inputs:

  • data – a vector of data,
  • nrow – the number of rows we want in the matrix,
  • ncol – the number of columns we want in the matrix, and
  • byrow – a logical value indicating whether we want to fill the matrix by rows (the default, FALSE, fills it by columns).
# Create a matrix of the integers 1:10,
# with 5 rows and 2 columns
matrix(data = 1:10,
       nrow = 5,
       ncol = 2)

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
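The example above fills the matrix column by column. To illustrate the byrow input, the same call with byrow = TRUE fills it row by row instead. The name `m_byrow` is only a placeholder for this sketch:

```r
# Same data and shape as before, but filled row by row
m_byrow <- matrix(data = 1:10,
                  nrow = 5,
                  ncol = 2,
                  byrow = TRUE)
m_byrow
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10
```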

2.3 data.frame()

To create a dataframe from vectors, we will use the data.frame() function. The data.frame() function works very similarly to cbind(). The main difference is that in data.frame(), we specify a name for each of the columns as we define them.

Note that, unlike matrices, dataframes can contain both character vectors and numeric vectors within the same object. Because they are more flexible than matrices, most large datasets in R will be stored as dataframes.

# Let's create a simple dataframe called `survey` with a mixture of text 
# and numeric columns. The text column records the sex of the participants 
# and the numeric columns record their index and age.
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "sex" = c("m", "m", "m", "f", "f"),
                     "age" = c(99, 46, 23, 54, 23))
survey
##   index sex age
## 1     1   m  99
## 2     2   m  46
## 3     3   m  23
## 4     4   f  54
## 5     5   f  23
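To confirm that each column keeps its own type, one option is to apply class() to every column with sapply(). This is a sketch rather than part of the original walkthrough. `stringsAsFactors = FALSE` is added here as an assumption, so that the text column is stored as character in older R versions too.

```r
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "sex" = c("m", "m", "m", "f", "f"),
                     "age" = c(99, 46, 23, 54, 23),
                     stringsAsFactors = FALSE)

# Report the type of every column
sapply(survey, class)
##       index         sex         age 
##   "numeric" "character"   "numeric" 
```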

3 Matrix and Dataframe functions

R has lots of functions for viewing matrices and dataframes and returning information about them. Some of the most common are:

# head() shows the first few rows
head(iris)

# tail() shows the last few rows
tail(iris)

# View() shows an entire dataframe in a separate window 
# that looks like a spreadsheet (note the capital V)
View(iris)

# dim() shows the number of rows and columns
dim(iris)
## [1] 150   5

# summary() shows summary statistics for all columns in a dataframe
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
## Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
## 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
## Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
## Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
## 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
## Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

[Figures: output of the head(), tail(), and View() functions]
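A few related calls are worth knowing alongside the ones above. The sketch below uses nrow() and ncol(), which return the two values of dim() separately, and the n argument of head(), which sets how many rows are shown:

```r
# Number of rows and columns, one at a time
nrow(iris)
## [1] 150
ncol(iris)
## [1] 5

# Show only the first three rows
head(iris, n = 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
```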

4 Dataframe columns

Each column of a dataframe has a name. We can use these names to access specific columns without having to know which column number it is.

4.1 Column names

To access the names of a dataframe, we use the names() function. It returns a character vector with the column names.

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species" 

4.2 Access specific columns

To access a specific column in a dataframe by name, we use the $ operator in the form df$name, where df is the name of the dataframe, and name is the name of the column we are interested in. This operation will then return the specified column as a vector.

# Let's use the $ operator to get a vector of just the `Sepal.Width` column 
# from the `iris` dataframe
iris$Sepal.Width

The result is a numeric vector of 150 values, one sepal width for each row of iris.

Because the $ operator returns a vector, we can easily calculate descriptive statistics on a column of a dataframe by applying a vector function, like mean() or table(). For example:

# Let's calculate the average petal width of the iris dataframe
mean(iris$Petal.Width)

## [1] 1.199333
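The same pattern works with other vector functions. As a short sketch, table() counts how often each value occurs in a column, and max() returns the largest value:

```r
# Count the number of flowers of each species
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50 

# Largest petal width in the dataset
max(iris$Petal.Width)
## [1] 2.5
```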

If we want to access several columns by name, we use a character vector of column names in brackets:

# Give me the `Sepal.Length` AND `Petal.Width` columns of the `iris` dataframe
head(iris[c("Sepal.Length", "Petal.Width")])

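The results will look like the following:

##   Sepal.Length Petal.Width
## 1          5.1         0.2
## 2          4.9         0.2
## 3          4.7         0.2
## 4          4.6         0.2
## 5          5.0         0.2
## 6          5.4         0.4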
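Bracket indexing with a single name behaves slightly differently from the $ operator, which the following sketch tries to make explicit. Single brackets return a dataframe with one column, while double brackets return the column itself, just like $. The variable names `one_col_df` and `one_col_vec` are hypothetical.

```r
one_col_df <- iris["Species"]      # single brackets: a dataframe with one column
one_col_vec <- iris[["Species"]]   # double brackets: the column itself

class(one_col_df)
## [1] "data.frame"
class(one_col_vec)
## [1] "factor"

# Double brackets and $ give the same result
identical(one_col_vec, iris$Species)
## [1] TRUE
```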

4.3 Adding new columns

We can add new columns to a dataframe. To do this, we will just use the df$name notation and assign a new vector of data to it.

# For example, let us create a dataframe called `survey` with two columns: 
# index and age
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22))
survey

The result will look like the following:

##   index age
## 1     1  24
## 2     2  25
## 3     3  42
## 4     4  56
## 5     5  22

# Now we will add a column `gender` with a vector of `gender` data
survey$gender <- c("m", "m", "f", "f", "m")
survey

The result will now look like the following:

##   index age gender
## 1     1  24      m
## 2     2  25      m
## 3     3  42      f
## 4     4  56      f
## 5     5  22      m
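A new column does not have to be typed in by hand. It can also be computed from an existing column. The sketch below rebuilds the two-column `survey` dataframe so that it runs on its own. The column name `age.months` is a hypothetical example.

```r
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22))

# Derive a new column from an existing one (age in months)
survey$age.months <- survey$age * 12
survey
##   index age age.months
## 1     1  24        288
## 2     2  25        300
## 3     3  42        504
## 4     4  56        672
## 5     5  22        264
```

The vector assigned to a new column should have one value per row of the dataframe.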

4.4 Changing column names

To change the name of a column in a dataframe, we index names(df) with a logical vector, using the format names(df)[names(df) == "old.name"] <- "new.name".

Here is how to read this: “Change the names of df, but only where the original name was "old.name", to "new.name".”

For example, in the survey dataframe, we want to change the column name index to participant.number.

names(survey)[names(survey) == "index"] <- "participant.number"
survey

The result will now look like the following:

##   participant.number age gender
## 1                  1  24      m
## 2                  2  25      m
## 3                  3  42      f
## 4                  4  56      f
## 5                  5  22      m
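To see why this works, it can help to look at the logical vector on its own. The sketch below rebuilds a two-column version of `survey` so that it runs independently of the earlier steps:

```r
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22))

# The comparison is TRUE only at the position of the old name
names(survey) == "index"
## [1]  TRUE FALSE

# Assigning through that logical vector replaces just that one name
names(survey)[names(survey) == "index"] <- "participant.number"
names(survey)
## [1] "participant.number" "age"
```

Selecting the name by value rather than by position means the code keeps working if the column order changes.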