We now know about scalar and vector objects. However, neither object type is well suited to storing lots of data, such as the results of a survey or experiment.
R has two object types that represent large data structures much better: matrices and dataframes.
Matrices and dataframes are very similar to spreadsheets in Excel.
Every matrix or dataframe has rows (let’s call their number m)
and columns (n). Thus, while a vector has one dimension
(its length), matrices and dataframes have two dimensions, representing
their height (i.e., rows) and width (i.e., columns).
While matrices and dataframes look very similar, they are not exactly the same.
A matrix can hold only one type of data: its columns are either all numeric or all character. A dataframe can mix numeric and character columns in the same object.
Because dataframes are more flexible, most real-world datasets, such as surveys containing both numeric (e.g., age) and character (e.g., favorite movie) data, are stored as dataframes in R.
If dataframes are more flexible than matrices, why do we use matrices at all? Because matrices are simpler and generally take up less memory than dataframes. Additionally, some functions require matrices as inputs to ensure that they work correctly.
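To make the difference concrete, here is a small sketch. The vectors `ages` and `movies` are made up for illustration. Combining a numeric and a character vector with cbind() produces a character matrix, because a matrix holds a single type, while data.frame() keeps each column's own type.

```r
# Hypothetical example data: one numeric and one character vector
ages   <- c(25, 31, 47)
movies <- c("Alien", "Heat", "Up")

# A matrix can hold only one type, so the numbers are converted to text
m <- cbind(ages, movies)
class(m[, "ages"])   # "character"

# A dataframe keeps each column's own type
df <- data.frame(ages = ages, movies = movies)
class(df$ages)       # "numeric"
```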
There are a number of ways to create matrix and dataframe objects in R. Because matrices and dataframes are just combinations of vectors, each function takes one or more vectors as inputs, and returns a matrix or a dataframe. Some of the common functions are:
cbind() and rbind()
cbind() and rbind() both create matrices by
combining several vectors of the same length.
cbind() combines vectors as columns, while
rbind() combines them as rows.
Let’s use these functions to create a matrix with the numbers 1 through 15.
# First, we will create three vectors of length 5
x <- 1:5
y <- 6:10
z <- 11:15
# Then we will combine them into one matrix
# Create a matrix where x, y and z are columns
cbind(x, y, z)
##      x  y  z
## [1,] 1  6 11
## [2,] 2  7 12
## [3,] 3  8 13
## [4,] 4  9 14
## [5,] 5 10 15
# Create a matrix where x, y and z are rows
rbind(x, y, z)
##   [,1] [,2] [,3] [,4] [,5]
## x    1    2    3    4    5
## y    6    7    8    9   10
## z   11   12   13   14   15
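As a quick check on the two results, the sketch below stores them under the illustrative names `by_col` and `by_row` and compares their dimensions. The two matrices hold the same values, just laid out in opposite orientations.

```r
x <- 1:5
y <- 6:10
z <- 11:15

by_col <- cbind(x, y, z)
by_row <- rbind(x, y, z)

dim(by_col)  # 5 rows, 3 columns
dim(by_row)  # 3 rows, 5 columns

# Transposing the column-wise matrix gives the same values as the row-wise one
all(t(by_col) == by_row)
```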
matrix()
The matrix() function creates a matrix from a single
vector of data.
The function has 4 main inputs:
data: a vector of data,
nrow: the number of rows we want in the matrix,
ncol: the number of columns we want in the matrix, and
byrow: a logical value indicating whether we want to
fill the matrix by rows (the default, FALSE, fills it by columns).
# Create a matrix of the integers 1:10,
# with 5 rows and 2 columns
matrix(data = 1:10,
nrow = 5,
ncol = 2)
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
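For comparison, here is a sketch of the same call with byrow = TRUE, so the numbers run across each row instead of down each column. The name `m_byrow` is only for illustration.

```r
# Same data and dimensions as above, but filled row by row
m_byrow <- matrix(data = 1:10,
                  nrow = 5,
                  ncol = 2,
                  byrow = TRUE)
m_byrow
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10
```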
data.frame()
To create a dataframe from vectors, we will use the
data.frame() function. The data.frame()
function works very similarly to cbind(). The main
difference is that in data.frame(), we specify the name of
each column as we define it.
Note that, unlike matrices, dataframes can contain both string vectors and numeric vectors within the same object. Because they are more flexible than matrices, most large datasets in R will be stored as dataframes.
# Let's create a simple dataframe called `survey` with a mixture of text
# and numeric columns. The text column records the sex of the participants
# and the numeric columns record their index and age.
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
"sex" = c("m", "m", "m", "f", "f"),
"age" = c(99, 46, 23, 54, 23))
survey
##   index sex age
## 1     1   m  99
## 2     2   m  46
## 3     3   m  23
## 4     4   f  54
## 5     5   f  23
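One way to confirm that the dataframe really keeps different types side by side is to ask for the class of each column. The sketch below rebuilds `survey` so it runs on its own. The `stringsAsFactors = FALSE` argument is an addition here, an assumption made so the text column stays a character vector on older R versions too.

```r
survey <- data.frame(index = c(1, 2, 3, 4, 5),
                     sex = c("m", "m", "m", "f", "f"),
                     age = c(99, 46, 23, 54, 23),
                     stringsAsFactors = FALSE)  # keep text as character in any R version

# sapply() applies class() to every column and returns the results
sapply(survey, class)
```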
R has lots of functions for viewing matrices and dataframes and returning information about them. Some of the most common are:
# head() shows the first few rows
head(iris)
# tail() shows the last few rows
tail(iris)
# View() shows an entire dataframe in a separate window
# that looks like a spreadsheet
View(iris)
# dim() shows the number of rows and columns
dim(iris)
## [1] 150 5
# summary() shows summary statistics for all columns in a dataframe
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
The results will look like the following:
[Figures: output of the head(), tail(), and View() functions]
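A few other base R functions are often handy alongside these. They are not part of the walkthrough above, so treat this as a brief supplementary sketch.

```r
nrow(iris)  # number of rows: 150
ncol(iris)  # number of columns: 5
str(iris)   # compact overview: column names, types, and first few values
```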
Each column of a dataframe has a name. We can use these names to access specific columns without having to know their column numbers.
To access the names of a dataframe, we use the names()
function. It returns a character vector with the column names.
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
To access a specific column in a dataframe by name, we use the
$ operator in the form df$name, where
df is the name of the dataframe, and name is
the name of the column we are interested in. This operation will then
return the specified column as a vector.
# Let's use the $ operator to get a vector of just the `Sepal.Width` column
# from the `iris` dataframe
iris$Sepal.Width
The results will look like the following:
[Figure: the values of iris$Sepal.Width printed as a numeric vector]
Because the $ operator returns a vector, we can easily
calculate descriptive statistics on a column of a dataframe by applying
a vector function, like mean() or table(). For
example:
# Let's calculate the average petal width of the iris dataframe
mean(iris$Petal.Width)
## [1] 1.199333
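The same pattern works with table() and other vector functions. In this sketch the variable name `species_counts` is illustrative.

```r
# Count how many flowers belong to each species
species_counts <- table(iris$Species)
species_counts

# Other vector functions work the same way
median(iris$Petal.Width)
max(iris$Petal.Width)
```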
If we want to access several columns by name, we use a character vector of column names in brackets:
# Give me the `Sepal.Length` AND `Petal.Width` columns of the `iris` dataframe
head(iris[c("Sepal.Length", "Petal.Width")])
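The results will look like the following:
##   Sepal.Length Petal.Width
## 1          5.1         0.2
## 2          4.9         0.2
## 3          4.7         0.2
## 4          4.6         0.2
## 5          5.0         0.2
## 6          5.4         0.4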
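As a related sketch, bracket indexing by name behaves differently depending on the kind of bracket: single brackets keep the result as a dataframe, while double brackets return the column as a vector, just like the $ operator. The variable names below are illustrative.

```r
one_col_df  <- iris["Sepal.Width"]    # single brackets: a dataframe with one column
one_col_vec <- iris[["Sepal.Width"]]  # double brackets: a plain numeric vector

class(one_col_df)   # "data.frame"
class(one_col_vec)  # "numeric"

# The double-bracket form gives the same vector as the $ operator
identical(one_col_vec, iris$Sepal.Width)
```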
We can add new columns to a dataframe. To do this, we will just use
the df$name notation and assign a new vector of data to
it.
# For example, let us create a dataframe called `survey` with two columns:
# index and age
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
"age" = c(24, 25, 42, 56, 22))
survey
The result will look like the following:
##   index age
## 1     1  24
## 2     2  25
## 3     3  42
## 4     4  56
## 5     5  22
# Now we will add a column `gender` with a vector of `gender` data
survey$gender <- c("m", "m", "f", "f", "m")
survey
The result will now look like the following:
##   index age gender
## 1     1  24      m
## 2     2  25      m
## 3     3  42      f
## 4     4  56      f
## 5     5  22      m
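A new column can also be computed from an existing one. The sketch below rebuilds `survey` so it runs on its own and adds a hypothetical `age.months` column, which is not part of the original example. It also notes the length requirement on the vector being assigned.

```r
survey <- data.frame(index = c(1, 2, 3, 4, 5),
                     age = c(24, 25, 42, 56, 22))

# Hypothetical derived column: age in months, computed from the age column
survey$age.months <- survey$age * 12

# The new vector needs one value per row (here, 5);
# a vector of the wrong length, e.g. c("m", "f", "m"), would raise an error
survey$gender <- c("m", "m", "f", "f", "m")

names(survey)
```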
To change the name of a column in a dataframe, we use logical
indexing in the form
names(df)[names(df) == "old.name"] <- "new.name".
Here is how to read this:
“Change the names of df, but only where the original name was "old.name", to "new.name".”
For example, in the survey dataframe, we want to change
the column name index to
participant.number.
names(survey)[names(survey) == "index"] <- "participant.number"
survey
The result will now look like the following:
##   participant.number age gender
## 1                  1  24      m
## 2                  2  25      m
## 3                  3  42      f
## 4                  4  56      f
## 5                  5  22      m
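To see why the renaming expression works, it can help to split it into two steps. This sketch uses a fresh two-column `survey` and an illustrative helper variable `is_old`.

```r
survey <- data.frame(index = c(1, 2, 3, 4, 5),
                     age = c(24, 25, 42, 56, 22))

# Step 1: compare every column name with the old name
is_old <- names(survey) == "index"
is_old  # TRUE for the first column, FALSE for the second

# Step 2: assign the new name only where the comparison is TRUE
names(survey)[is_old] <- "participant.number"
names(survey)
```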