We know how to do basic arithmetic operations like + (addition), - (subtraction), and * (multiplication) on scalars. Refer to the course material “Introduction to R_2”. Now, we will learn how to do arithmetic operations on numeric vectors:
# Make two vectors 'a' and 'b'
a <- c(1, 2, 3, 4, 5)
b <- c(10, 20, 30, 40, 50)
# Add the two vectors
a + b
## [1] 11 22 33 44 55
# Note that R has combined the vectors element-by-element.
# This happens whenever we apply an arithmetic operation to vectors of the same length.
Let us create two vectors ‘a’ and ‘b’ where each vector contains the
integers from 1 to 5. We will then create three new vectors
ab.sum (the sum of the two vectors), ab.diff
(the difference of the two vectors), and ab.prod (the
product of the two vectors):
# Make two vectors 'a' and 'b'
a <- 1:5
b <- 1:5
# Note that we have used the `a:b` operator to create the vectors
# (Refer to Introduction to R_3)
ab.sum <- a + b
ab.diff <- a - b
ab.prod <- a * b
ab.sum
## [1] 2 4 6 8 10
ab.diff
## [1] 0 0 0 0 0
ab.prod
## [1] 1 4 9 16 25
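The same element-by-element rule applies to the other arithmetic operators. The short sketch below uses division (`/`) and exponentiation (`^`) on the vectors from the first example, and also shows a single number being combined with every element of a vector. The names `ratio`, `squares`, and `doubled` are introduced here only for illustration.

```r
# Two vectors of the same length
a <- c(1, 2, 3, 4, 5)
b <- c(10, 20, 30, 40, 50)

# Division works element by element, just like + and *
ratio <- b / a
ratio
## [1] 10 10 10 10 10

# Exponentiation squares each element separately
squares <- a^2
squares
## [1]  1  4  9 16 25

# A single number is applied to every element of the vector
doubled <- a * 2
doubled
## [1]  2  4  6  8 10
```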
Now that we can create vectors, let us learn the basic descriptive
statistics functions. Each of these functions takes a numeric vector as
an argument, and returns a scalar (or a table in the case of
summary()) as a result. We will start with functions that
apply to continuous data, e.g., height and weight.
Suppose some students took two exams. Here are their results:
midterm <- c(62, 68, 75, 79, 55, 62, 89, 76, 45, 67)
final <- c(78, 72, 97, 82, 60, 83, 92, 73, 50, 88)
We need to find out how many students took each of these exams.
length(midterm)
## [1] 10
length(final)
## [1] 10
# What is the minimum score in the midterm exam?
min(midterm)
## [1] 45
# What is the highest score in the final exam?
max(final)
## [1] 97
# What is the average score in the midterm exam?
mean(midterm)
## [1] 67.8
# What is the median score in the final exam?
median(final)
## [1] 80
# What is the standard deviation of the midterm grades?
sd(midterm)
## [1] 12.67368
# What is the standard deviation of the final grades?
sd(final)
## [1] 14.39329
# Give the summary statistics of the midterm grades.
summary(midterm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.00 62.00 67.50 67.80 75.75 89.00
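To see what `mean()` and `sd()` compute, the sketch below rebuilds both from `sum()` and `length()` for the midterm scores. Note that `sd()` uses the sample formula, with `n - 1` in the denominator. The names `n`, `manual.mean`, and `manual.sd` are assumptions made for this illustration.

```r
midterm <- c(62, 68, 75, 79, 55, 62, 89, 76, 45, 67)

n <- length(midterm)

# Mean: the sum of the scores divided by the number of scores
manual.mean <- sum(midterm) / n
manual.mean
## [1] 67.8

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
manual.sd <- sqrt(sum((midterm - manual.mean)^2) / (n - 1))

# Both comparisons should return TRUE
all.equal(manual.mean, mean(midterm))
all.equal(manual.sd, sd(midterm))
```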
# Round elements to specified digits
round(6.978765,2)
## [1] 6.98
# Round elements to the next highest integer
ceiling(9.675436)
## [1] 10
# Round elements to the next lowest integer
floor(c(5.1, 7.9))
## [1] 5 7
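A side-by-side comparison can make the rounding functions easier to tell apart. The sketch below applies `round()`, `ceiling()`, and `floor()` to the same vector. The vector `x` and its values are made up for illustration.

```r
x <- c(5.1, 7.9, -2.3)

# Nearest integer
round(x)
## [1]  5  8 -2

# Next highest integer
ceiling(x)
## [1]  6  8 -2

# Next lowest integer
floor(x)
## [1]  5  7 -3
```

For negative numbers, "next lowest" means further from zero, which is why `floor(-2.3)` is -3.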
Next, we will move on to common counting functions for vectors with discrete or non-numeric data. Discrete data, such as gender and occupation, allow only a finite (or at least plausibly finite) set of responses. The common functions for discrete vectors also take a vector as an argument; however, unlike the previous functions we looked at, they accept either numeric or character vectors.
For example, here are the gender and height (in cm) data of some students in a class.
gender <- c("M", "M", "F", "F", "F", "M", "F", "M", "F") # M = Male, F = Female
height <- c(160, 165, 165, 155, 155, 160, 169, 170, 168)
# To identify the unique values in the vectors.
unique(gender)
## [1] "M" "F"
unique(height)
## [1] 160 165 155 169 170 168
# Note that this function will tell us all the unique values in the vector,
# but will not tell us anything about how often each value occurs.
# To count how many times each unique value occurs in the vectors.
table(gender)
## gender
## F M
## 5 4
table(height)
## height
## 155 160 165 168 169 170
## 2 2 2 1 1 1
# If we want to get a table of proportions instead of counts,
# we can just divide the result of the `table()` function
# by the `sum()` of the result:
table(gender) / sum(table(gender))
## gender
## F M
## 0.5555556 0.4444444
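Base R also provides `prop.table()`, which performs the same division in one step. The sketch below uses it on the gender data and then converts the result to percentages rounded to one decimal place. The names `gender.prop` and `gender.pct` are assumptions made for this illustration.

```r
gender <- c("M", "M", "F", "F", "F", "M", "F", "M", "F")

# prop.table() converts a table of counts into proportions
gender.prop <- prop.table(table(gender))

# Multiply by 100 and round to get percentages
gender.pct <- round(100 * gender.prop, 1)
gender.pct
## gender
##    F    M
## 55.6 44.4
```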
A common task in statistics is to standardize variables – also known as calculating z-scores. The purpose of standardizing a vector is to put it on a common scale which allows us to compare it to other (standardized) variables.
To standardize a vector, we subtract the vector's mean from each of its elements, and then divide the result by the vector’s standard deviation.
For example, suppose we have the competition results of the five best athletes in a class. The tournament included cricket and football games. In cricket, we recorded the total runs each athlete scored in the tournament. In football, we counted the number of goals each athlete scored in the tournament. Here are the results:
| Athlete | Cricket | Football |
|---|---|---|
| AA | 120 | 2 |
| AB | 80 | 6 |
| AC | 100 | 1 |
| AD | 60 | 8 |
| AE | 50 | 12 |
We can represent the results with two vectors cricket
and football:
cricket <- c(120, 80, 100, 60, 50)
football <- c(2, 6, 1, 8, 12)
As we can see, the scales of the numbers are very different. While
the football numbers range from 1 to 12, the
cricket numbers have a much larger range from 50 to 120.
This makes it difficult to compare the two sets of numbers directly.
To solve this problem, we will use standardization. We will first
create new standardized vectors called cricket.z and
football.z.
cricket.z <- (cricket - mean(cricket)) / sd(cricket)
football.z <- (football - mean(football)) / sd(football)
cricket.z
## [1] 1.32701756 -0.06984303 0.62858727 -0.76827333 -1.11748847
football.z
## [1] -0.84548890 0.04449942 -1.06798598 0.48949358 1.37948189
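One way to check a standardization is to confirm that the result has a mean of 0 and a standard deviation of 1. The sketch below wraps the calculation in a small helper and compares it with base R's `scale()` function. The helper name `standardize` is hypothetical and is not part of R.

```r
cricket <- c(120, 80, 100, 60, 50)

# Hypothetical helper: subtract the mean, then divide by the standard deviation
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

cricket.z <- standardize(cricket)

# A standardized vector has mean 0 (up to rounding error) and sd 1
all.equal(mean(cricket.z), 0)
all.equal(sd(cricket.z), 1)

# Base R's scale() returns the same values as a one-column matrix
all.equal(as.vector(scale(cricket)), cricket.z)
```

All three comparisons should return `TRUE`. A helper like this avoids retyping the formula for every vector that needs standardizing.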
It looks like there were two outstanding performances in particular. In cricket, the first athlete (AA) had a z-score of 1.33. We can interpret this by saying that AA scored 1.33 standard deviations more runs than the average athlete. In football, the last athlete (AE) had a z-score of 1.38. Here, we would conclude that AE scored 1.38 standard deviations more goals than the average athlete.
But which athlete was the best on average across both events?
To answer this, we will create a combined z-score for each athlete: the average of that athlete's z-scores across the two events. We will do this by adding the two z-scores and dividing by two. This will tell us how well, on average, each athlete did relative to their fellow athletes.
average.z <- (cricket.z + (football.z)) / 2
average.z
## [1] 0.24076433 -0.01267181 -0.21969936 -0.13938987 0.13099671
The highest average z-score belongs to the first athlete (AA) who had an average z-score value of 0.24.
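Rather than reading the winner off the printed output, we can let R find it. The sketch below attaches the athlete labels from the table to the scores, then uses `which.max()` and `sort()` to rank the athletes. The vector `athletes` is an assumption made for this illustration.

```r
athletes <- c("AA", "AB", "AC", "AD", "AE")
cricket <- c(120, 80, 100, 60, 50)
football <- c(2, 6, 1, 8, 12)

cricket.z <- (cricket - mean(cricket)) / sd(cricket)
football.z <- (football - mean(football)) / sd(football)
average.z <- (cricket.z + football.z) / 2

# Attach the athlete labels to the scores
names(average.z) <- athletes

# Athlete with the highest average z-score
names(which.max(average.z))
## [1] "AA"

# All athletes, ranked from best to worst
round(sort(average.z, decreasing = TRUE), 2)
##    AA    AE    AB    AD    AC
##  0.24  0.13 -0.01 -0.14 -0.22
```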