1 Arithmetic operations

We already know how to perform basic arithmetic operations such as + (addition), - (subtraction), and * (multiplication) on scalars (see the course material “Introduction to R_2”). Now we will learn how to apply the same operations to numeric vectors:

# Make two vectors 'a' and 'b'
a <- c(1, 2, 3, 4, 5)
b <- c(10, 20, 30, 40, 50)
# Add the two vectors
a + b
## [1] 11 22 33 44 55
# Note that R has combined the vectors element by element.
# This happens whenever we apply an arithmetic operator to vectors of the same length.
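
The example above uses two vectors of equal length. As a small aside, the sketch below shows what R does when the lengths differ: the shorter vector is recycled. The names `x`, `y`, and `k` are made up for this illustration and are not part of the course data.

```r
# Hypothetical vectors for illustration only
x <- c(1, 2, 3, 4)
y <- c(10, 20)
k <- 100

# A scalar is a vector of length 1, so it is reused for every element
x + k
## [1] 101 102 103 104

# The shorter vector 'y' is reused as 10, 20, 10, 20
x + y
## [1] 11 22 13 24
```

If the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but prints a warning.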

Let us create two vectors ‘a’ and ‘b’ where each vector contains the integers from 1 to 5. We will then create three new vectors ab.sum (the sum of the two vectors), ab.diff (the difference of the two vectors), and ab.prod (the product of the two vectors):

# Make two vectors 'a' and 'b'
a <- 1:5 
b <- 1:5
# Note that we have used the `:` operator (as in `1:5`) to create the vectors
# (refer to Introduction to R_3)
ab.sum <- a + b
ab.diff <- a - b
ab.prod <- a * b
ab.sum
## [1]  2  4  6  8 10
ab.diff
## [1] 0 0 0 0 0
ab.prod
## [1]  1  4  9 16 25

2 Summary statistics

Now that we can create vectors, let us learn the basic descriptive statistics functions. Each of these functions takes a numeric vector as an argument and returns a single value (or several values, in the case of range() and summary()) as a result. We will start with functions that apply to continuous data, e.g., height and weight.

Students took two exams, a midterm and a final. Here are their scores:

midterm <- c(62, 68, 75, 79, 55, 62, 89, 76, 45, 67)
final <- c(78, 72, 97, 82, 60, 83, 92, 73, 50, 88)

2.1 Length

  • length(x), where x is the vector
We need to find out how many students took each of these exams.
length(midterm)
## [1] 10
length(final)
## [1] 10

2.2 Minimum - Maximum

  • min(x)
  • max(x)
# What is the minimum score in the midterm exam?
min(midterm)
## [1] 45
# What is the highest score in the final exam?
max(final)
## [1] 97

2.3 Mean and Median

  • mean(x)
  • median(x)
# What is the average score in the midterm exam?
mean(midterm)
## [1] 67.8
# What is the median score in the final exam?
median(final)
## [1] 80
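
To see what these two functions compute, we can reproduce them by hand. This is only a sketch for the midterm scores; the helper name `sorted` is introduced here for illustration.

```r
midterm <- c(62, 68, 75, 79, 55, 62, 89, 76, 45, 67)

# The mean is the sum of the scores divided by the number of scores
sum(midterm) / length(midterm)
## [1] 67.8

# With an even number of scores (10 here), the median is the average
# of the two middle values of the sorted vector
sorted <- sort(midterm)
sorted
## [1] 45 55 62 62 67 68 75 76 79 89
(sorted[5] + sorted[6]) / 2
## [1] 67.5
```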

2.4 Variance and standard deviation

  • var(x)
  • sd(x)
  • range(x), which returns the minimum and the maximum
# What is the standard deviation of the midterm grades?
sd(midterm)
## [1] 12.67368
# What is the standard deviation of the final grades?
sd(final)
## [1] 14.39329
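
The functions var() and range() are listed above but not demonstrated. A short sketch with the midterm grades shows both, and also that the standard deviation is the square root of the variance:

```r
midterm <- c(62, 68, 75, 79, 55, 62, 89, 76, 45, 67)

# Variance of the midterm grades
var(midterm)
## [1] 160.6222

# The standard deviation is the square root of the variance
sqrt(var(midterm))
## [1] 12.67368

# range() returns the minimum and the maximum as a vector of length 2
range(midterm)
## [1] 45 89
```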

2.5 Summary

  • summary(x)
# Give the summary statistics of the midterm grades.
summary(midterm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   45.00   62.00   67.50   67.80   75.75   89.00
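
If we only need the quartiles, the quantile() function returns them without the mean. The sketch below uses its default settings, which give the same quartiles that summary() reports for these data:

```r
midterm <- c(62, 68, 75, 79, 55, 62, 89, 76, 45, 67)

# Minimum, quartiles, and maximum of the midterm grades
quantile(midterm)
##    0%   25%   50%   75%  100% 
## 45.00 62.00 67.50 75.75 89.00
```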

2.6 Round elements

  • round(x,digits)
  • ceiling(x)
  • floor(x)
# Round elements to specified digits
round(6.978765,2)
## [1] 6.98
# Round elements to the next highest integer
ceiling(9.675436)
## [1] 10
# Round elements down to the next lowest integer
floor(c(5.1, 7.9))
## [1] 5 7
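
The rounding functions are often combined with the statistics from the previous sections, and they work element by element on vectors. A brief sketch follows; the vector `x` holds made-up values for illustration.

```r
midterm <- c(62, 68, 75, 79, 55, 62, 89, 76, 45, 67)

# Round the standard deviation of the midterm grades to two decimal places
round(sd(midterm), 2)
## [1] 12.67

# Hypothetical values for illustration
x <- c(3.2, 8.7)

# With no 'digits' argument, round() rounds to the nearest integer
round(x)
## [1] 3 9
ceiling(x)
## [1] 4 9
floor(x)
## [1] 3 8
```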

3 Counting statistics

Next, we will move on to common counting functions for vectors with discrete or non-numeric data. Discrete data, such as gender and occupation, allow only a finite (or at least, plausibly finite) set of responses. Like the previous functions, the common functions for discrete data take a vector as an argument. Unlike the previous functions, however, they accept either numeric or character vectors.

For example, here are the gender and height (in cm) data of some students in a class.

gender <- c("M", "M", "F", "F", "F", "M", "F", "M", "F") # M = Male, F = Female
height <- c(160, 165, 165, 155, 155, 160, 169, 170, 168)

3.1 Unique data

  • unique(x)
# To identify the unique values in the vectors.
unique(gender)
## [1] "M" "F"
unique(height)
## [1] 160 165 155 169 170 168
# Note that this function will tell us all the unique values in the vector, 
# but will not tell us anything about how often each value occurs.
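
Combining unique() with functions we have already seen answers two common follow-up questions: how many distinct values are there, and what are they in sorted order? A minimal sketch with the height data:

```r
height <- c(160, 165, 165, 155, 155, 160, 169, 170, 168)

# Number of distinct heights
length(unique(height))
## [1] 6

# Distinct heights in increasing order
sort(unique(height))
## [1] 155 160 165 168 169 170
```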

3.2 Unique data with frequency

  • table(x)
# To count how often each unique value occurs in the vectors.
table(gender)
## gender
## F M 
## 5 4
table(height)
## height
## 155 160 165 168 169 170 
##   2   2   2   1   1   1 

# If we want to get a table of proportions instead of counts, 
# we can just divide the result of the `table()` function 
# by the `sum()` of the result (multiply by 100 for percentages):
table(gender) / sum(table(gender))
## gender
##         F         M 
## 0.5555556 0.4444444
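
Base R also provides prop.table(), which performs the same division for us. The sketch below shows it next to a rounded percentage version; it is one possible way to present the result, not the only one.

```r
gender <- c("M", "M", "F", "F", "F", "M", "F", "M", "F")

# Proportions: same result as table(gender) / sum(table(gender))
prop.table(table(gender))
## gender
##         F         M 
## 0.5555556 0.4444444

# Percentages rounded to one decimal place
round(100 * prop.table(table(gender)), 1)
## gender
##    F    M 
## 55.6 44.4
```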

4 Standardization

A common task in statistics is to standardize variables – also known as calculating z-scores. The purpose of standardizing a vector is to put it on a common scale which allows us to compare it to other (standardized) variables.

To standardize a vector, we need to subtract the vector’s mean from each of its elements, and then divide the result by the vector’s standard deviation.

For example, suppose we have the tournament results of the five best athletes in a class. The tournament included cricket and football games. In cricket, we recorded the total runs each athlete scored in the tournament. In football, we counted the number of goals each athlete scored in the tournament. Here are the results:

Athlete  Cricket (runs)  Football (goals)
AA       120             2
AB       80              6
AC       100             1
AD       60              8
AE       50              12

We can represent the results with two vectors cricket and football:

cricket <- c(120, 80, 100, 60, 50)
football <- c(2, 6, 1, 8, 12)

As we can see, the scales of the numbers are very different. While the football numbers range from 1 to 12, the cricket numbers have a much larger range from 50 to 120. This makes it difficult to compare the two sets of numbers directly.

To solve this problem, we will use standardization. We will first create new standardized vectors called cricket.z and football.z.

cricket.z <- (cricket - mean(cricket)) / sd(cricket)
football.z <- (football - mean(football)) / sd(football)
cricket.z
## [1]  1.32701756 -0.06984303  0.62858727 -0.76827333 -1.11748847
football.z
## [1] -0.84548890  0.04449942 -1.06798598  0.48949358  1.37948189
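
As a quick check, a standardized vector should have a mean of 0 and a standard deviation of 1. The sketch below verifies this for the cricket scores and compares the manual calculation with base R's scale() function, which performs the same centering and scaling by default.

```r
cricket <- c(120, 80, 100, 60, 50)
cricket.z <- (cricket - mean(cricket)) / sd(cricket)

# The mean of the z-scores is 0 (up to floating-point rounding error)
round(mean(cricket.z), 10)
## [1] 0

# The standard deviation of the z-scores is 1
sd(cricket.z)
## [1] 1

# scale() returns a one-column matrix, so as.numeric() is used here
# to turn it back into a plain vector before comparing
all.equal(as.numeric(scale(cricket)), cricket.z)
## [1] TRUE
```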

It looks like there were two outstanding performances in particular. In cricket, the first athlete (AA) had a z-score of 1.33. We can interpret this by saying that AA scored 1.33 standard deviations more runs than the average athlete. In football, the last athlete (AE) had a z-score of 1.38. Here, we would conclude that AE scored 1.38 standard deviations more goals than the average athlete.

But which athlete was the best on average across both events?

To answer this, we will create a combined z-score for each athlete, which is the average of that athlete’s z-scores across the two events. We will do this by adding the two z-scores and dividing by two. This will tell us how well, on average, each athlete did relative to the other athletes.

average.z <- (cricket.z + football.z) / 2
average.z
## [1]  0.24076433 -0.01267181 -0.21969936 -0.13938987  0.13099671

The highest average z-score belongs to the first athlete (AA) who had an average z-score value of 0.24.
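
Instead of reading the result off by eye, we can attach the athlete labels to the vector and let R find the best performer. This is a sketch that repeats the calculation above; the vector `athletes` is introduced here to hold the labels from the results table.

```r
athletes <- c("AA", "AB", "AC", "AD", "AE")
cricket <- c(120, 80, 100, 60, 50)
football <- c(2, 6, 1, 8, 12)

cricket.z <- (cricket - mean(cricket)) / sd(cricket)
football.z <- (football - mean(football)) / sd(football)
average.z <- (cricket.z + football.z) / 2

# Label each average z-score with the athlete it belongs to
names(average.z) <- athletes

# Position (and name) of the highest average z-score
which.max(average.z)
## AA 
##  1

# All athletes ranked from best to worst, rounded to two decimals
sort(round(average.z, 2), decreasing = TRUE)
##    AA    AE    AB    AD    AC 
##  0.24  0.13 -0.01 -0.14 -0.22
```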