1 Programming and Algorithm

Programming refers to a technological process for telling a computer which tasks to perform in order to solve problems. We can think of programming as a collaboration between humans and computers, in which humans create instructions for a computer in a language that the computer can understand and follow.

This set of well-defined instructions to solve a particular problem is called an algorithm. It takes a set of input(s) and produces the desired output. For example, an algorithm to add two numbers looks like this: take two numbers as input, add them using the + operator, then display the result.
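As a small sketch, the three steps of this addition algorithm can be written in R, the language used in the rest of these notes. The variable names and the input values 2 and 3 are chosen only for illustration.

```r
# Step 1: take two number inputs (fixed values here, for illustration)
first_number <- 2
second_number <- 3

# Step 2: add the numbers using the + operator
result <- first_number + second_number

# Step 3: display the result
print(result)
## [1] 5
```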

How does computer programming work? At its most basic, programming tells a computer what to do.

Some examples

Most used programming languages

According to a survey, the top five programming languages that developers use as of 2023 are:

Qualities of a good algorithm

2 What is R?

R is a programming language for statistical computing and graphics. Created by statisticians Ross Ihaka and Robert Gentleman, R is used among data miners, bioinformaticians and statisticians for data analysis and developing statistical software. The core R language is enhanced by a large number of extension packages containing reusable code and documentation.

2.1 Features of R

  1. R is an interpreted language. Recall the steps described earlier under “How does computer programming work?”. The R interpreter executes instructions directly, without requiring them to have been compiled into a machine-language program beforehand (i.e., it can skip step 2).
  2. R can be accessed through a command-line interpreter. If a user types 2+2 at the R command prompt and presses enter, the computer replies with 4.
  3. R’s capabilities are extended through user-created packages. These packages offer statistical techniques, graphical devices, import/export, reporting etc. Easy installation and wide use of these packages drive R’s widespread adoption in data science.
  4. Early developers preferred to run R via the command-line console. More recently, R is usually run through an IDE (integrated development environment) or a graphical front end. Examples include R Commander and RStudio.
  5. Using RStudio, one can seamlessly combine R code, analyses, plots, and written text into documents with the help of R Markdown.
  6. Finally, R is 100% free, which has helped it build a huge support community. This community of R programmers constantly develops and distributes new R functionality and packages.
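To illustrate the interpreted, command-line style of working described in the list above, here is a short sketch of expressions typed at the R prompt. Each line is evaluated as soon as it is entered, with no separate compile step. The expressions other than 2 + 2 are added here only as examples.

```r
# Each expression is evaluated immediately by the interpreter
2 + 2
## [1] 4
10 / 4
## [1] 2.5
sqrt(16)
## [1] 4
```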

To use R, we will need two software packages: Base-R and RStudio. Base-R is the basic software which contains the R programming language. RStudio is software that makes R programming easier.

2.2 Installing base R

Base R can be installed by downloading it from one of the links below (depending on the operating system) and following the instructions:

Click here for Windows

Click here for Mac

Note: R and RStudio are constantly being updated with new features and bug fixes. The latest version (as of 19 December 2023) of Base-R is 4.3.2 “Eye Holes”, and the latest version of RStudio is 2023.12.0+369. We should update R and RStudio to the newest versions periodically; otherwise, some code and packages may not work.

3 RStudio

3.1 Installing RStudio

While we can do pretty much everything within base R, we will do R programming in an application called RStudio. It provides a graphical user interface (GUI) for R that makes programming in R a bit easier. Once we have installed RStudio, we will likely never need to open the base R application again.

RStudio can be downloaded from here, and can be installed following the on-screen instructions.

3.2 Introduction to RStudio

When we open RStudio, we can see the following four windows (also called panes):

Pane 1: Source

This is where we write our code.

When we open RStudio, it will automatically start a new Untitled script. We should always save it with a new file name (e.g., “script1.R”). If something happens while we are working, R will have our code waiting for us when we re-open RStudio.

Note that when we type code in the Source pane, R will not actually run it. To run the code, we first need to ‘send’ it to the Console (Pane 2).

The fastest way to send code from the Source to the Console is to highlight the code and click the “Run” button at the top right of the Source pane. The shortcut keys for this task are “Command + Return” on Mac and “Control + Enter” on PC.

Pane 2: Console

The console is the heart of R. Here is where R actually evaluates code.

At the beginning of the console, we can see the character >. This is a prompt that tells us that R is ready for new code. We can type code directly into the console after the prompt and get an immediate response. For example, if we type 1+1 into the console and press enter, we will have an output of 2. We can also type 1+1 into the source, then select the code, and use the “Run” button to get the result.

So we can see that we can execute code either by running it from the Source or by typing it directly into the Console.

However, most of the time, we should use the Source rather than the Console, for two reasons:

  • If we type code into the Console, it won’t be saved (although we can look back at the command History).
  • If we make a mistake while typing code into the Console, we need to re-type everything all over again.

Therefore, it is better to write all our code in the Source. When we are ready to execute it, we can then use “Run” and send it to the Console.

Pane 3: Environment

The Environment tab of this panel lists the names of all the data objects (like vectors, matrices, and data frames) that we are using in our current R session. We can also see information such as the number of observations (rows) and variables (columns) in data objects.

The tab also has a few clickable actions like “Import Dataset”, which will open a graphical user interface (GUI) to import data into R. We can click the “Broom” icon to clear the contents of this pane.
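The information shown in the Environment tab can also be obtained with code. The following is a minimal sketch using the base R functions ls(), dim() and rm(); the object names scores and survey are hypothetical and used only for illustration.

```r
# Create two objects; in RStudio they would appear in the Environment tab
scores <- c(10, 20, 30)
survey <- data.frame(id = 1:3, score = scores)

# List the names of the objects in the current environment
ls()

# Number of observations (rows) and variables (columns) of a data frame
survey_dim <- dim(survey)
survey_dim
## [1] 3 2

# Remove specific objects; rm(list = ls()) would remove everything,
# similar to clicking the broom icon
rm(scores, survey)
```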

The History tab of this panel shows the history of all the code we have previously evaluated in the Console. As we progress further with R, we will find this pane useful. For now, let us set it aside and move to the fourth pane of RStudio.

Pane 4: Files/Plots/…

This panel shows us lots of helpful information. Let us go through each tab in detail:

  • Files - The files panel gives us access to the file directory on our hard drive. We can use it to set our working directory. We will talk about working directories in more detail soon.
  • Plots - The Plots panel shows all our plots. There are buttons for opening the plot in a separate window and exporting the plot as a pdf or jpeg.
  • Packages - Shows a list of all the R packages installed on our hard drive and indicates whether or not they are currently loaded. Packages that are loaded in the current session are checked, while those that are installed but not yet loaded are unchecked. We will discuss packages in more detail in the next section.
  • Help - Help menu for R functions. We can either type the name of a function in the search window or run code such as ?name or help("name") in the Console, where name is the name of the function.
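As a brief sketch of searching for help with code, the calls below use the base R help system. The function mean() is chosen only as an example; in RStudio the result is displayed in the Help tab.

```r
# Open the help page for the mean() function
?mean

# The same request written as a function call
help("mean")

# Search all installed help pages when the exact function name is unknown
help.search("arithmetic mean")
```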

4 Work Organization - Projects

When we are performing an analysis, we will typically be using many files: input data, files containing code to perform the analysis, and results. By creating a project in RStudio, we make it easier to manage these files.

Let us start the course by making a new project in RStudio, and copying some data into it that we will use in future.

Recommendations:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory.
  4. Put files generated during clean-up and analysis in a results directory.
  5. Put source for the scripts and programs in the src directory.
  6. Name all files to reflect their content or function.

We will create four directories (data, doc, results and src) in our project directory. The directory should look like this in Pane 4.
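These directories can be created by clicking in the Files tab, or with code. The following is a minimal sketch using base R's dir.create(); the location project_dir is a hypothetical placeholder inside a temporary folder, and in a real project it would be the project directory itself.

```r
# Hypothetical project location, used only for illustration
project_dir <- file.path(tempdir(), "my_project")

# The sub-directories named in the recommendations
sub_dirs <- c("data", "doc", "results", "src")

# Create each directory; recursive = TRUE also creates project_dir if needed
for (d in file.path(project_dir, sub_dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```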

Note that the path (“Home  Library  …”) will vary according to where we created the project. Now when we start R in this project directory, or open this project with RStudio, all of our work on this project will be entirely self-contained in this directory.

5 Working Directory

The working directory determines where files will be loaded from and saved to by default. The current working directory is shown above the console.

If this is not our project’s directory, we can set our working directory as follows:
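One way to do this with code is sketched below, using the base R functions getwd() and setwd(). The path project_dir is a hypothetical placeholder inside a temporary folder; in practice we would use the path to our own project directory.

```r
# Show the current working directory
getwd()

# Hypothetical project folder for illustration; replace with the real path
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Change the working directory; setwd() returns the previous one,
# which we keep so that we can go back
old_dir <- setwd(project_dir)
getwd()

# Restore the previous working directory
setwd(old_dir)
```

In RStudio, the same result can usually be achieved from the menus, but the code version has the advantage of being saved in the script.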

6 Packages

When we download and install R for the first time, we are installing the Base R software. Base R will contain most of the functions we will use on a daily basis like mean() and hist(). However, only functions written by the original authors of the R language will appear here. If we want to access data and code written by other people, we need to install it as a package.

An R package is simply a bunch of data, functions, help menus, and vignettes (examples), stored in one neat place.

6.1 Installing package

Installing a package simply means downloading the package code onto our personal computer. Here, we download packages from the Comprehensive R Archive Network (CRAN).

CRAN is the central repository for R packages. To install a new R package from CRAN, we can simply run the code: install.packages("name"), where “name” is the name of the package. If everything works fine, we should see some information about where the package is being downloaded from, in addition to a progress bar.

For example,

install.packages("ggplot2")

Once we have installed a package on our computer, we never need to install it again (unless we want to install a new version of the package). However, every time we want to use it, we need to turn it on by loading it.

6.2 Loading package

To load a package, we use the library() function. For example, now that we have installed the ggplot2 package, we can load it with library("ggplot2").

library(ggplot2)

Now that we have loaded the ggplot2 package, we can use any of its functions!
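The difference between an installed package and a loaded package can also be checked with code. The sketch below is one possible approach, not the only one: is_installed() is a hypothetical helper written for these notes, built on base R's system.file(), and the install step is left commented out because it needs an internet connection.

```r
# Hypothetical helper: TRUE if a package is installed on this computer
is_installed <- function(pkg) {
  nzchar(system.file(package = pkg))
}

is_installed("stats")     # ships with R, so this is TRUE
is_installed("ggplot2")   # TRUE only if ggplot2 has been installed

# Install from CRAN only when missing, then load
# (commented out because it needs an internet connection)
# if (!is_installed("ggplot2")) install.packages("ggplot2")
# library(ggplot2)

# Packages currently loaded (attached) in this session
search()
```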

7 Codes, Comments and Elements

Here, R code is presented in a separate gray box like the one below. Lines that begin with at least one # are comments.

Note: Lines starting with a single # are comments that I write directly to explain the code. Lines starting with ## are the output from the previous line(s) of code. When you run the code yourself, you should see the same output in your console.

# Define a vector a as the integers from 1 to 5
a <- 1:5
# Print a
a
## [1] 1 2 3 4 5
# What is the mean of a?
mean(a)
## [1] 3

The output we see will often start with one or more number(s) in brackets such as [1]. The bracketed number is the position (index) of the first value printed on that line, which helps us locate values in long output.

# Generate a long vector containing the multiples of 2 from 0 to 100
seq(from = 0, to = 100, by = 2)
## [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44
## [24]  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80  82  84  86  88  90
## [47]  92  94  96  98 100
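To check how the bracketed numbers relate to positions in the vector, we can store the sequence and ask for individual elements. This is a small sketch; the name x is chosen only for illustration.

```r
# Store the same sequence in a vector
x <- seq(from = 0, to = 100, by = 2)
# How many values does it contain?
length(x)
## [1] 51
# The 24th value is the first one printed on the line starting with [24]
x[24]
## [1] 46
# The 47th value is the first one printed on the line starting with [47]
x[47]
## [1] 92
```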