Lesson 1: R Basics

Basic arithmetic, assignment

The usual mathematical operators work in R, and they follow the standard order of operations:

1 / 20 * 30

## [1] 1.5

You can use parentheses to change the order in which the expressions are calculated:

1 / (20 * 30)

## [1] 0.001666667

cos(2 * pi)

## [1] 1

You can assign values to objects with <-, the assignment operator. In RStudio, the keyboard shortcut for typing it is:

For Windows users: Alt + -
For Mac users: Option + -

After assignment, we can perform whatever operations we would like with that variable:

x <- 5
x + 5

## [1] 10

sin(2 * x)

## [1] -0.5440211

x * 500 / sin(x)

## [1] -2607.088

R also has the modulus operator %%: x %% y returns the remainder when x is divided by y. For example:

5 %% 2

## [1] 1
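
The modulus operator has a companion, %/%, which performs integer division. It is not covered above, so treat the following as a small sketch of how the two operators relate; the variable names are made up for illustration.

```r
# Hypothetical values chosen for illustration
dividend <- 17
divisor <- 5

quotient <- dividend %/% divisor   # integer division: how many whole times 5 fits into 17
remainder <- dividend %% divisor   # modulus: what is left over

quotient
remainder

# The two pieces reassemble the original number
quotient * divisor + remainder == dividend
```

Together, the quotient and the remainder always rebuild the number you started with.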

Exercise 1.1

How can you tell when a number is divisible by 3?
How can you tell when a number is divisible by 10?
How can you tell when a number is odd?

Using functions

Every programming language has functions. In R, they look something like:

function_name(arg_1 = ..., arg_2 = ..., ...)

For example, the seq function takes a starting number, an ending number, and an interval, and returns a sequence of numbers based on those inputs. To count to 100 by 5:

seq(0, 100, 5)

##  [1]   0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80
## [18]  85  90  95 100
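
The general form shown earlier uses named arguments, while the seq call above passes its arguments by position. As a brief sketch, the same call can be written with the argument names from, to, and by, which are documented in ?seq. The length.out argument is also part of seq, though it is not discussed in this lesson.

```r
# Positional and named arguments give the same sequence
by_position <- seq(0, 100, 5)
by_name <- seq(from = 0, to = 100, by = 5)
identical(by_position, by_name)

# Named arguments can be given in any order
seq(by = 5, to = 100, from = 0)

# length.out asks for a number of evenly spaced values instead of a step size
seq(from = 0, to = 100, length.out = 5)
```

Naming the arguments is optional here, but it can make a call easier to read when a function takes many inputs.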

And just like before, we can assign this to a variable and do other stuff with it. For example, we can add a scalar to each component:

y <- seq(0, 100, 5)
y + 2

##  [1]   2   7  12  17  22  27  32  37  42  47  52  57  62  67  72  77  82
## [18]  87  92  97 102

We can multiply the sequence by a scalar:

y * 4

##  [1]   0  20  40  60  80 100 120 140 160 180 200 220 240 260 280 300 320
## [18] 340 360 380 400

Another function is the length function. We can use it to check the number of elements contained in an object:

length(y)

## [1] 21

We can also do component-wise addition of two vectors:

z <- seq(1, 20)
y + z

## Warning in y + z: longer object length is not a multiple of shorter object
## length

##  [1]   1   7  13  19  25  31  37  43  49  55  61  67  73  79  85  91  97
## [18] 103 109 115 101

Exercise 1.2

What was the warning we received when we added y and z together? How did R handle it?
When we made the list of numbers y, we provided a third number to indicate the interval to count by. When we made z, we left that out. What interval did R use instead?

Accessing elements

We can access elements of the sequence we created earlier by using bracket notation. For example, to access the second element of y, we can do:

y[2]

## [1] 5

To access the first 5 elements, we can do:

y[1:5]

## [1]  0  5 10 15 20
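
Bracket notation accepts more than a single position or a range. The sketch below shows a few other forms that base R supports: a vector of positions built with c(), a negative index, and a logical condition. It recreates y so that it can be run on its own.

```r
y <- seq(0, 100, 5)

# Pick out non-adjacent elements by combining positions with c()
y[c(1, 3, 5)]

# A negative index drops that position and keeps everything else
y[-1]

# A logical condition keeps only the elements for which it is TRUE
y[y > 90]
```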

Working with data

Most of the time you’re working with R, you’ll be manipulating some data that was given to you, like expression data, an OTU abundance table, or yield data. The most common form of data is some sort of rectangular container, such as a data frame, a table, a tibble, a matrix, etc. While there are some differences between all of them, they all function similarly. Note that we’re going to begin our investigation using Base R functionality before we move on to using tidyverse (specifically, dplyr) functions.

We’re going to begin learning how to manipulate data by using mtcars, a dataset that comes packaged with R.

Let’s learn more about the mtcars dataset:

?mtcars

So there’s information about a few different cars here. Let’s take a look at the data:

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Most of the time, datasets are too large to print to the screen all at once, and we just want to look at some of the data to get a feel for it. We can use head or tail to look at the first or last few rows of a dataset, respectively:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(mtcars)

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

We can check how many rows and columns there are by using nrow and ncol:

nrow(mtcars)

## [1] 32

ncol(mtcars)

## [1] 11

Alternatively, we can use dim, which returns the number of rows and columns as a pair of numbers:

dim(mtcars)

## [1] 32 11
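
As a quick sketch of how these functions relate, the result of dim is an ordinary vector, so its two entries can be indexed and compared with nrow and ncol. The names and str functions used below are base R helpers that are not covered in this lesson; dims is a name made up for illustration.

```r
dims <- dim(mtcars)

# dim() returns an ordinary vector, so its pieces can be indexed
dims[1] == nrow(mtcars)
dims[2] == ncol(mtcars)

# names() lists the column names; str() gives a compact overview of each column
names(mtcars)
str(mtcars)
```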

We can use notation similar to what we used earlier to access particular elements of the data frame. For instance, if I wanted the value in the 5th column of the first row, I could access it like this:

mtcars[1, 5]

## [1] 3.9

To access the first five values in the first row, you can do this:

mtcars[1, 1:5]

##           mpg cyl disp  hp drat
## Mazda RX4  21   6  160 110  3.9

And to get the whole row, you can leave the second argument empty:

mtcars[1, ]

##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Alternatively, you can access columns by their name by putting the name in quotes:

mtcars[1, 'cyl']

## [1] 6

To access multiple column names in this way, you need to combine (or concatenate) them together with the c function:

mtcars[1, c('mpg', 'cyl', 'disp')]

##           mpg cyl disp
## Mazda RX4  21   6  160

You can use the same syntax to access the rows of the data frame:

mtcars[1:5, 1:5]

##                    mpg cyl disp  hp drat
## Mazda RX4         21.0   6  160 110 3.90
## Mazda RX4 Wag     21.0   6  160 110 3.90
## Datsun 710        22.8   4  108  93 3.85
## Hornet 4 Drive    21.4   6  258 110 3.08
## Hornet Sportabout 18.7   8  360 175 3.15

mtcars["Toyota Corolla", 1:5]

##                 mpg cyl disp hp drat
## Toyota Corolla 33.9   4 71.1 65 4.22
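
When you want one whole column, there are a few equivalent spellings. The sketch below compares the $ operator (used in the next section) with double and single brackets. The drop = FALSE argument is a base R option not covered in this lesson, and the object names are made up for illustration.

```r
# Three ways to pull out the mpg column as a plain numeric vector
by_dollar <- mtcars$mpg
by_double_bracket <- mtcars[["mpg"]]
by_single_bracket <- mtcars[, "mpg"]

identical(by_dollar, by_double_bracket)
identical(by_dollar, by_single_bracket)

# drop = FALSE keeps the result as a one-column data frame
head(mtcars[, "mpg", drop = FALSE])
```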

Exercise 1.3

Use rownames to find out all the different cars that are in the dataset. Choose a few of them and print out their rows.
Print out the number of cylinders, quarter-mile time, number of carburetors, and whether or not the engine is v-shaped for the cars you chose above. See the documentation for ?mtcars to figure out which columns have that information.

Subsetting data based on conditions

Let’s say we’re interested in cars with 4 cylinders. The following statement returns a vector of TRUE and FALSE values indicating, for each row, whether the number of cylinders equals 4.

mtcars$cyl == 4

##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [23] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
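
A logical vector like this one can also be summarised directly. As a small sketch using base R functions that are not otherwise covered here (sum, which, and table), you can count the matches or find their positions; is_four_cyl is a name made up for illustration.

```r
is_four_cyl <- mtcars$cyl == 4

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching rows
sum(is_four_cyl)

# which() gives the row positions where the condition is TRUE
which(is_four_cyl)

# table() counts the cars at every cylinder value at once
table(mtcars$cyl)
```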

We can then use this vector to subset mtcars so that it shows only the cars with 4 cylinders:

mtcars[mtcars$cyl == 4, ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Similarly, we can use this to look at cars that have more than 4 cylinders:

mtcars[mtcars$cyl > 4, ]

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

We can use |, the boolean or operator, to look at cars that are 4 or 6 cylinders:

mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

And we can combine it with our indexing from earlier to look at the number of cylinders and quarter-mile time for each of these cars:

mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, c("cyl", "qsec")]

##                cyl  qsec
## Mazda RX4        6 16.46
## Mazda RX4 Wag    6 17.02
## Datsun 710       4 18.61
## Hornet 4 Drive   6 19.44
## Valiant          6 20.22
## Merc 240D        4 20.00
## Merc 230         4 22.90
## Merc 280         6 18.30
## Merc 280C        6 18.90
## Fiat 128         4 19.47
## Honda Civic      4 18.52
## Toyota Corolla   4 19.90
## Toyota Corona    4 20.01
## Fiat X1-9        4 18.90
## Porsche 914-2    4 16.70
## Lotus Europa     4 16.90
## Ferrari Dino     6 15.50
## Volvo 142E       4 18.60
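
Chaining several == comparisons with | gets long when there are many values to match. One possible alternative, sketched here, is the base R %in% operator, which is not covered in this lesson; with_in and with_or are names made up for illustration.

```r
# %in% checks each value of cyl against a set of allowed values
with_in <- mtcars[mtcars$cyl %in% c(4, 6), c("cyl", "qsec")]

# The longer form used above, for comparison
with_or <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, c("cyl", "qsec")]

identical(with_in, with_or)
nrow(with_in)
```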

Exercise 1.4

Use &, the boolean and operator, to subset mtcars to those cars with 6 cylinders and less than 35 miles per gallon.

Getting started with the tidyverse

Everything we did up to now was done using base R and was useful in helping you get a grasp of how data frames work. However, we’re going to start using the tidyverse suite of functions from here on out. If you do some searching online, you’ll find some people with strong opinions about whether or not to use the tidyverse functions. There are arguments on both sides. On the one hand, tidyverse functions make certain things a lot easier, like quickly prototyping and editing data analysis pipelines, and tidyverse code is often easier to read than base R code. On the other hand, code written using tidyverse functions is not 100% portable, because your collaborator may not have installed the required packages. We’ll be using the tidyverse functions when we can, but we will use base R functionality as required.

Let’s revisit what we did earlier in the subsetting section. We’ll start by importing the tidyverse functions:

library(tidyverse)

The first function we will learn about is select. We can use the select function to select variables (columns) in a data frame. For example, to select the number of cylinders and miles per gallon, we can do:

select(mtcars, 'cyl', 'mpg')

##                     cyl  mpg
## Mazda RX4             6 21.0
## Mazda RX4 Wag         6 21.0
## Datsun 710            4 22.8
## Hornet 4 Drive        6 21.4
## Hornet Sportabout     8 18.7
## Valiant               6 18.1
## Duster 360            8 14.3
## Merc 240D             4 24.4
## Merc 230              4 22.8
## Merc 280              6 19.2
## Merc 280C             6 17.8
## Merc 450SE            8 16.4
## Merc 450SL            8 17.3
## Merc 450SLC           8 15.2
## Cadillac Fleetwood    8 10.4
## Lincoln Continental   8 10.4
## Chrysler Imperial     8 14.7
## Fiat 128              4 32.4
## Honda Civic           4 30.4
## Toyota Corolla        4 33.9
## Toyota Corona         4 21.5
## Dodge Challenger      8 15.5
## AMC Javelin           8 15.2
## Camaro Z28            8 13.3
## Pontiac Firebird      8 19.2
## Fiat X1-9             4 27.3
## Porsche 914-2         4 26.0
## Lotus Europa          4 30.4
## Ford Pantera L        8 15.8
## Ferrari Dino          6 19.7
## Maserati Bora         8 15.0
## Volvo 142E            4 21.4

We can also use select to deselect certain variables; when used in this way, select will return all the columns except for the variables indicated. For example, if we aren’t interested in miles per gallon, number of cylinders, or displacement, we can do:

select(mtcars, -"mpg", -"cyl", -"disp")

##                      hp drat    wt  qsec vs am gear carb
## Mazda RX4           110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       110 3.90 2.875 17.02  0  1    4    4
## Datsun 710           93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   175 3.15 3.440 17.02  0  0    3    2
## Valiant             105 2.76 3.460 20.22  1  0    3    1
## Duster 360          245 3.21 3.570 15.84  0  0    3    4
## Merc 240D            62 3.69 3.190 20.00  1  0    4    2
## Merc 230             95 3.92 3.150 22.90  1  0    4    2
## Merc 280            123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   230 3.23 5.345 17.42  0  0    3    4
## Fiat 128             66 4.08 2.200 19.47  1  1    4    1
## Honda Civic          52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla       65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona        97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9            66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2        91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          109 4.11 2.780 18.60  1  1    4    2

You can even select columns by matching on their names. For example, to select every column whose name contains the letter p:

select(mtcars, contains("p"))

##                      mpg  disp  hp
## Mazda RX4           21.0 160.0 110
## Mazda RX4 Wag       21.0 160.0 110
## Datsun 710          22.8 108.0  93
## Hornet 4 Drive      21.4 258.0 110
## Hornet Sportabout   18.7 360.0 175
## Valiant             18.1 225.0 105
## Duster 360          14.3 360.0 245
## Merc 240D           24.4 146.7  62
## Merc 230            22.8 140.8  95
## Merc 280            19.2 167.6 123
## Merc 280C           17.8 167.6 123
## Merc 450SE          16.4 275.8 180
## Merc 450SL          17.3 275.8 180
## Merc 450SLC         15.2 275.8 180
## Cadillac Fleetwood  10.4 472.0 205
## Lincoln Continental 10.4 460.0 215
## Chrysler Imperial   14.7 440.0 230
## Fiat 128            32.4  78.7  66
## Honda Civic         30.4  75.7  52
## Toyota Corolla      33.9  71.1  65
## Toyota Corona       21.5 120.1  97
## Dodge Challenger    15.5 318.0 150
## AMC Javelin         15.2 304.0 150
## Camaro Z28          13.3 350.0 245
## Pontiac Firebird    19.2 400.0 175
## Fiat X1-9           27.3  79.0  66
## Porsche 914-2       26.0 120.3  91
## Lotus Europa        30.4  95.1 113
## Ford Pantera L      15.8 351.0 264
## Ferrari Dino        19.7 145.0 175
## Maserati Bora       15.0 301.0 335
## Volvo 142E          21.4 121.0 109

You can still use column numbers in the select function:

select(mtcars, 1:3)

##                      mpg cyl  disp
## Mazda RX4           21.0   6 160.0
## Mazda RX4 Wag       21.0   6 160.0
## Datsun 710          22.8   4 108.0
## Hornet 4 Drive      21.4   6 258.0
## Hornet Sportabout   18.7   8 360.0
## Valiant             18.1   6 225.0
## Duster 360          14.3   8 360.0
## Merc 240D           24.4   4 146.7
## Merc 230            22.8   4 140.8
## Merc 280            19.2   6 167.6
## Merc 280C           17.8   6 167.6
## Merc 450SE          16.4   8 275.8
## Merc 450SL          17.3   8 275.8
## Merc 450SLC         15.2   8 275.8
## Cadillac Fleetwood  10.4   8 472.0
## Lincoln Continental 10.4   8 460.0
## Chrysler Imperial   14.7   8 440.0
## Fiat 128            32.4   4  78.7
## Honda Civic         30.4   4  75.7
## Toyota Corolla      33.9   4  71.1
## Toyota Corona       21.5   4 120.1
## Dodge Challenger    15.5   8 318.0
## AMC Javelin         15.2   8 304.0
## Camaro Z28          13.3   8 350.0
## Pontiac Firebird    19.2   8 400.0
## Fiat X1-9           27.3   4  79.0
## Porsche 914-2       26.0   4 120.3
## Lotus Europa        30.4   4  95.1
## Ford Pantera L      15.8   8 351.0
## Ferrari Dino        19.7   6 145.0
## Maserati Bora       15.0   8 301.0
## Volvo 142E          21.4   4 121.0
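
The select function accepts a few other ways of naming columns. The following is a minimal sketch, loading dplyr directly rather than the whole tidyverse to keep it light; by_range is a name made up for illustration, and starts_with is a selection helper that is not otherwise covered here.

```r
library(dplyr)

# Unquoted column names work as well as quoted ones
head(select(mtcars, cyl, mpg), 3)

# A range of neighbouring columns, equivalent to select(mtcars, 1:3)
by_range <- select(mtcars, mpg:disp)
names(by_range)

# starts_with() is a helper in the same family as contains()
names(select(mtcars, starts_with("d")))
```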

The next function we’re going to learn about is the filter function. This function subsets the data by certain conditions. To use the filter function to select cars with 4 cylinders, we would do:

filter(mtcars, cyl == 4)

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 2  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 3  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 4  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 5  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 6  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 7  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 8  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 9  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 10 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 11 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

We can combine conditions just like before:

filter(mtcars, (cyl == 4 | cyl == 8) & mpg < 25)

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 2  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 3  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 4  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 5  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 6  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 7  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 8  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 9  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 10 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 11 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 12 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 13 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 14 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 15 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 16 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 17 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 18 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 19 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
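
Another way to write the same filter, sketched below, is to give the conditions as separate arguments. The filter function combines comma-separated conditions with "and", and the base R %in% operator (not covered in this lesson) tests membership in a set of values. The object names are made up for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with "and"
with_commas <- filter(mtcars, cyl %in% c(4, 8), mpg < 25)

# The explicit version from above
with_operators <- filter(mtcars, (cyl == 4 | cyl == 8) & mpg < 25)

nrow(with_commas)
nrow(with_operators)
```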

We can also combine the functions. For example, to see the miles per gallon, number of cylinders, and displacement for all the cars with four cylinders, we can do:

filter(select(mtcars, 'mpg', 'cyl', 'disp'), cyl == 4)

##     mpg cyl  disp
## 1  22.8   4 108.0
## 2  24.4   4 146.7
## 3  22.8   4 140.8
## 4  32.4   4  78.7
## 5  30.4   4  75.7
## 6  33.9   4  71.1
## 7  21.5   4 120.1
## 8  27.3   4  79.0
## 9  26.0   4 120.3
## 10 30.4   4  95.1
## 11 21.4   4 121.0

Note that filter discarded the row names of the data.frame in the output above. This is because the row names are not themselves a column. (This reflects the dplyr version used to produce the output; newer dplyr releases may keep the row names of a data.frame.) We’ll work later with tibbles, which are functionally similar to data.frames, but have a few enhancements to make working with them easier. For now, in cases where the row names are important, you can use the subset function in place of filter:

subset(select(mtcars, 'mpg', 'cyl', 'disp'), cyl == 4)

##                 mpg cyl  disp
## Datsun 710     22.8   4 108.0
## Merc 240D      24.4   4 146.7
## Merc 230       22.8   4 140.8
## Fiat 128       32.4   4  78.7
## Honda Civic    30.4   4  75.7
## Toyota Corolla 33.9   4  71.1
## Toyota Corona  21.5   4 120.1
## Fiat X1-9      27.3   4  79.0
## Porsche 914-2  26.0   4 120.3
## Lotus Europa   30.4   4  95.1
## Volvo 142E     21.4   4 121.0
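
The subset function can also choose columns by itself through its select argument, which is documented in ?subset but not used in this lesson. The sketch below does the row and column selection in one base R call and compares it with bracket notation; the object names are made up for illustration.

```r
# subset() can choose rows and columns in a single call (base R only)
four_cyl <- subset(mtcars, cyl == 4, select = c(mpg, cyl, disp))
four_cyl

# The same result written with bracket notation
four_cyl_brackets <- mtcars[mtcars$cyl == 4, c("mpg", "cyl", "disp")]
all.equal(four_cyl, four_cyl_brackets)
```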

filter and subset are mostly the same, and we’ll use them interchangeably as needed in this tutorial. If you’d like to read more about the differences between the two, you can check out this Stack Overflow post.

Note that the code above got complicated very quickly. You can imagine that as you nest more and more function calls on the same line, it gets easier to misplace a parenthesis or a quote, and the code becomes harder to read. To remedy this, you could use intermediate steps, where you save the result of each function call into a new object:

mtcars.mpg_cyl_disp <- select(mtcars, 'mpg', 'cyl', 'disp')
mtcars.mpg_cyl_disp.four_cylinders <- subset(mtcars.mpg_cyl_disp, cyl == 4)

However, this quickly clutters up your workspace with objects and functions that you don’t necessarily need.

The solution to this is to start using %>%, the pipe operator. This operator takes the object on the left and passes it as the first input to the function call on the right. Liberal use of %>% and proper formatting will make your code much easier to read. For example, a pipeline like the one above (here keeping only the mpg and cyl columns) can be written with the pipe as:

mtcars %>% 
  select('mpg', 'cyl') %>% 
  subset(cyl == 4)

##                 mpg cyl
## Datsun 710     22.8   4
## Merc 240D      24.4   4
## Merc 230       22.8   4
## Fiat 128       32.4   4
## Honda Civic    30.4   4
## Toyota Corolla 33.9   4
## Toyota Corona  21.5   4
## Fiat X1-9      27.3   4
## Porsche 914-2  26.0   4
## Lotus Europa   30.4   4
## Volvo 142E     21.4   4
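
To make the behaviour of the pipe concrete, here is a short sketch that checks a piped call against its nested equivalent. It loads dplyr, which makes %>% available; the object names are made up for illustration.

```r
library(dplyr)  # makes %>% available

# x %>% f(y) is another way of writing f(x, y)
piped <- mtcars %>% head(3)
nested <- head(mtcars, 3)
identical(piped, nested)

# A chain of two steps, read top to bottom...
chained <- mtcars %>%
  select('mpg', 'cyl') %>%
  subset(cyl == 4)

# ...is the same as the nested call, read inside out
nested_calls <- subset(select(mtcars, 'mpg', 'cyl'), cyl == 4)
all.equal(chained, nested_calls)
```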

We can immediately see that this is far more readable and easier to edit. For example, we can quickly add an additional subset call to get only those cars with higher than 25 miles-per-gallon:

mtcars %>% 
  select('mpg', 'cyl') %>% 
  subset(cyl == 4) %>% 
  subset(mpg > 25)

##                 mpg cyl
## Fiat 128       32.4   4
## Honda Civic    30.4   4
## Toyota Corolla 33.9   4
## Fiat X1-9      27.3   4
## Porsche 914-2  26.0   4
## Lotus Europa   30.4   4

It’s a bit of a pain to type %>% by hand over and over, so get comfortable with the shortcut command:

For Windows users: Ctrl + Shift + M
For Mac users: Cmd + Shift + M

You can find a full list of shortcuts under Tools > Keyboard Shortcuts Help or here.

You can still use everything you learned earlier:

mtcars[c("Toyota Corolla", "Honda Civic", "Datsun 710"), ] %>% 
  select("mpg", "cyl")

##                 mpg cyl
## Toyota Corolla 33.9   4
## Honda Civic    30.4   4
## Datsun 710     22.8   4

Exercise 1.5

For this exercise, we’re going to use the data in the nycflights13 package. Let’s take a look at the data:

library(nycflights13)
head(flights)

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

Take a few moments to read through the documentation with ?flights and do the following tasks:

How many flights were there on January 1, 2013?
How many flights were operated by United or Delta?
How many flights arrived more than two hours late?
How many flights from JFK departed more than two hours late?
Read about the arrange function. Use it to find the flights that arrived latest.

Adding new variables, summarizing data

We’re usually interested in summarizing a data set. For example, we might be interested in the mean of each of the variables in mtcars. One way to do this is to use the mean function on each of the columns individually:

mean(mtcars$mpg)

## [1] 20.09062

# mean(mtcars$cyl)
# mean(mtcars$disp)
# etc...

Alternatively, we can just use colMeans on the entire data frame at once:

colMeans(mtcars)

##        mpg        cyl       disp         hp       drat         wt 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250 
##       qsec         vs         am       gear       carb 
##  17.848750   0.437500   0.406250   3.687500   2.812500
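
colMeans only computes means. For other per-column summaries, one option is the base R sapply function, which is not covered in this lesson. The sketch below first checks that sapply with mean agrees with colMeans, then swaps in median; col_means is a name made up for illustration.

```r
# sapply() applies a function to every column of the data frame
col_means <- sapply(mtcars, mean)
all.equal(col_means, colMeans(mtcars))

# Swapping in another function gives a different per-column summary
sapply(mtcars, median)
```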

Another useful function is summary:

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

From here, we can then look at which cars have higher than average mpg:

mtcars %>% 
  subset(mpg > mean(mtcars$mpg))

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
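
The same rows can also be selected with dplyr’s filter, which evaluates the condition inside the data, so mean(mpg) needs no mtcars$ prefix. This is a minimal sketch, assuming dplyr is loaded; the name above_avg is only for illustration:

```r
library(dplyr)

# Keep only the cars whose mpg is above the overall mean.
# Inside filter(), mpg refers to the column of the piped-in data.
above_avg <- mtcars %>%
  filter(mpg > mean(mpg))

# Number of cars that passed the filter
nrow(above_avg)
```

nrow(above_avg) should report the same 14 cars listed in the output above.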

And we can sort them by mpg:

mtcars %>% 
  subset(mpg > mean(mtcars$mpg)) %>% 
  arrange(mpg) %>% 
  head()

##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 4 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## 5 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 6 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1

Again, because we’re working with a data.frame, the dplyr functions drop the row names. To preserve the row names, we can first reorder the data.frame using base R, then subset it:

mtcars[rev(order(mtcars$mpg)), ] %>% 
  subset(mpg > mean(mtcars$mpg)) %>% 
  head()

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
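
Another option, sketched here under the assumption that the tibble package is installed, is to copy the row names into an ordinary column before using any dplyr functions. The names then survive whichever way the installed dplyr version treats row names. The column name model and the object name top_mpg are illustrative choices, not part of the dataset:

```r
library(dplyr)
library(tibble)

top_mpg <- mtcars %>%
  rownames_to_column(var = "model") %>%  # store the car names in a regular column
  filter(mpg > mean(mpg)) %>%            # keep cars with above-average mpg
  arrange(desc(mpg)) %>%                 # highest mpg first
  head()

top_mpg
```

Cars with tied mpg values may appear in a different order than in the rev(order(...)) version, because the two approaches break ties differently.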

Let’s suppose that we wanted to get the average mpg of cars by the number of cylinders they have. One way to do this is to filter by the number of cylinders, select the mpg column, and then take the mean:

mtcars %>% 
  filter(cyl == 4) %>% 
  select(mpg) %>% 
  summarise(mean_mpg = mean(mpg))

##   mean_mpg
## 1 26.66364

Note summarise, the new function that we used. summarise calculates whatever formula you provide on the data that is passed to it. We’ll see shortly how we can use this to do some more interesting things.
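
summarise is not limited to one formula per call. As a small sketch (the column names mean_mpg, median_mpg, and max_mpg, and the object four_cyl_summary, are illustrative), several summaries of the four-cylinder cars can be computed at once:

```r
library(dplyr)

four_cyl_summary <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg   = mean(mpg),    # average mpg
            median_mpg = median(mpg),  # middle value
            max_mpg    = max(mpg))     # best mpg in the group

four_cyl_summary
```

Each named argument becomes one column in the single-row result.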

It’s great that we got the mean mpg for the cars with four cylinders, but we still want to know the mean mpg of the other cars. We could repeat this for each cylinder count, but that approach is not scalable and relies on knowing all the different levels of the variable you’re looking at, which is not feasible for larger datasets. To remedy this, we can use the group_by and summarise functions from the tidyverse. You pass the variable you would like to group the original data by to group_by, and from there summarise computes summary statistics for each group from the formulas you pass to it.

For instance, to get the mean mpg by cylinder count, you can do:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(mean = mean(mpg))

## # A tibble: 3 x 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1

Here, we separated the data into groups based on the number of cylinders in the car, then performed the same calculation as above.
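
To check that the grouped version really reproduces the earlier filter-based calculation, the two results can be compared directly. This is only a sanity-check sketch; the object names by_group and manual are made up for the example:

```r
library(dplyr)

# Grouped calculation: one mean per cylinder count
by_group <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean = mean(mpg))

# Manual calculation for the four-cylinder cars only
manual <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))

# TRUE when the four-cylinder entry of the grouped result matches the manual one
all.equal(by_group$mean[by_group$cyl == 4], manual$mean_mpg)
```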

We can also add a count column:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(mean = mean(mpg),
            count = n())

## # A tibble: 3 x 3
##     cyl  mean count
##   <dbl> <dbl> <int>
## 1     4  26.7    11
## 2     6  19.7     7
## 3     8  15.1    14

Note that if we are only interested in the number of observations per group, we can simply use count:

mtcars %>% 
  group_by(cyl) %>% 
  count()

## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14
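
count can also be given the grouping variable directly, which makes the separate group_by step optional. A short sketch (cyl_counts is an illustrative name) that also sorts the groups from largest to smallest:

```r
library(dplyr)

# count() groups by the listed variable itself, so group_by() is not needed.
cyl_counts <- mtcars %>%
  count(cyl, sort = TRUE)  # sort = TRUE puts the largest group first

cyl_counts
```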

Exercise 1.6

For the mtcars dataset:

What is the average horsepower, weight, and displacement by cylinder count?
How many cars are there with automatic transmission?
What is the count of cars by gear count?
What is the breakdown of car count by transmission type and gear count? Hint: you can group by more than one variable at a time. Notice anything interesting?
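
As a sketch of the hint, here is grouping by two variables at once, using a different pair of variables (cyl and vs) than the exercise asks for. The names mpg_by_cyl_vs, mean_mpg, and count are illustrative:

```r
library(dplyr)

# One row per combination of cyl and vs that actually occurs in the data
mpg_by_cyl_vs <- mtcars %>%
  group_by(cyl, vs) %>%
  summarise(mean_mpg = mean(mpg),
            count = n())

mpg_by_cyl_vs
```

Only combinations that appear in the data get a row, so the result can have fewer rows than the number of possible pairs.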

For the flights dataset:

Which carrier had the most flights that departed more than two hours late?
Which carrier had the lowest average arrival delay?
Which destination airport had the most arrivals more than 30 minutes late?
Use n_distinct to find out how many distinct carriers and destinations there are.
Which route (origin-destination pair) had the most flights?
Which day had the most flights?
Which day had the most arrival delays greater than 30 minutes?

Adding new variables with mutate

The mutate function lets us add new variables to a data object by specifying a formula. For instance, to add a new column to mtcars showing the weight of the car in pounds (instead of thousands of pounds), we can do:

mtcars %>% 
  mutate(weight_in_pounds = 1000 * wt) %>% 
  head()

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb weight_in_pounds
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4             2620
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4             2875
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1             2320
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1             3215
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2             3440
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1             3460

We can add multiple variables at once. For example:

mtcars %>% 
  mutate(weight_in_pounds = 1000 * wt,
         hp_per_cylinder = hp / cyl) %>% 
  head()

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb weight_in_pounds
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4             2620
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4             2875
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1             2320
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1             3215
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2             3440
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1             3460
##   hp_per_cylinder
## 1        18.33333
## 2        18.33333
## 3        23.25000
## 4        18.33333
## 5        21.87500
## 6        17.50000
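
Within a single mutate call, a later formula can refer to a variable created earlier in the same call. A minimal sketch, where hp_per_pound and with_ratio are hypothetical names chosen for the example:

```r
library(dplyr)

with_ratio <- mtcars %>%
  mutate(weight_in_pounds = 1000 * wt,
         # weight_in_pounds was defined on the previous line of this call
         hp_per_pound = hp / weight_in_pounds) %>%
  select(hp, wt, weight_in_pounds, hp_per_pound)

head(with_ratio)
```

This avoids a second mutate call when one new variable builds on another.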

And from there, we can do all the other things that we have learned so far:

mtcars %>% 
  mutate(weight_in_pounds = 1000 * wt,
         hp_per_cylinder = hp / cyl) %>% 
  select(hp, cyl, hp_per_cylinder) %>% 
  arrange(desc(hp_per_cylinder))

##     hp cyl hp_per_cylinder
## 1  335   8        41.87500
## 2  264   8        33.00000
## 3  245   8        30.62500
## 4  245   8        30.62500
## 5  175   6        29.16667
## 6  230   8        28.75000
## 7  113   4        28.25000
## 8  109   4        27.25000
## 9  215   8        26.87500
## 10 205   8        25.62500
## 11  97   4        24.25000
## 12  95   4        23.75000
## 13  93   4        23.25000
## 14  91   4        22.75000
## 15 180   8        22.50000
## 16 180   8        22.50000
## 17 180   8        22.50000
## 18 175   8        21.87500
## 19 175   8        21.87500
## 20 123   6        20.50000
## 21 123   6        20.50000
## 22 150   8        18.75000
## 23 150   8        18.75000
## 24 110   6        18.33333
## 25 110   6        18.33333
## 26 110   6        18.33333
## 27 105   6        17.50000
## 28  66   4        16.50000
## 29  66   4        16.50000
## 30  65   4        16.25000
## 31  62   4        15.50000
## 32  52   4        13.00000

Exercise 1.7

For the mtcars dataset:

What is the average hp_per_cylinder by cylinder count?

For the flights dataset:

Create a new variable gain, defined as the difference between the departure delay and arrival delay of a flight. What does this variable represent? What is the average gain for all flights in 2013? By airline?
Which routes had the worst (i.e., largest) average gain?

Next section: ggplot2 basics

Paul Villanueva
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University, Ames, IA.