The normal mathematical operators work in R, and they follow the usual order of operations:
1 / 20 * 30
## [1] 1.5
You can use parentheses to change the order in which the expressions are calculated:
1 / (20 * 30)
## [1] 0.001666667
cos(2 * pi)
## [1] 1
You can assign values to objects with <-, the assignment operator. In RStudio, the keyboard shortcut for this is Alt + - on Windows and Linux, or Option + - on Mac.
After assignment, we can perform whatever operations we would like with that variable:
x <- 5
x + 5
## [1] 10
sin(2 * x)
## [1] -0.5440211
x * 500 / sin(x)
## [1] -2607.088
R also has the modulus operator %%. The expression x %% y returns the remainder when x is divided by y. For example:
5 %% 2
## [1] 1
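As a small illustration, %% is often paired with %/%, R's integer-division operator, and is handy for tests such as checking whether a number is even. The values below are arbitrary examples, not part of the original lesson:

```r
# Remainder and integer quotient of 17 divided by 5
remainder <- 17 %% 5
quotient <- 17 %/% 5

# The two pieces reconstruct the original number
reconstructed <- quotient * 5 + remainder

# %% is vectorized, so it can test a whole vector for even numbers at once
is_even <- (1:6) %% 2 == 0

print(remainder)
print(quotient)
print(is_even)
```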
Every programming language has functions. In R, they look something like:
function_name(arg_1 = ..., arg_2 = ..., ...)
For example, the seq function takes a starting number, an ending number, and an interval, and returns a sequence of numbers based on those inputs. To count from 0 to 100 by 5:
seq(0, 100, 5)
## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
## [18] 85 90 95 100
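The arguments to seq can also be passed by name, which matches the function_name(arg_1 = ..., ...) pattern shown above. A short sketch using seq's documented from, to, by, and length.out arguments:

```r
# Same sequence as seq(0, 100, 5), with the arguments named explicitly
by_five <- seq(from = 0, to = 100, by = 5)

# Instead of a step size, ask for a fixed number of evenly spaced values
five_points <- seq(from = 0, to = 1, length.out = 5)

print(by_five)
print(five_points)
```

Naming the arguments makes the call easier to read and lets you supply them in any order.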
And just like before, we can assign this to a variable and do other stuff with it. For example, we can add a scalar to each component:
y <- seq(0, 100, 5)
y + 2
## [1] 2 7 12 17 22 27 32 37 42 47 52 57 62 67 72 77 82
## [18] 87 92 97 102
We can multiply the sequence by a scalar:
y * 4
## [1] 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320
## [18] 340 360 380 400
Another function is the length function. We can use it to check the number of elements contained in an object:
length(y)
## [1] 21
We can do component-wise addition of two vectors:
z <- seq(1, 20)
y + z
## Warning in y + z: longer object length is not a multiple of shorter object
## length
## [1] 1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97
## [18] 103 109 115 101
What happened when we added y and z together? How did R handle it?
When we made y, we provided a third number to indicate the interval to count by. When we made z, we left that out.
We can access elements of the sequence we created earlier by using bracket notation. For example, to access the second element of y, we can do:
y[2]
## [1] 5
To access the first 5 elements, we can do:
y[1:5]
## [1] 0 5 10 15 20
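Bracket notation accepts more than single positions and ranges. The sketch below recreates y so that it runs on its own and shows three other common forms: a vector of positions, a negative index to drop elements, and a logical condition.

```r
y <- seq(0, 100, 5)

# Pick out arbitrary positions with a vector of indices
picked <- y[c(1, 3, 5)]

# A negative index drops that position instead of selecting it
without_first <- y[-1]

# A logical condition keeps only the elements where it is TRUE
large_values <- y[y > 90]

print(picked)
print(large_values)
```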
Most of the time you’re working with R, you’ll be manipulating some data that was given to you, like expression data, an OTU abundance table, or yield data. The most common form of data is some sort of rectangular container, such as a data frame, a table, a tibble, a matrix, etc. While there are some differences between all of them, they all function similarly. Note that we’re going to begin our investigation using Base R functionality before we move on to using tidyverse (specifically, dplyr) functions.
We’re going to begin learning how to manipulate data by using mtcars, a dataset that comes packaged with R.
Let’s learn more about the mtcars dataset:
?mtcars
So there’s information about a few different cars here. Let’s take a look at the data:
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Most of the time, datasets are too large to print to the screen all at once, and we just want to look at some of the data to get a feel for it. We can use head or tail to look at the first or last few rows of a dataset, respectively:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
We can check how many rows and columns there are with nrow and ncol:
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
Alternatively, we can use dim, which returns the number of rows and columns as a pair of numbers:
dim(mtcars)
## [1] 32 11
We can use notation similar to what we used earlier to access particular elements of the data frame. For instance, if I wanted the value in the 5th column of the first row, I could access it like this:
mtcars[1, 5]
## [1] 3.9
To access the first five values in the first row, you can do this:
mtcars[1, 1:5]
## mpg cyl disp hp drat
## Mazda RX4 21 6 160 110 3.9
And to get the whole row, you can leave the second argument empty:
mtcars[1, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
Alternatively, you can access columns by their name by putting the name in quotes:
mtcars[1, 'cyl']
## [1] 6
To access multiple columns by name in this way, you need to combine (or concatenate) the names with the c function:
mtcars[1, c('mpg', 'cyl', 'disp')]
## mpg cyl disp
## Mazda RX4 21 6 160
You can use the same kind of syntax to access the rows of the data frame:
mtcars[1:5, 1:5]
## mpg cyl disp hp drat
## Mazda RX4 21.0 6 160 110 3.90
## Mazda RX4 Wag 21.0 6 160 110 3.90
## Datsun 710 22.8 4 108 93 3.85
## Hornet 4 Drive 21.4 6 258 110 3.08
## Hornet Sportabout 18.7 8 360 175 3.15
mtcars["Toyota Corolla", 1:5]
## mpg cyl disp hp drat
## Toyota Corolla 33.9 4 71.1 65 4.22
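Besides brackets, a single column can be pulled out of a data frame as a plain vector with $ or with double brackets. A brief sketch on mtcars:

```r
# Three equivalent ways to extract the cyl column as a vector
cyl_dollar <- mtcars$cyl
cyl_double <- mtcars[["cyl"]]
cyl_single <- mtcars[, "cyl"]

# A row name and a column name can be combined to reach a single cell
corolla_mpg <- mtcars["Toyota Corolla", "mpg"]

print(head(cyl_dollar))
print(corolla_mpg)
```

The $ form is the most common in interactive work, while the quoted forms are useful when the column name is stored in a variable.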
Use rownames to find out all the different cars that are in the dataset. Print out the rows of these cars.
Use ?mtcars to figure out which columns have that information.
Let’s say we’re interested in cars with 4 cylinders. The following statement returns a vector of TRUE and FALSE values indicating the rows in which the statement cylinders equals 4 is true.
mtcars$cyl == 4
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [23] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
We can then use this vector to subset mtcars so that it shows only the cars with 4 cylinders:
mtcars[mtcars$cyl == 4, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Similarly, we can use this to look at cars that have more than 4 cylinders:
mtcars[mtcars$cyl > 4, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
We can use |, the boolean or operator, to look at cars that have 4 or 6 cylinders:
mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
And we can combine it with our indexing from earlier to look at the number of cylinders and quarter-mile time for each of these cars:
mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, c("cyl", "qsec")]
## cyl qsec
## Mazda RX4 6 16.46
## Mazda RX4 Wag 6 17.02
## Datsun 710 4 18.61
## Hornet 4 Drive 6 19.44
## Valiant 6 20.22
## Merc 240D 4 20.00
## Merc 230 4 22.90
## Merc 280 6 18.30
## Merc 280C 6 18.90
## Fiat 128 4 19.47
## Honda Civic 4 18.52
## Toyota Corolla 4 19.90
## Toyota Corona 4 20.01
## Fiat X1-9 4 18.90
## Porsche 914-2 4 16.70
## Lotus Europa 4 16.90
## Ferrari Dino 6 15.50
## Volvo 142E 4 18.60
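When a column is compared against several values, chaining == with | gets long. Base R's %in% operator is one alternative; the sketch below should select the same rows as the example above.

```r
# Rows where cyl is 4 or 6, written with | and with %in%
with_or <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, c("cyl", "qsec")]
with_in <- mtcars[mtcars$cyl %in% c(4, 6), c("cyl", "qsec")]

# Both approaches return the same data frame
print(identical(with_or, with_in))
print(nrow(with_in))
```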
Use &, the boolean and operator, to subset mtcars to those cars with 6 cylinders and less than 35 miles per gallon.
Everything we did up to now was done using base R and was useful in helping you get a grasp of how data frames work. From here on, however, we’re going to use the tidyverse suite of functions. If you do some searching online, you’ll find some people with strong opinions about whether or not to use the Tidyverse functions. There are arguments on both sides. On the one hand, Tidyverse functions make certain things a lot easier, like quickly prototyping and editing data analysis pipelines, and Tidyverse code is often easier to read than base R code. On the other hand, code written using Tidyverse functions is not 100% portable, because your collaborator may not have installed the required packages. We’ll be using the Tidyverse functions when we can, but we will use Base R functionality as required.
Let’s revisit what we did earlier in the subsetting section. We’ll start by loading the tidyverse packages:
library(tidyverse)
The first function we will learn about is select. We can use the select function to select variables in a data frame. For example, to select the number of cylinders and miles per gallon, we can do:
select(mtcars, 'cyl', 'mpg')
## cyl mpg
## Mazda RX4 6 21.0
## Mazda RX4 Wag 6 21.0
## Datsun 710 4 22.8
## Hornet 4 Drive 6 21.4
## Hornet Sportabout 8 18.7
## Valiant 6 18.1
## Duster 360 8 14.3
## Merc 240D 4 24.4
## Merc 230 4 22.8
## Merc 280 6 19.2
## Merc 280C 6 17.8
## Merc 450SE 8 16.4
## Merc 450SL 8 17.3
## Merc 450SLC 8 15.2
## Cadillac Fleetwood 8 10.4
## Lincoln Continental 8 10.4
## Chrysler Imperial 8 14.7
## Fiat 128 4 32.4
## Honda Civic 4 30.4
## Toyota Corolla 4 33.9
## Toyota Corona 4 21.5
## Dodge Challenger 8 15.5
## AMC Javelin 8 15.2
## Camaro Z28 8 13.3
## Pontiac Firebird 8 19.2
## Fiat X1-9 4 27.3
## Porsche 914-2 4 26.0
## Lotus Europa 4 30.4
## Ford Pantera L 8 15.8
## Ferrari Dino 6 19.7
## Maserati Bora 8 15.0
## Volvo 142E 4 21.4
We can also use select to deselect certain variables; when used in this way, select will return all the columns except for the variables indicated. For example, if we aren’t interested in miles per gallon, number of cylinders, or displacement, we can do:
select(mtcars, -"mpg", -"cyl", -"disp")
## hp drat wt qsec vs am gear carb
## Mazda RX4 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 175 3.15 3.440 17.02 0 0 3 2
## Valiant 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 109 4.11 2.780 18.60 1 1 4 2
You can even select columns by pattern. For example, the contains helper keeps every column whose name contains the given string:
select(mtcars, contains("p"))
## mpg disp hp
## Mazda RX4 21.0 160.0 110
## Mazda RX4 Wag 21.0 160.0 110
## Datsun 710 22.8 108.0 93
## Hornet 4 Drive 21.4 258.0 110
## Hornet Sportabout 18.7 360.0 175
## Valiant 18.1 225.0 105
## Duster 360 14.3 360.0 245
## Merc 240D 24.4 146.7 62
## Merc 230 22.8 140.8 95
## Merc 280 19.2 167.6 123
## Merc 280C 17.8 167.6 123
## Merc 450SE 16.4 275.8 180
## Merc 450SL 17.3 275.8 180
## Merc 450SLC 15.2 275.8 180
## Cadillac Fleetwood 10.4 472.0 205
## Lincoln Continental 10.4 460.0 215
## Chrysler Imperial 14.7 440.0 230
## Fiat 128 32.4 78.7 66
## Honda Civic 30.4 75.7 52
## Toyota Corolla 33.9 71.1 65
## Toyota Corona 21.5 120.1 97
## Dodge Challenger 15.5 318.0 150
## AMC Javelin 15.2 304.0 150
## Camaro Z28 13.3 350.0 245
## Pontiac Firebird 19.2 400.0 175
## Fiat X1-9 27.3 79.0 66
## Porsche 914-2 26.0 120.3 91
## Lotus Europa 30.4 95.1 113
## Ford Pantera L 15.8 351.0 264
## Ferrari Dino 19.7 145.0 175
## Maserati Bora 15.0 301.0 335
## Volvo 142E 21.4 121.0 109
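contains is one of several selection helpers that dplyr makes available inside select. As a sketch, starts_with and ends_with match on the beginning or end of a column name (this assumes the dplyr package is installed):

```r
library(dplyr)

# Columns whose names begin with "d"
d_columns <- select(mtcars, starts_with("d"))

# Columns whose names end with "t"
t_columns <- select(mtcars, ends_with("t"))

print(names(d_columns))
print(names(t_columns))
```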
You can still use column numbers in the select function:
select(mtcars, 1:3)
## mpg cyl disp
## Mazda RX4 21.0 6 160.0
## Mazda RX4 Wag 21.0 6 160.0
## Datsun 710 22.8 4 108.0
## Hornet 4 Drive 21.4 6 258.0
## Hornet Sportabout 18.7 8 360.0
## Valiant 18.1 6 225.0
## Duster 360 14.3 8 360.0
## Merc 240D 24.4 4 146.7
## Merc 230 22.8 4 140.8
## Merc 280 19.2 6 167.6
## Merc 280C 17.8 6 167.6
## Merc 450SE 16.4 8 275.8
## Merc 450SL 17.3 8 275.8
## Merc 450SLC 15.2 8 275.8
## Cadillac Fleetwood 10.4 8 472.0
## Lincoln Continental 10.4 8 460.0
## Chrysler Imperial 14.7 8 440.0
## Fiat 128 32.4 4 78.7
## Honda Civic 30.4 4 75.7
## Toyota Corolla 33.9 4 71.1
## Toyota Corona 21.5 4 120.1
## Dodge Challenger 15.5 8 318.0
## AMC Javelin 15.2 8 304.0
## Camaro Z28 13.3 8 350.0
## Pontiac Firebird 19.2 8 400.0
## Fiat X1-9 27.3 4 79.0
## Porsche 914-2 26.0 4 120.3
## Lotus Europa 30.4 4 95.1
## Ford Pantera L 15.8 8 351.0
## Ferrari Dino 19.7 6 145.0
## Maserati Bora 15.0 8 301.0
## Volvo 142E 21.4 4 121.0
The next function we’re going to learn about is the filter function. This function subsets the data by certain conditions. To use the filter function to select cars with 4 cylinders, we would do:
filter(mtcars, cyl == 4)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## 11 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
We can combine conditions just like before:
filter(mtcars, (cyl == 4 | cyl == 8) & mpg < 25)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 2 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 3 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 4 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 5 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 6 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 7 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 8 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 9 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 10 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 11 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 12 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 13 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## 14 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## 15 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 16 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## 17 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 18 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## 19 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
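Two shorthand forms may be useful here: separating conditions with commas inside filter is equivalent to joining them with &, and %in% can stand in for a chain of == comparisons joined by |. A sketch, assuming dplyr is installed:

```r
library(dplyr)

# The condition from the example above
explicit <- filter(mtcars, (cyl == 4 | cyl == 8) & mpg < 25)

# Same rows: %in% replaces the | chain, and the comma acts like &
shorthand <- filter(mtcars, cyl %in% c(4, 8), mpg < 25)

print(nrow(shorthand))
print(isTRUE(all.equal(explicit, shorthand)))
```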
We can also combine the functions. For example, to see the miles per gallon, displacement, and number of cylinders for all the cars with four cylinders, we can do:
filter(select(mtcars, 'mpg', 'cyl', 'disp'), cyl == 4)
## mpg cyl disp
## 1 22.8 4 108.0
## 2 24.4 4 146.7
## 3 22.8 4 140.8
## 4 32.4 4 78.7
## 5 30.4 4 75.7
## 6 33.9 4 71.1
## 7 21.5 4 120.1
## 8 27.3 4 79.0
## 9 26.0 4 120.3
## 10 30.4 4 95.1
## 11 21.4 4 121.0
Note that filter discards row names for data.frames (at least in the version of dplyr used here). This is because the row names are not themselves a column. We’ll work later with tibbles, which are functionally similar to data.frames, but have a few enhancements to make working with them easier. For now, in cases where the row names are important, you can use the subset function in place of filter:
subset(select(mtcars, 'mpg', 'cyl', 'disp'), cyl == 4)
## mpg cyl disp
## Datsun 710 22.8 4 108.0
## Merc 240D 24.4 4 146.7
## Merc 230 22.8 4 140.8
## Fiat 128 32.4 4 78.7
## Honda Civic 30.4 4 75.7
## Toyota Corolla 33.9 4 71.1
## Toyota Corona 21.5 4 120.1
## Fiat X1-9 27.3 4 79.0
## Porsche 914-2 26.0 4 120.3
## Lotus Europa 30.4 4 95.1
## Volvo 142E 21.4 4 121.0
filter and subset are mostly the same and we’ll interchange them as needed in this tutorial. If you’d like to read more about the differences between the two, you can check out this Stack Overflow post.
Note that the code above got complicated very quickly. You can imagine that as you nest more and more function calls on the same line, it gets easier to misplace a parenthesis or a quote, and the code becomes harder to read. To remedy this, you could use intermediate steps, where you save the result of each function call into a new object:
mtcars.mpg_cyl_disp <- select(mtcars, 'mpg', 'cyl', 'disp')
mtcars.mpg_cyl_disp.four_cylinders <- subset(mtcars.mpg_cyl_disp, cyl == 4)
However, this quickly clutters up your workspace with objects and functions that you don’t necessarily need.
The solution to this is to start using %>%, the pipe operator. This operator takes the object on the left and passes it as the first argument to the function call on the right. Liberal use of %>% and proper formatting will make your code much easier to read. For example, a pipeline like the one above (here selecting only mpg and cyl) can be written with the pipe as:
mtcars %>%
select('mpg', 'cyl') %>%
subset(cyl == 4)
## mpg cyl
## Datsun 710 22.8 4
## Merc 240D 24.4 4
## Merc 230 22.8 4
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Toyota Corona 21.5 4
## Fiat X1-9 27.3 4
## Porsche 914-2 26.0 4
## Lotus Europa 30.4 4
## Volvo 142E 21.4 4
We can immediately see that this is far more readable and easier to edit. For example, we can quickly add an additional subset call to get only those cars with higher than 25 miles-per-gallon:
mtcars %>%
select('mpg', 'cyl') %>%
subset(cyl == 4) %>%
subset(mpg > 25)
## mpg cyl
## Fiat 128 32.4 4
## Honda Civic 30.4 4
## Toyota Corolla 33.9 4
## Fiat X1-9 27.3 4
## Porsche 914-2 26.0 4
## Lotus Europa 30.4 4
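Recent versions of R (4.1 and later) also include a native pipe, |>, that covers simple cases like these without loading any packages. The sketch below assumes such a version of R and uses only base functions, so it is an alternative to, not a replacement for, the %>% code above:

```r
# Base R equivalent of the pipeline above, using the native pipe (R >= 4.1)
efficient_four_cyl <- mtcars[, c("mpg", "cyl")] |>
  subset(cyl == 4) |>
  subset(mpg > 25)

print(efficient_four_cyl)
```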
It’s a bit of a pain to type %>% by hand over and over, so get comfortable with the shortcut: Ctrl + Shift + M on Windows and Linux, or Cmd + Shift + M on Mac.
You can find a full list of shortcuts under Tools > Keyboard Shortcuts Help or here.
You can still use everything you learned earlier. For example, base R indexing works at the start of a pipeline:
mtcars[c("Toyota Corolla", "Honda Civic", "Datsun 710"), ] %>%
select("mpg", "cyl")
## mpg cyl
## Toyota Corolla 33.9 4
## Honda Civic 30.4 4
## Datsun 710 22.8 4
For this exercise, we’re going to use the data in the nycflights13 package. Let’s take a look at the data:
library(nycflights13)
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
Take a few moments to read through the documentation with ?flights and do the following tasks:
Read about the arrange function. Use it to find the flights that arrived latest.
We’re usually interested in summarizing a data set. For example, we might be interested in the means of each of the variables in mtcars. One way to do this is to use the mean function on each of the columns individually:
mean(mtcars$mpg)
## [1] 20.09062
# mean(mtcars$cyl)
# mean(mtcars$disp)
# etc...
Alternatively, we can just use colMeans on the entire data frame at once:
colMeans(mtcars)
## mpg cyl disp hp drat wt
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## qsec vs am gear carb
## 17.848750 0.437500 0.406250 3.687500 2.812500
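colMeans only computes means. To apply a different summary function to every column, one option is base R's sapply, which calls a function on each column of a data frame. A brief sketch:

```r
# Median of every column
col_medians <- sapply(mtcars, median)

# sapply with mean reproduces the result of colMeans
col_means <- sapply(mtcars, mean)

print(col_medians)
print(isTRUE(all.equal(col_means, colMeans(mtcars))))
```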
Another useful function is summary:
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
From here, we can then look at which cars have higher than average mpg:
mtcars %>%
subset(mpg > mean(mtcars$mpg))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
And we can sort them by mpg:
mtcars %>%
subset(mpg > mean(mtcars$mpg)) %>%
arrange(mpg) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## 5 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## 6 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Again, because we’re working with a data.frame, the dplyr functions drop the row names. To preserve the row names, we can first reorder the data.frame using Base R (here in descending order of mpg), then subset it:
mtcars[rev(order(mtcars$mpg)), ] %>%
subset(mpg > mean(mtcars$mpg)) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
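rev(order(...)) is one way to sort in descending order. order also accepts a decreasing argument, which avoids the extra call to rev. A sketch follows; note that rows with tied mpg values may come out in a different relative order than with rev:

```r
# Sort mtcars from highest to lowest mpg, keeping the row names
sorted_desc <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]

# Keep only the cars with above-average mpg, as in the example above
above_average <- subset(sorted_desc, mpg > mean(mtcars$mpg))

print(head(above_average[, c("mpg", "cyl")]))
```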
Let’s suppose that we wanted to get the average mpg of cars by the number of cylinders they have. One way we can do this is to filter by the number of cylinders, select the mpg column, and then take the mean:
mtcars %>%
filter(cyl == 4) %>%
select(mpg) %>%
summarise(mean_mpg = mean(mpg))
## mean_mpg
## 1 26.66364
Note summarise, the new function that we used. summarise calculates whatever formula you provide on the data that is passed to it. We’ll see shortly how we can use this to do some more interesting things.
It’s great that we got the mean mpg for the cars with four cylinders, but we still want to know the mean mpg of the other cars. We could repeat this for each number of cylinders to get what we need. However, this is not scalable and relies on knowing all the different levels of the variable you’re looking at, which is not feasible for larger datasets. To remedy this, we can use the group_by and summarise commands from tidyverse. You pass the variable you would like to group the original data by to the group_by function, and from there, you can use the summarise function to get summary statistics by passing different formulas to the function call.
For instance, to get the mean mpg by cylinder count, you can do:
mtcars %>%
group_by(cyl) %>%
summarise(mean = mean(mpg))
## # A tibble: 3 x 2
## cyl mean
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
Here, we separated the data into groups based on the number of cylinders in the car, then performed the same calculation as above.
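For comparison, base R can produce the same grouped means without the tidyverse. One option is aggregate with a formula, sketched here:

```r
# Mean mpg for each value of cyl, using base R's formula interface
mean_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(mean_by_cyl)
```

The result is an ordinary data frame with one row per cylinder count.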
We can also add a count column:
mtcars %>%
group_by(cyl) %>%
summarise(mean = mean(mpg),
count = n())
## # A tibble: 3 x 3
## cyl mean count
## <dbl> <dbl> <int>
## 1 4 26.7 11
## 2 6 19.7 7
## 3 8 15.1 14
Note that if we are just interested in the number of observations per group, we can simply use count:
mtcars %>%
group_by(cyl) %>%
count()
## # A tibble: 3 x 2
## # Groups: cyl [3]
## cyl n
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14
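count also accepts the grouping variable directly, so the group_by step can be folded in. A sketch, assuming dplyr is installed:

```r
library(dplyr)

# Same counts as group_by(cyl) followed by count(), but the result is ungrouped
cyl_counts <- mtcars %>%
  count(cyl)

print(cyl_counts)
```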
For the mtcars dataset:
For the flights dataset:
Use n_distinct to find out how many distinct carriers and destinations there are.
The mutate function lets us add new variables to a data object by specifying a formula. For instance, to add a new column to mtcars showing the weight of the car in pounds (instead of thousands of pounds), we can do:
mtcars %>%
mutate(weight_in_pounds = 1000 * wt) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb weight_in_pounds
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2620
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2875
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2320
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3215
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3440
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3460
We can add multiple variables at once. For example:
mtcars %>%
mutate(weight_in_pounds = 1000 * wt,
hp_per_cylinder = hp / cyl) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb weight_in_pounds
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2620
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2875
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2320
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3215
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3440
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3460
## hp_per_cylinder
## 1 18.33333
## 2 18.33333
## 3 23.25000
## 4 18.33333
## 5 21.87500
## 6 17.50000
And from there, we can do all the other things that we have learned so far:
mtcars %>%
mutate(weight_in_pounds = 1000 * wt,
hp_per_cylinder = hp / cyl) %>%
select(hp, cyl, hp_per_cylinder) %>%
arrange(desc(hp_per_cylinder))
## hp cyl hp_per_cylinder
## 1 335 8 41.87500
## 2 264 8 33.00000
## 3 245 8 30.62500
## 4 245 8 30.62500
## 5 175 6 29.16667
## 6 230 8 28.75000
## 7 113 4 28.25000
## 8 109 4 27.25000
## 9 215 8 26.87500
## 10 205 8 25.62500
## 11 97 4 24.25000
## 12 95 4 23.75000
## 13 93 4 23.25000
## 14 91 4 22.75000
## 15 180 8 22.50000
## 16 180 8 22.50000
## 17 180 8 22.50000
## 18 175 8 21.87500
## 19 175 8 21.87500
## 20 123 6 20.50000
## 21 123 6 20.50000
## 22 150 8 18.75000
## 23 150 8 18.75000
## 24 110 6 18.33333
## 25 110 6 18.33333
## 26 110 6 18.33333
## 27 105 6 17.50000
## 28 66 4 16.50000
## 29 66 4 16.50000
## 30 65 4 16.25000
## 31 62 4 15.50000
## 32 52 4 13.00000
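Within a single mutate call, later expressions can refer to columns created earlier in the same call. The sketch below builds on weight_in_pounds to add a hypothetical weight_in_kg column, which is not part of the original tutorial (it assumes dplyr is installed):

```r
library(dplyr)

# 1 pound is defined as exactly 0.45359237 kilograms
KG_PER_POUND <- 0.45359237

weights <- mtcars %>%
  mutate(weight_in_pounds = 1000 * wt,
         weight_in_kg = weight_in_pounds * KG_PER_POUND) %>%  # uses the new column
  select(wt, weight_in_pounds, weight_in_kg)

print(head(weights))
```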
For the mtcars dataset:
What is the mean hp_per_cylinder by cylinder count?
For the flights dataset:
Create a new variable gain, defined as the difference between the departure delay and arrival delay of a flight. What does this variable represent? What is the average gain for all flights in 2013? By airline?
Next section: ggplot2 basics
Paul Villanueva
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University, Ames, IA.