According to the first-order difference criterion, the optimal clustering similarity was 98%. However, for various reasons (mostly due to physical limitations), we ended up choosing a 95% similarity threshold.
After determining this, we then used the representative sequences found in 0.95.fa
and queried them against our metagenomes. Thus, for each metagenome, we had an output file containing all the relevant hits. We then used a Python script to summarize the abundance and presence of each these representative sequences across our metagenomes. This information can be found in data/abundance_presence.tsv
.
Our job now is to determine which of these clusters to include. While it would be nice to design primers for all of the sequences and
Our goal is to reproduce the following graph:
We’ll begin by learning about visualizations. I have an introduction to ggplot2 here, but we will begin with quick review before we start tackling this one.
We’ll begin by loading in our required packages.
library(ggplot2)
library(tidyverse)
Before we tackle our goal, let’s warm up on mtcars
.
Let’s load in the tft_stats.tsv
data set.
tft_stats <- read.delim("data/practice_data/tft_stats.tsv")
tft_stats
## rank percent
## 1 Iron IV 0.00
## 2 Iron III 0.08
## 3 Iron II 6.61
## 4 Iron I 14.59
## 5 Bronze IV 17.37
## 6 Bronze III 14.47
## 7 Bronze II 10.56
## 8 Bronze I 7.74
## 9 Silver IV 6.72
## 10 Silver III 5.02
## 11 Silver II 4.02
## 12 Silver I 2.91
## 13 Gold IV 3.15
## 14 Gold III 2.18
## 15 Gold II 1.58
## 16 Gold I 0.96
## 17 Platinum IV 1.07
## 18 Platinum III 0.44
## 19 Platinum II 0.23
## 20 Platinum I 0.11
## 21 Diamond IV 0.12
## 22 Diamond III 0.04
## 23 Diamond II 0.02
## 24 Diamond I 0.01
## 25 Master - 0.00
## 26 GrandMaster - 0.00
## 27 Challenger - 0.00
We can quickly visualize this by throwing it into ggplot
:
tft_stats %>%
ggplot(aes(rank, percent)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90))
Note that ggplot
will automatically reorder your data if it doesn’t think it’s sorted. We can get around this by telling ggplot
that the data already is in the correct order (adapted from this Stack Overflow post):
tft_stats$rank <- factor(tft_stats$rank, levels = tft_stats$rank)
Now, when we plot the data frame, the order will remain:
tft_stats %>%
ggplot(aes(rank, percent)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90))
Let’s go ahead and fix this up:
tft_stats %>%
ggplot(aes(rank, percent)) +
theme_light() +
geom_col() +
labs(x = "Rank",
y = "Percent of player base",
title = "Distribution of players across ranks in Team Fight Tactics",
subtitle = "through August 2019; source: lolchess.gg") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We’re going to use the mutate
function to add some more information to this plot. mutate
lets us add columns to a data frame based on other columns. For instance, we can add a column to mtcars
to store the ratio between hp
and wt
:
mtcars %>%
mutate(hp_to_wt = hp / wt) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb hp_to_wt
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 41.98473
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 38.26087
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 40.08621
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 34.21462
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 50.87209
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 30.34682
And we can go ahead and plot this new variable right away:
mtcars %>%
mutate(hp_to_wt = hp / wt) %>%
ggplot(aes(x = hp / wt, y = qsec)) +
geom_point() +
labs(x = "Horsepower-to-weight ratio (hp / lb)",
y = "Quarter-mile time (seconds)",
title = "Horsepower-to-weight ratio vs. quarter-mile time") +
theme_light()
Just like with summarise
, we can create multiple new variables at the same time with mutate
:
mtcars %>%
mutate(hp_to_wt = hp / wt,
proj_hsec = 2 * qsec)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_to_wt proj_hsec
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 41.98473 32.92
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 38.26087 34.04
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 40.08621 37.22
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 34.21462 38.88
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 50.87209 34.04
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 30.34682 40.44
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 68.62745 31.68
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 19.43574 40.00
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 30.15873 45.80
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 35.75581 36.60
## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 35.75581 37.80
## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 44.22604 34.80
## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 48.25737 35.20
## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 47.61905 36.00
## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 39.04762 35.96
## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 39.63864 35.64
## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 43.03087 34.84
## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 30.00000 38.94
## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 32.19814 37.04
## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 35.42234 39.80
## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 39.35091 40.02
## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 42.61364 33.74
## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 43.66812 34.60
## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 63.80208 30.82
## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 45.51365 34.10
## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 34.10853 37.80
## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 42.52336 33.40
## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 74.68605 33.80
## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 83.28076 29.00
## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 63.17690 31.00
## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 93.83754 29.20
## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 39.20863 37.20
Let’s go ahead and use mutate
to add a cumulative percentage column to tft_stats
. This is done with the cumsum
function:
tft_stats %>%
mutate(cumulative_percent = cumsum(percent))
## rank percent cumulative_percent
## 1 Iron IV 0.00 0.00
## 2 Iron III 0.08 0.08
## 3 Iron II 6.61 6.69
## 4 Iron I 14.59 21.28
## 5 Bronze IV 17.37 38.65
## 6 Bronze III 14.47 53.12
## 7 Bronze II 10.56 63.68
## 8 Bronze I 7.74 71.42
## 9 Silver IV 6.72 78.14
## 10 Silver III 5.02 83.16
## 11 Silver II 4.02 87.18
## 12 Silver I 2.91 90.09
## 13 Gold IV 3.15 93.24
## 14 Gold III 2.18 95.42
## 15 Gold II 1.58 97.00
## 16 Gold I 0.96 97.96
## 17 Platinum IV 1.07 99.03
## 18 Platinum III 0.44 99.47
## 19 Platinum II 0.23 99.70
## 20 Platinum I 0.11 99.81
## 21 Diamond IV 0.12 99.93
## 22 Diamond III 0.04 99.97
## 23 Diamond II 0.02 99.99
## 24 Diamond I 0.01 100.00
## 25 Master - 0.00 100.00
## 26 GrandMaster - 0.00 100.00
## 27 Challenger - 0.00 100.00
And we can now add this line to our graph as another layer:
tft_stats %>%
mutate(cumulative = cumsum(percent)) %>%
ggplot(aes(x = rank, group = 1)) +
theme_light() +
geom_col(aes(y = percent)) +
geom_line(aes(y = cumulative), color = "red")
There are several things wrong with the tft_stats
graph.
mutate
to correct this.max
function) will get you part of the way there. What can you multiply this rescaled number by to fit with the rest of the graph? What do you think about the clarity of this graph?Read in abundance_presence.tsv
with:
presence_abundance <- read.delim("data/abundance_presence.tsv")
Recreate the graph below:
Some hints:
tft_stats
graph with the following formula:tft_stats %>%
mutate(percent = percent / 100,
cumulative = cumsum(percent) * max(percent))
## rank percent cumulative
## 1 Iron IV 0.0000 0.00000000
## 2 Iron III 0.0008 0.00013896
## 3 Iron II 0.0661 0.01162053
## 4 Iron I 0.1459 0.03696336
## 5 Bronze IV 0.1737 0.06713505
## 6 Bronze III 0.1447 0.09226944
## 7 Bronze II 0.1056 0.11061216
## 8 Bronze I 0.0774 0.12405654
## 9 Silver IV 0.0672 0.13572918
## 10 Silver III 0.0502 0.14444892
## 11 Silver II 0.0402 0.15143166
## 12 Silver I 0.0291 0.15648633
## 13 Gold IV 0.0315 0.16195788
## 14 Gold III 0.0218 0.16574454
## 15 Gold II 0.0158 0.16848900
## 16 Gold I 0.0096 0.17015652
## 17 Platinum IV 0.0107 0.17201511
## 18 Platinum III 0.0044 0.17277939
## 19 Platinum II 0.0023 0.17317890
## 20 Platinum I 0.0011 0.17336997
## 21 Diamond IV 0.0012 0.17357841
## 22 Diamond III 0.0004 0.17364789
## 23 Diamond II 0.0002 0.17368263
## 24 Diamond I 0.0001 0.17370000
## 25 Master - 0.0000 0.17370000
## 26 GrandMaster - 0.0000 0.17370000
## 27 Challenger - 0.0000 0.17370000
However, if you try to do the same thing with the presence_abundance
data frame, you won’t get the desired result. This is because the formula above is actually shorthand for:
tft_stats %>%
mutate(percent = percent / 100,
cumulative = cumsum(percent) / 1 * max(percent))
## rank percent cumulative
## 1 Iron IV 0.0000 0.00000000
## 2 Iron III 0.0008 0.00013896
## 3 Iron II 0.0661 0.01162053
## 4 Iron I 0.1459 0.03696336
## 5 Bronze IV 0.1737 0.06713505
## 6 Bronze III 0.1447 0.09226944
## 7 Bronze II 0.1056 0.11061216
## 8 Bronze I 0.0774 0.12405654
## 9 Silver IV 0.0672 0.13572918
## 10 Silver III 0.0502 0.14444892
## 11 Silver II 0.0402 0.15143166
## 12 Silver I 0.0291 0.15648633
## 13 Gold IV 0.0315 0.16195788
## 14 Gold III 0.0218 0.16574454
## 15 Gold II 0.0158 0.16848900
## 16 Gold I 0.0096 0.17015652
## 17 Platinum IV 0.0107 0.17201511
## 18 Platinum III 0.0044 0.17277939
## 19 Platinum II 0.0023 0.17317890
## 20 Platinum I 0.0011 0.17336997
## 21 Diamond IV 0.0012 0.17357841
## 22 Diamond III 0.0004 0.17364789
## 23 Diamond II 0.0002 0.17368263
## 24 Diamond I 0.0001 0.17370000
## 25 Master - 0.0000 0.17370000
## 26 GrandMaster - 0.0000 0.17370000
## 27 Challenger - 0.0000 0.17370000
What does the 1 represent? How can we apply that to presence_abundance
?
We have most of what we need to make the desired visualization. All we need now is to replace the black line with the first-order differences, plot that, and put a vertical line to show the inclusion threshold. Just like when we calculated the first-order difference on the command line, the actual code is beyond the scope of this class. Run the following lines to add the first-order difference information to the presence_abundance
data frame:
presence_abundance <- presence_abundance %>%
mutate(cumulative_percent = cumsum(abundance) / sum(abundance) * max(abundance))
fo_difference <- function(pos){
left <- (presence_abundance[pos, 4] - presence_abundance[1, 4]) / pos
right <- (presence_abundance[nrow(presence_abundance), 4] - presence_abundance[pos, 4]) / (nrow(presence_abundance) - pos)
return(left - right)
}
presence_abundance$fo_diffs <- sapply(1:nrow(presence_abundance), fo_difference)
presence_abundance[is.na(presence_abundance)] <- 0
Let’s go ahead and plot the first-order differences against the cumulative percentage:
presence_abundance %>%
mutate(fo_diffs = fo_diffs / max(fo_diffs) * max(abundance)) %>%
ggplot(aes(x = reorder(gene, -abundance), group = 1)) +
theme_light() +
geom_line(aes(y = fo_diffs)) +
geom_line(aes(y = cumulative_percent), color = "red", linetype = "dashed") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
labs(x = "",
y = "Abundance",
title = "Abundance of representative genes in sample metagenomes")
All that’s left to do is to identify the cluster that achieves the maximal first-order difference. This can be done with the following:
(elbow <- which.max(presence_abundance$fo_diffs))
## [1] 6
Now we can add a vertical line to the plot with geom_vline
:
presence_abundance %>%
mutate(fo_diffs = fo_diffs / max(fo_diffs) * max(abundance)) %>%
ggplot(aes(x = reorder(gene, -abundance), group = 1)) +
theme_light() +
geom_line(aes(y = fo_diffs)) +
geom_line(aes(y = cumulative_percent), color = "red", linetype = "dashed") +
geom_vline(xintercept = elbow, color = "red") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
labs(x = "",
y = "Abundance",
title = "Abundance of representative genes in sample metagenomes")
geom_vline
?Now that we have decided on a cluster inclusion threshold, we can now prepare a fasta
file for input into a primer design software. So far, we have been working with the protein sequences. EcoFunPrimer, the primer design software that we are preparing this for, requires a fasta
file of nucleotide sequences. The file amoa_nuc.fa
contains the corresponding nucleotide sequences for each of the protein sequences we have. However, the sequence names are not the same between the two files. We have the mapping between the protein and nucleotide sequences in the protein_nucleotide_mapping.tsv
. Let’s take a look at it:
head data/sequence_info/protein_nucleotide_mapping.tsv
## prot_id nuc_id
## AAA66194 L08050
## AAB08985 U51630
## AAB16816 U72670
## AAB38709 U76553
## AAB38710 U76552
## AAB48015 U15733
## AAB48534 U89833
## AAB51760 U91603
## AAB53437 U92432
We know enough to extract the nucleotide sequence ID from the mapping file for each of the protein clusters that we’re including, and use that to search the amoa_nuc.fa
file for the sequences.
presence_abundance[1:6, ]
## gene abundance presence cumulative_percent fo_diffs
## 1 AAB38709 4100 266 1283.696 -90.84851
## 2 ABM54175 1846 254 1861.672 214.37724
## 3 AAC25057 1349 234 2284.040 270.82856
## 4 SEF68642 1249 255 2675.097 296.96094
## 5 KIO48008 1015 189 2992.890 300.83481
## 6 ADE13856 1007 114 3308.179 306.95911
presence_abundance
data frame to include all the sequences up until the inclusion threshold and extract the gene name column. Read the documentation for write_tsv
to write this out to a file called abundant_genes.tsv
. Write this without the column names.protein_nucleotide_mapping.tsv
for the corresponding nucleotide ID sequences. In other words, we want to grep protein_nucleotide_mapping.tsv
with multiple search terms. Note that this is the opposite situation from what we did at the beginning of this workshop where we searched multiple files for one search term. Look through the grep
documentation and find out which flag to use to use a file for input. Extract only the nucleotide sequence from this output.grep
command to search the amoa_nuc.fa
file for these sequences. This will give us the sequence headers, but we also want the actual sequences. Look through the grep
documentation (or search online) to figure out how to include a certain number of lines after a successful match. How many lines should we include afterwards? What are some considerations regarding fasta
file content that we have to keep in mind when doing this?And that’s it! Go ahead and save this to a file called abundant_genes.fa
.
Paul Villanueva
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University, Ames, IA.