I’m currently developing a primer design pipeline for the GERMS lab. The pipeline takes as input a fasta
file containing the genes of interest and a directory of metagenomes. The pipeline then clusters the genes, identifies which genes are abundant in the submitted metagenomes, and designs primers for those sequences. See below for an overview of the process:
Today, we will working on the Filter step. This step is comprised of the following:
Querying the sequences against all the metagenomes takes more computational resources and time than are available to us today, so we will only be reproducing steps 1, 3, and 4.
The first step in the filtering process is to choose an optimum clustering threshold for our sequences. Sequences don’t have to be 100% similar to code for the same function, so by clustering the genes together at a lower similarity threshold may allow us to hit all of our gene targets with fewer probes.
To perform the clustering, we have used the CD-Hit program. CD-Hit accepts as input a fasta
file and a similarity threshold and sorts the sequences into clusters. The output from this program are in data/cluster_information
folder. They are of form x.xx.fa
and x.xx.fa.clstr
, where:
x.xx
is the similarity threshold.fa.clstr
contains the clustering information.fa
contains the sequence of each cluster representativeNow that we have all the clustering information from CD-Hit for our target sequences of varying thresholds, our job is to collect all of this information, use it to select an optimal similarity threshold, and collect the representative sequences for querying against our metagenomes.
Let’s begin with a review of grep
.
The grep
function searches through a file for a search term. It has the following basic syntax
grep <flags> search_term file
For example, if we wanted to find if the word “summer” was in summer_days.txt
, we would do the following:
grep "summer" data/practice_data/summer_days.txt
## Summer days (like the summer days)
## Summer days (like the summer days)
## Summer days (like the summer days)
## Summer days (like the summer days)
This shows us the four lines that the word “summer” was found on. Let’s see what happens when we look for a word that isn’t there:
grep "winter" data/practice_data/summer_days.txt
Nothing pops up, as expected.
grep
has a ton of flags that we can pass at the command line, only a few of which we’ll cover today. For example, we can pass the -v
flag to invert the match, returning every line on which the match wasn’t found. So, to find all the lines on which “summer” doesn’t appear, we would do:
grep -v "summer" data/practice_data/summer_days.txt
## Summer Days - Sebastian Yatra
## I don't wanna let you go
## I don't wanna let you go
## Why you wanna break my heart?
## Try to be honest
## We can make it on our own
## We can make it on our own
## You just gotta give me time
## Just stay here through August
##
## Please, believe in me
## Like your wildest dreams
## Like your wildest dreams
## Forever free
## Sail away with me
## Don't leave
##
## Dance on the ocean
## They be playing our song
## Hold on to meet one more time
## Dance on the ocean
## Don't kiss me goodbye
## Don't know what you think you will find
##
## Summer days
## Summer days
##
## Yatra, yatra
## Fue culpa del destino y te quedaste sola
## Te puso en mi camino y nos liego la hora
## Contigo a media noche quiero estar a solas
## Lo que tu quieras te haré
## Mi corazón se estalla tocandote en la playa
## Sintiendo que la arena se convierta en amor
## Y un beso es tan profundo
## Que para los segundos
## Sintiendo que el verano se convierta en amor
##
## Dance on the ocean
## They be playing our song
## Hold on to meet one more time
## Dance on the ocean
## Don't kiss me goodbye
## Don't know what you think you will find
##
## Summer days
## Summer days
##
## Dance on the ocean
## They be playing our song
## Hold on to meet one more time
## Dance on the ocean
## Don't kiss me goodbye
## Don't know what you think you will find
##
##
## Summer days
man
page (via man grep
) and look for the flag to make the grep
search case-insensitive.|
operator to pipe the output from the above search to wc
. How many lines was summer found on? Is there a smarter way to do this? Hint: check the man
page.By default, grep
only outputs the lines on which the match was found. This can lead to situations where using the -c
flag to count occurrences of a term is inaccurate. For example, let’s find all the lines on which the word “sunday” appears in sunday.txt
:
grep -i sunday data/practice_data/sunday.txt
## Sunday (feat. Heize, Jay Park) - GroovyRoom
## Sunday, Sunday
## Sunday, Sunday
## Sunday, Sunday
## Sunday, the day I look forward to the most
## Sunday, Sunday
## Sunday, Sunday
## Sunday, Sunday
## My Monday to Sunday
## Sunday, Sunday
## Sunday, Sunday
## Sunday, Sunday
Now, let’s see what -c
says:
grep -ic sunday data/practice_data/sunday.txt
## 12
We can correct this by using the -o
flag, which tells grep
to output each occurrence of the search term on a separate line:
grep -io sunday data/practice_data/sunday.txt
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
## Sunday
From here, we can pipe to wc
to get the number of occurrences:
grep -io sunday data/practice_data/sunday.txt | wc -l
## 21
monday.txt
? How many occurrences of “Monday” are there?fasta
filesNow that we have a bit of practice with grep
, let’s return to the problem at hand. The data/sequence_info
directory has the nucleotide and protein sequences for our genes of interest. Let’s take a look at them:
head data/sequence_info/amoa_prot.fa
## >BAA92239 coded_by=987..1817,organism=Nitrosomonas sp. TK794,definition=ammonia monooxygenase a
## msifrteeilkaakmppeavhmsrlidavyfpilvvllvgtyhmhfmllagdwdfwmdwkdrqwwpvvtpivgitycsaimyylwvnyrqpfgatlcvvclligewltrywgfywwshyplnfvtpgimlpgalmldftlyltrnwlvtalvgggffglmfxpgnwpifgpthlpivvegtllsmadymghlyvrtgtpeyvrhieqgslrtfgghttviaaffaafvsmlmfavwwylgkvyctaffyvkgkrgrivqrndvtafgeegfpegik
## >BAB84331 coded_by=1073..1861,organism=Nitrosomonas sp. ENI-11,definition=ammonia monooxygenase subunit A2
## mppeavhmsrlidavyfpilivllvgtyhmhfmllagdwdfwmdwkdrqwwpvvtpivgitycsaimyylwvnyrqpfgatlcvvclligewltrywgfywwshypinfvtpgimlpgalmldftmyltrnwlvtalvgggffgvlfypgnwpifgpthlpivvegtllsmadymghmyvrtgtpeyvrhieqgslrtfgghttviaaffsafvsmlmftvwwylgkvyctaffyvkgkrgrivhrndvtafgeegfpegik
## >BAB84328 coded_by=1073..1861,organism=Nitrosomonas sp. ENI-11,definition=ammonia monooxygenase subunit A-1
## mppeavhmsrlidavyfpilivllvgtyhmhfmllagdwdfwmdwkdrqwwpvvtpivgitycsaimyylwvnyrqpfgatlcvvclligewltrywgfywwshypinfvtpgimlpgalmldftmyltrnwlvtalvgggffgvlfypgnwpifgpthlpivvegtllsmadymghmyvrtgtpeyvrhieqgslrtfgghttviaaffsafvsmlmftvwwylgkvyctaffyvkgkrgrivhrndvtafgeegfpegik
## >SET51056 coded_by=1286..2116,organism=Nitrosomonas europaea,definition=ammonia monooxygenase subunit A
## msifrteeilkaakmppeavhmsrlidavyfpiliillvgtyhmhfmllagdwdfwmdwkdrqwwpvvtpivgitycsaimyylwvnyrqpfgatlcvvclligewltrywgfywwshypinfvtpgimlpgalmldftlyltrnwlvtalvgggffgllfypgnwpifgpthlpivvegtllsmadymghlyvrtgtpeyvrhieqgslrtfgghttviaaffsafvsmlmftvwwylgkvyctaffyvkgkrgrivhrndvtafgeegfpegik
## >SDW98167 coded_by=1286..2116,organism=Nitrosomonas europaea,definition=ammonia monooxygenase subunit A
## msifrteeilkaakmppeavhmsrlidavyfpiliillvgtyhmhfmllagdwdfwmdwkdrqwwpvvtpivgitycsaimyylwvnyrqpfgatlcvvclligewltrywgfywwshypinfvtpgimlpgalmldftlyltrnwlvtalvgggffgllfypgnwpifgpthlpivvegtllsmadymghlyvrtgtpeyvrhieqgslrtfgghttviaaffsafvsmlmftvwwylgkvyctaffyvkgkrgrivhrndvtafgeegfpegik
head data/sequence_info/amoa_nuc.fa
## >AB031869 location=987..1817,organism=Nitrosomonas sp. TK794,definition=ammonia monooxygenase a
## gtgagtatatttagaacagaagagatcctgaaagcggccaagatgccgccggaagcggtccatatgtcacgcctgattgatgcggtttattttccgattctggttgttctgttggtaggtacctaccatatgcacttcatgttgttggcaggtgactgggatttctggatggactggaaagatcgtcaatggtggcctgtagtaacacctattgtaggcattacctattgctcggcaattatgtattacctgtgggtcaactaccgtcaaccatttggtgcgactctgtgcgtagtgtgtttgctgataggtgagtggctgacacgttactggggtttctactggtggtcacactatccactcaattttgtaaccccaggtatcatgctcccgggtgcattgatgttggatttcacgctgtatctgacacgtaactggttggtgactgcattggttgggggtggattctttggcctgatgtttnnnccaggtaactggccaatctttggcccgacccatctgccaatcgttgtagaaggaacactgttgtcgatggctgactacatgggtcacctgtatgttcgtacgggtacacctgagtatgttcgtcatattgaacaaggttcattacgtacctttggtggtcacaccacagttattgcggcattcttcgctgcgtttgtatccatgctgatgtttgcagtctggtggtatcttggaaaagtttactgcacagccttcttctacgttaaaggtaaaagaggacgtattgtacagcgcaatgatgttacggcatttggtgaagaagggtttccagaggggatcaaa
## >AB079055 location=1073..1861,organism=Nitrosomonas sp. ENI-11,definition=ammonia monooxygenase subunit A2
## atgccaccggaagcggttcatatgtcacgtctgattgacgcggtttattttccaattctgattgttttgctggtaggtacctaccatatgcacttcatgctgctggcaggtgactgggatttctggatggactggaaagatcgtcaatggtggccggttgtgacaccgatcgtaggaattacctactgctcggcaatcatgtattacctgtgggtcaactaccgccaaccgtttggtgcaacactgtgtgtagtgtgtctgttgattggtgaatggctgacacgttactggggattctactggtggtcacactatcctatcaatttcgtaacaccgggtattatgcttccaggtgcgctgatgctggatttcacaatgtatctgacacgcaactggctggtgacggctctggttgggggtggattcttcggtgtgctgttctatccgggtaactggccgatctttggtccgacacatctgccaatcgtcgtagaaggaacactgttgtcgatggctgactacatgggccatatgtatgttcgtacgggtacacctgagtatgttcgtcatattgagcaaggttcactgcgtacctttggtggacataccacagttattgcagcgttcttctctgcattcgtatcaatgttgatgttcacagtatggtggtatcttggaaaagtttactgtacagcctttttctacgttaaaggtaaaagaggtcgtatcgtacatcgcaatgatgttaccgcattcggtgaagaaggctttccagaggggatcaaa
## >AB079054 location=1073..1861,organism=Nitrosomonas sp. ENI-11,definition=ammonia monooxygenase subunit A-1
## atgccaccggaagcggttcatatgtcacgtctgattgacgcggtttattttccaattctgattgttttgctggtaggtacctaccatatgcacttcatgctgctggcaggtgactgggatttctggatggactggaaagatcgtcaatggtggccggttgtgacaccgatcgtaggaattacctactgctcggcaatcatgtattacctgtgggtcaactaccgccaaccgtttggtgcaacactgtgtgtagtgtgtctgttgattggtgaatggctgacacgttactggggattctactggtggtcacactatcctatcaatttcgtaacaccgggtattatgcttccaggtgcgctgatgctggatttcacaatgtatctgacacgcaactggctggtgacggctctggttgggggtggattcttcggtgtgctgttctatccgggtaactggccgatctttggtccgacacatctgccaatcgtcgtagaaggaacactgttgtcgatggctgactacatgggccatatgtatgttcgtacgggtacacctgagtatgttcgtcatattgagcaaggttcactgcgtacctttggtggacataccacagttattgcagcgttcttctctgcattcgtatcaatgttgatgttcacagtatggtggtatcttggaaaagtttactgtacagcctttttctacgttaaaggtaaaagaggtcgtatcgtacatcgcaatgatgttaccgcattcggtgaagaaggctttccagaggggatcaaa
## >FOID01000069 location=1286..2116,organism=Nitrosomonas europaea,definition=ammonia monooxygenase subunit A
## gtgagtatatttagaacggaagaaatcctgaaagcggccaagatgccgccggaagcggttcatatgtcacgtctgattgatgcagtttattttccaattctgattattttgctggtgggtacctaccacatgcactttatgctgctggcaggtgactgggatttctggatggactggaaagatcgtcaatggtggccggttgtaacgccaatcgtggggatcacctactgttcggcaatcatgtattacttgtgggtcaactaccgccaaccgtttggtgcaacgttgtgtgtggtgtgtctgctgattggtgagtggctgacacgttactggggattctactggtggtcacactaccccatcaacttcgtaacaccgggcattatgcttccgggtgcgctgatgctggacttcacgctgtatctgacacgcaactggctggtgacggctctggttggaggtggattcttcggtctgctgttctatccgggtaactggccgatttttggaccaacccatttgccaatcgttgtagaaggcacattgctgtcgatggctgattacatgggacatctgtatgttcgtacaggtacacccgagtatgttcgtcatattgagcaaggttcactgcgtacctttggtggtcataccacagttattgcagcattcttctctgcgttcgtatcaatgttgatgttcaccgtatggtggtatcttggaaaagtttactgtacagcctttttctacgttaaaggtaaaagaggtcgtatcgtacatcgcaatgatgttaccgcattcggtgaagaaggctttccagaggggatcaaa
## >FNNX01000068 location=1286..2116,organism=Nitrosomonas europaea,definition=ammonia monooxygenase subunit A
## gtgagtatatttagaacggaagaaatcctgaaagcggccaagatgccgccggaagcggttcatatgtcacgtctgattgatgcagtttattttccaattctgattattttgctggtgggtacctaccacatgcactttatgctgctggcaggtgactgggatttctggatggactggaaagatcgtcaatggtggccggttgtaacgccaatcgtggggatcacctactgttcggcaatcatgtattacttgtgggtcaactaccgccaaccgtttggtgcaacgttgtgtgtggtgtgtctgctgattggtgagtggctgacacgttactggggattctactggtggtcacactaccccatcaacttcgtaacaccgggcattatgcttccgggtgcgctgatgctggacttcacgctgtatctgacacgcaactggctggtgacggctctggttggaggtggattcttcggtctgctgttctatccgggtaactggccgatttttggaccaacccatttgccaatcgttgtagaaggcacattgctgtcgatggctgattacatgggacatctgtatgttcgtacaggtacacccgagtatgttcgtcatattgagcaaggttcactgcgtacctttggtggtcataccacagttattgcagcattcttctctgcgttcgtatcaatgttgatgttcaccgtatggtggtatcttggaaaagtttactgtacagcctttttctacgttaaaggtaaaagaggtcgtatcgtacatcgcaatgatgttaccgcattcggtgaagaaggctttccagaggggatcaaa
amoa_nuc.fa
and amoa_prot.fa
files represent nucleotide and protein sequences of the same gene. What are some sanity checks you can do to make sure these files at least look correct?fasta
files that you can use to count the number of sequences in a file?.fa.clstr
files. How are those files structured? How would you count the number of clusters in a single file? How can you expand this command to do this over all the files?cluster_counts.tsv
:## 0.80 3
## 0.81 3
## 0.82 4
## 0.83 5
## 0.84 5
## 0.85 6
## 0.86 8
## 0.87 8
## 0.88 11
## 0.89 11
## 0.90 14
## 0.91 20
## 0.92 22
## 0.93 25
## 0.94 27
## 0.95 32
## 0.96 60
## 0.97 96
## 0.98 174
## 0.99 355
## 1.00 760
So we now have the clustering information for the different similarity thresholds. To decide the threshold at which we’re going to cluster, we’re going to choose the point on this curve that maximizes the first-order difference (see here for more info). Writing this code is outside the scope of this class, so for now, copy and paste the lines below into your terminal:
cat data/cluster_information/cluster_counts.tsv | awk -F "\t" 'BEGIN {OFS="\t"} NR==1{max==2} {left = (max - $2) / NR; NR==20 ? right = 0: right = (-$2) / (20 - NR) ; print $1 OFS $2 OFS left - right}' > temp
mv temp cluster_counts.tsv
What the code above did was calculate the first-order differences at each point along the similarity curve, saved it to a temporary file temp
, and overwrote the original cluster_counts.tsv
.
Before we go any further, let’s review cut
and sort
.
The cut
command returns the indicated columns from your file. For this example, we’re going to work with the tft_stats.tsv
file. tft_stats.tsv
(adapted from here) shows the distribution of the player base across the various rankings of Teamfight Tactics. Let’s see what it looks like:
cat data/practice_data/tft_stats.tsv
## Iron IV 0.00
## Iron III 0.08
## Iron II 6.61
## Iron I 14.59
## Bronze IV 17.37
## Bronze III 14.47
## Bronze II 10.56
## Bronze I 7.74
## Silver IV 6.72
## Silver III 5.02
## Silver II 4.02
## Silver I 2.91
## Gold IV 3.15
## Gold III 2.18
## Gold II 1.58
## Gold I 0.96
## Platinum IV 1.07
## Platinum III 0.44
## Platinum II 0.23
## Platinum I 0.11
## Diamond IV 0.12
## Diamond III 0.04
## Diamond II 0.02
## Diamond I 0.01
## Master - 0.00
## GrandMaster - 0
## Challenger - 0
A call of the cut
function looks like:
cut -d <char> -f <n> <file>
where:
file
is your input file-d
is the delimiter; cut
will separate the columns of file
based on this argument-f
is the column you wantBy default, cut
assumes that the file is tab-delimited. Since tft_stats.tsv
is tab-delimited, we can go ahead and pull, say, just the first column:
cut -f 1 data/practice_data/tft_stats.tsv
## Iron
## Iron
## Iron
## Iron
## Bronze
## Bronze
## Bronze
## Bronze
## Silver
## Silver
## Silver
## Silver
## Gold
## Gold
## Gold
## Gold
## Platinum
## Platinum
## Platinum
## Platinum
## Diamond
## Diamond
## Diamond
## Diamond
## Master
## GrandMaster
## Challenger
If we wanted the first two columns, we can do:
cut -f 1,2 data/practice_data/tft_stats.tsv
## Iron IV
## Iron III
## Iron II
## Iron I
## Bronze IV
## Bronze III
## Bronze II
## Bronze I
## Silver IV
## Silver III
## Silver II
## Silver I
## Gold IV
## Gold III
## Gold II
## Gold I
## Platinum IV
## Platinum III
## Platinum II
## Platinum I
## Diamond IV
## Diamond III
## Diamond II
## Diamond I
## Master -
## GrandMaster -
## Challenger -
and etc. Another usual function is sort
. This function will sort a file by the indicated column.
A call to the sort
function looks like:
cut -t <char> -k <n> <file>
where:
file
is your input file-t
is the delimiter; sort
will separate the columns of file
based on this argument-k
is the column you wish to sort onLet’s try sorting the file based on the third column:
sort -k 3 data/practice_data/tft_stats.tsv
## Challenger - 0
## GrandMaster - 0
## Iron IV 0.00
## Master - 0.00
## Diamond I 0.01
## Diamond II 0.02
## Diamond III 0.04
## Iron III 0.08
## Platinum I 0.11
## Diamond IV 0.12
## Platinum II 0.23
## Platinum III 0.44
## Gold I 0.96
## Platinum IV 1.07
## Gold II 1.58
## Bronze II 10.56
## Bronze III 14.47
## Iron I 14.59
## Bronze IV 17.37
## Gold III 2.18
## Silver I 2.91
## Gold IV 3.15
## Silver II 4.02
## Silver III 5.02
## Iron II 6.61
## Silver IV 6.72
## Bronze I 7.74
sort
will sort the column based on string value. Look up the flag to sort based on numeric value. Look up the flag to reverse the sorting. Sort tft_stats.tsv
based on the third column. Which division and rank had the most players in it?ephraims_cards.tsv
is a list of my friend’s cards, their market value, the quantity he has of each, and the total combined. Use cut
to access the third column and the fourth column. Use sort
to find out which card has the most value, then use it again to find out which combination of card and quantity has the most value.cluster_counts.tsv
. Recall that we are trying to determine the optimum threshold at which to cluster our sequences, and that the criterion we’re using is the first-order difference. Which similarity threshold should we cluster our sequences at?
Paul Villanueva
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University, Ames, IA.