SPAdes is a de Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate, assembly will likely get much easier and involve new tools). Genome assembly should only be used in three situations: when you cannot find a reference genome that is close to your own, when you are engaged in metagenomic projects where you don't know what organisms may be present, and when you believe you may have novel sequence insertions into a genome of interest. Note that in the last case you would actually want to grab the reads that do not map to your reference genome (and their pairs in the case of paired-end and mate-pair sequencing) and assemble those, rather than assembling the fastq files you get from the raw sequencing.
A note about read preprocessing
While not explicitly covered here, adapter sequences left on reads can significantly complicate and degrade an assembly. If you use this tutorial on your own samples, make sure you are working with the best data possible: in this case, reads lacking adapters.
For those looking for a real challenge, go through the multiqc tutorial and the trimmomatic tutorial, and use the information provided here to compare assemblies of some of the same samples in both cases.
- Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Unfortunately, SPAdes does not exist as a module for loading on TACC, nor is it available in the BioITeam materials. However, it is available as binaries through the SPAdes website, is well supported, and doesn't require complex dependencies, which makes it easy to install.
Why is it not provided for you? In my opinion there are a few reasons:
- Generally speaking, while SPAdes is commonly used for assemblies, assemblies themselves are not very common: once you have an assembled genome, you use that genome for future analysis rather than redoing the assembly.
- Since it is easily installed, it doesn't save people much work to install it for them.
- As we have seen in a few of our other tutorials, things installed in the BioITeam are subject to upkeep by others and can break when modules or other programs are installed.
First, navigate to the SPAdes home page http://cab.spbu.ru/software/spades/ and download the Linux binary distribution directly to TACC using wget. While you could put the file anywhere on lonestar (and can easily move it around on lonestar with the mv command once it is there), I suggest downloading the file to a 'src' folder on $WORK as this is a good habit to get into.
Note that idev nodes tend to download files from the internet much more slowly than the head node does. If you are already in an idev node, it is likely faster to log out of the idev node (with the 'logout' command), execute the wget command on the head node, and then start a new idev node after the download is complete.
Once the .tar.gz file is in the $WORK/src folder, you need to extract the files.
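The download and extraction steps could be sketched as two small shell functions. This is only a sketch under stated assumptions: the version number matches the SPAdes-3.13.0-Linux folder name used in this tutorial, the download URL pattern is an assumption that should be checked against the SPAdes home page, and the function and variable names are made up for illustration.

```shell
# Assumptions: version 3.13.0 and the URL pattern below. Confirm both on the
# SPAdes home page before relying on them.
SPADES_VERSION="3.13.0"
SPADES_TARBALL="SPAdes-${SPADES_VERSION}-Linux.tar.gz"
SPADES_URL="http://cab.spbu.ru/files/release${SPADES_VERSION}/${SPADES_TARBALL}"
SRC_DIR="${WORK:-$HOME}/src"   # falls back to $HOME/src if $WORK is not set

# Download the tarball into $SRC_DIR (best run on the head node, not an idev node).
download_spades() {
    mkdir -p "$SRC_DIR" && wget -P "$SRC_DIR" "$SPADES_URL"
}

# Extract the tarball next to itself, creating $SRC_DIR/SPAdes-<version>-Linux/.
extract_spades() {
    tar -xzf "$SRC_DIR/$SPADES_TARBALL" -C "$SRC_DIR"
}
```

Nothing happens until the functions are called, for example with `download_spades && extract_spades`.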
Now that the files have been extracted you have a choice in how to use them. One option is to copy the binary files to a location that is already in your path (such as the $HOME/local/bin directory we set up for you in your .bashrc file), and the second option is to add the $WORK/src/SPAdes-3.13.0-Linux/bin folder to your path. This is a personal preference and I do not know how prevalent either choice is among other researchers. My preference is to copy executables to known locations in the path rather than add a ton of different directories to my path, but others may feel differently. Below I present both options:
Doing both of the following may cause unintended effects in the future (particularly if you attempt to update the version of SPAdes you are using) and I do not recommend it.
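As a minimal sketch of the second option only (adding the bin folder to your path), assuming the archive was extracted to $WORK/src/SPAdes-3.13.0-Linux. The `SPADES_BIN` variable name is illustrative, and the fallback to $HOME is an assumption for systems where $WORK is not set.

```shell
# Folder holding spades.py and the related scripts (assumed extraction location).
SPADES_BIN="${WORK:-$HOME}/src/SPAdes-3.13.0-Linux/bin"

# Prepend it to PATH for the current session, but only if it is not already there.
case ":$PATH:" in
    *":$SPADES_BIN:"*) ;;                    # already on PATH, nothing to do
    *) export PATH="$SPADES_BIN:$PATH" ;;
esac

# To make the change permanent, a line like this could be added to ~/.bashrc:
#   export PATH="$WORK/src/SPAdes-3.13.0-Linux/bin:$PATH"
```

The membership check keeps the directory from being added to PATH repeatedly if the snippet is run more than once.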
If you have modified your PATH variable, you will need to log out of TACC and log back in before continuing.
Testing SPAdes installation
SPAdes comes with a self-test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data, as knowing that an error lies in the program rather than in your data makes troubleshooting very different.
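A hedged sketch of running the self-test, assuming spades.py is reachable through your PATH. The `--test` flag is the SPAdes self-test option; the `SPADES_TEST_STATUS` variable name is made up for illustration.

```shell
# Run the bundled self-test only if spades.py can be found on PATH.
if command -v spades.py >/dev/null 2>&1; then
    spades.py --test
    SPADES_TEST_STATUS=$?
else
    echo "spades.py not found on PATH; revisit the installation steps" >&2
    SPADES_TEST_STATUS=127
fi
echo "self-test exit status: $SPADES_TEST_STATUS"
```

An exit status of 0 is a quick first check, but the final lines printed to the screen are still worth reading.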
Assuming everything goes correctly, a large number of lines will pass by pretty quickly, with the last lines printed to the screen being:
The lines immediately above this text list the different output files and results from the assembly. They appear in all SPAdes runs and can be helpful for keeping track of where all your output ends up.
If the end of the SPAdes test gives different output, do not continue.
Get my attention on zoom and we'll figure out what is going on.
Set 1: Plasmid SPAdes
Elsewhere in this class we avoid the head node in order to be good TACC citizens and not hurt other people with the programs we run. Here there is an additional reason: assembly programs are exceptionally memory intensive, and attempting to run them on the head node may result in the program returning a memory error rather than usable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node.
Assembling even small bacterial genomes can be incredibly time intensive (as well as memory intensive, as highlighted above). Fortunately for this class, we can make use of the plasmid SPAdes option to assemble an even smaller plasmid genome that is ~2000 bp long in only a few minutes. I suggest analyzing this data on an idev node and then submitting the other data analysis for the bacterial genomes as a job to run overnight.
Download the paired end fastq files which have had their adapters trimmed from the $BI/gva_course/Assembly/ directory.
Now let's use SPAdes to assemble the reads. As always, it's a good idea to look at what options the program accepts using the -h option. SPAdes is written in Python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py; alternatively, these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively. For this tutorial let's use plasmidspades.py.
The first of the basic options is:
-o <output_dir> directory to store all the resulting files (required)
And we will need to supply the read files to the program. In this case we are looking for the following options:
-1 <filename> file with forward paired-end reads
-2 <filename> file with reverse paired-end reads
Once you have figured out what options you need to use, see if you can come up with a command to run on the paired-end reads and have the output go into a new directory called plasmid using all 48 threads that are available (-t 48). The following command is expected to take less than 2 minutes.
Remember to make sure you are on an idev node
For reasons discussed numerous times throughout the course already, please be sure you are on an idev node. Remember that the hostname command and showq -u can be used to check whether you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.
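One way the plasmid assembly command could look, using only the -o, -1, -2, and -t options described above. The read file names are hypothetical placeholders; substitute the names of the trimmed fastq files you copied from $BI/gva_course/Assembly/.

```shell
# Hypothetical file names: replace with the trimmed paired-end files you downloaded.
READS_1="plasmid_R1.fastq"
READS_2="plasmid_R2.fastq"
OUT_DIR="plasmid"     # output directory that SPAdes will create
THREADS=48            # all threads available on the idev node

# Only launch the assembly if the program and both read files can be found.
if command -v plasmidspades.py >/dev/null 2>&1 && [ -f "$READS_1" ] && [ -f "$READS_2" ]; then
    plasmidspades.py -t "$THREADS" -o "$OUT_DIR" -1 "$READS_1" -2 "$READS_2"
else
    echo "plasmidspades.py or the read files were not found; nothing was run" >&2
fi
```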
Evaluating the output
As you can see from listing the contents of the output 'plasmid' directory, several new files were generated. The two that I consider the most important are contigs.fasta and spades.log. The contigs.fasta file is the actual result: all of the different contigs that were created. For circular chromosomes (such as plasmids), the goal is a single contig, meaning that the reads were able to close the circle. The spades.log file has the information about the completed run, which you can use to compare different samples or conditions in the event that you are interested in optimizing the command options.
Looking at the contigs.fasta file can you answer the following questions (it is small enough to interrogate with cat or any other program)?
- How many contigs were generated?
Just 1 (it's a fasta file, so you can focus on the > symbols to identify each different contig that is present).
- How long is the contig?
- How deep is the coverage of this plasmid?
In this case the answer is ~180. This value can be particularly useful when you are trying to determine whether novel DNA is present as a multi-copy plasmid or as something that has inserted into the chromosome. If it is inserted, you would expect the coverage to be similar to that of the chromosome; if it is a plasmid, it could be significantly higher.
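The questions above can also be answered from the command line. The sketch below uses a made-up, one-contig stand-in for plasmid/contigs.fasta so that it runs anywhere. It assumes the contig names follow a NODE_<number>_length_<length>_cov_<coverage> layout, so check a real header before relying on the field positions.

```shell
# Toy stand-in for plasmid/contigs.fasta; the header and sequence are made up.
TOY_DIR="$(mktemp -d)"
cat > "$TOY_DIR/contigs.fasta" <<'EOF'
>NODE_1_length_2000_cov_180.5
ACGTACGTACGT
EOF

# Number of contigs = number of header lines.
N_CONTIGS=$(grep -c '^>' "$TOY_DIR/contigs.fasta")

# Split each header on "_" to pull out the node number (field 2),
# the length (field 4), and the coverage (field 6).
SUMMARY=$(grep '^>' "$TOY_DIR/contigs.fasta" | awk -F_ '{print $2, $4, $6}')

echo "contigs: $N_CONTIGS"
echo "node length coverage: $SUMMARY"
```

On real output, point the two grep commands at plasmid/contigs.fasta instead of the toy file.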
Set 2: Whole Genome Simulated Data
Here we will look at 4 sets of data with different library preparation conditions to evaluate how wet lab decisions influence outcomes on the computer.
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
There are 4 sets of simulated reads:
400, 3000, 1500
25 for each subset
20 for each subset
Number of Subsets
Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs. This is not something you will encounter very often if at all.
Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
Now let's use SPAdes to assemble the reads. As always, it's a good idea to look at what options the program accepts using the -h option. SPAdes is written in Python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py; alternatively, these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively.
The first of the basic options is:
-o <output_dir> directory to store all the resulting files (required)
And we will need to supply the read files to the program. In our case we are looking for the following options:
--12 <filename> file with interlaced forward and reverse paired-end reads
-s <filename> file with unpaired reads
In normal situations it would be more common to use -1 and -2 for the two paired-end read files rather than the --12 option, but as mentioned above this data is supplied to you as interleaved. Many or most programs will accept interleaved reads, but they require you to specify them differently.
Once you have figured out what options you need to use, see if you can come up with a command to run on the single-end reads and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.
If you are planning to run a job overnight, consider adding additional combinations of reads as individual commands to get an idea of how different insert sizes can play a role in final contig lengths. Just remember that insert length should always increase:
Put all of the commands into a file named spades_commands. Be sure to ask for help if you are unsure how to use nano to do this.
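If you would rather not use nano, the commands file can also be written from the shell. The sketch below is illustrative only: every read file name is a hypothetical placeholder, and treating the 400 bp library as paired-end and the 3000 bp library as mate-pair is an assumption. The --pe1-12 and --mp1-12 flags are SPAdes multi-library options for interleaved reads; confirm them with spades.py -h for your version.

```shell
# Hypothetical read file names; replace them with the simulated read files you downloaded.
# Assumption: the 400 bp library is paired-end and the 3000 bp library is mate-pair.
cat > spades_commands <<'EOF'
spades.py -t 48 -o single_end -s single_end.fastq
spades.py -t 48 -o paired_end_400 --12 paired_end_ins_400.fastq
spades.py -t 48 -o paired_end_400_mate_3000 --pe1-12 paired_end_ins_400.fastq --mp1-12 mate_pair_ins_3000.fastq
EOF

# One command per line, so the line count should match the number of assemblies planned.
N_COMMANDS=$(wc -l < spades_commands | tr -d ' ')
echo "spades_commands holds $N_COMMANDS commands"
```

In the third command the libraries are listed with the smaller insert size first, in keeping with the note that insert length should always increase.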
A warning on memory usage
SPAdes (and most/all other assemblers) usually takes large amounts of RAM to complete. Running these 3 commands on a single node at the same time will likely use more RAM than is available on that node, so it's necessary to run them sequentially or each on its own node. This should also underscore that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the "normal" queue.
Submitting the job
Once you have decided on the combinations you want to evaluate, use the 'wc -l' command to verify that your spades_commands file has as many commands as you expected, and then replace the ? in the following block with the output of that command.
Evaluating the output
Explore the output directory that was created for each set of reads you interrogated. The actionable information is in the contigs.fasta file, a fasta file ordered by contig length in decreasing order. The name of each contig lists the number of the contig (the largest contig is named NODE_1, the next largest NODE_2, and so on), followed by the length of the contig and the coverage (labeled as cov on the line). Generally, fewer total contigs and longer individual contigs are regarded as signs of a better assembly, but the number of chromosomes present in the organism is an important factor as well.
The grep command can be quite useful for isolating the names of the contigs along with this information, especially when combined with the -c option to count the total number of contigs, or when piping the results to head, tail, or both to isolate the top/bottom contigs.
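The grep, head, and tail combinations described above could look like the following. To keep the sketch runnable anywhere, it builds two toy output directories with made-up contig headers; on real data you would point the same commands at your own output directories.

```shell
# Toy output directories standing in for real SPAdes runs; all headers are made up.
DEMO="$(mktemp -d)"
mkdir -p "$DEMO/single_end" "$DEMO/paired_end"
printf '>NODE_1_length_900_cov_20.1\nACGT\n>NODE_2_length_500_cov_19.8\nACGT\n>NODE_3_length_200_cov_21.0\nACGT\n' > "$DEMO/single_end/contigs.fasta"
printf '>NODE_1_length_1600_cov_20.3\nACGT\n' > "$DEMO/paired_end/contigs.fasta"

# Count the contigs produced by every run.
for fasta in "$DEMO"/*/contigs.fasta; do
    echo "$(basename "$(dirname "$fasta")"): $(grep -c '^>' "$fasta") contigs"
done

# Largest and smallest contigs of one run (the file is already sorted by length).
LARGEST=$(grep '^>' "$DEMO/single_end/contigs.fasta" | head -n 1)
SMALLEST=$(grep '^>' "$DEMO/single_end/contigs.fasta" | tail -n 1)
echo "largest:  $LARGEST"
echo "smallest: $SMALLEST"
```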
If you ran multiple different combinations of reads for the simulated data, how did the insert size affect the number of contigs? The length of the largest contigs? Why might larger insert sizes not help things very much?
The length of repetitive elements in the genome plays a large role in how easily it can be assembled as large repeats need even larger insert sizes to be spanned by single read pairs.
The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with the "perfect" simulated data?
There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data.
What comes next when working with your own data?
- Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer.
- You can turn contigs.fasta into a BLAST database (with makeblastdb, depending on which version of BLAST you have) or try multiple sequence alignments through NCBI's BLAST.
- If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
- If you don't think the contigs you have are "good enough"
- Try using the SPAdes MismatchCorrector to see if you can improve the contigs you already have.
- Add additional sequencing libraries to try to connect some more contigs. Especially think about PacBio and Oxford Nanopore sequencing.
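For the BLAST route mentioned in the list above, a minimal sketch with the BLAST+ tools might look like this. The input and output file names are hypothetical, and the e-value cutoff is an arbitrary illustrative choice rather than a recommendation.

```shell
# Hypothetical inputs: an assembly from one of the runs above and a FASTA file of
# protein sequences you want to look for.
CONTIGS="single_end/contigs.fasta"
QUERY="proteins_of_interest.faa"
DB_NAME="contigs_db"

if command -v makeblastdb >/dev/null 2>&1 && [ -f "$CONTIGS" ] && [ -f "$QUERY" ]; then
    # Build a nucleotide BLAST database from the contigs.
    makeblastdb -in "$CONTIGS" -dbtype nucl -out "$DB_NAME"
    # Search the protein queries against the translated contigs; tabular output.
    tblastn -query "$QUERY" -db "$DB_NAME" -outfmt 6 -evalue 1e-10 > blast_hits.tsv
else
    echo "BLAST+ or the input files were not found; nothing was run" >&2
fi
```

tblastn is used here because the queries are proteins and the database is nucleotide; for nucleotide queries, blastn would be the analogous choice.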