SPAdes is a de Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though as Oxford Nanopore lowers its error rate, and as tools that combine its long reads with Illumina short reads mature, the difficulty falls while the accuracy increases). Genome assembly should only be used:
- When you cannot find a reference genome that is close to your own.
- If you are engaged in metagenomic projects where you don't know what organisms may be present.
  - There are other tools that can be useful for this type of work.
- In situations where you believe you may have novel sequence insertions into a genome of interest.
  - Note, however, that in this case you might actually want to assemble only the reads that do not map to your reference genome (and their mates, in the case of paired-end and mate-pair sequencing) rather than the full fastq files you get from the raw sequencing.
A note about read preprocessing
While not explicitly covered here, the presence of adapter sequences on reads can significantly complicate assembly and decrease its accuracy. If using this tutorial on your own samples, make sure you are working with the best data possible: in this case, reads lacking adapters and with the largest insert sizes possible.
- Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Genome assembly is an important part of an analysis, but it builds a reference file that will then be used many times, so it makes more sense to install SPAdes in its own environment. Other tools that could share that environment are read preprocessing tools, in particular adapter removal tools such as fastp. This supports the suggestion made in the fastp tutorial that, if environments are to be grouped together based on task, read preprocessing is a good candidate for its own environment.
Testing SPAdes installation
SPAdes comes with a self test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available before jumping straight into your own data: knowing that an error comes from the program rather than from your data makes troubleshooting very different.
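The self test can be launched as sketched below. This assumes that spades.py is on your PATH once the environment holding SPAdes is active; the guard is an addition for illustration and simply reports a missing installation instead of failing.

```shell
# Assumption: the environment containing SPAdes has already been activated.
if command -v spades.py >/dev/null 2>&1; then
    # --test runs SPAdes on the small data set that ships with the program
    spades.py --test
    selftest="ran"
else
    echo "spades.py was not found on PATH; activate your assembly environment first" >&2
    selftest="not_installed"
fi
echo "self test status: $selftest"
```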
Assuming everything goes correctly, a large number of lines will scroll past pretty quickly, and the last lines printed to the screen should be:
The lines immediately above this text list the different output files and results from the assembly. This listing appears at the end of every SPAdes run and can be helpful for keeping track of where all your output ends up. And then a version response of:
Since we didn't set any options, and only ran the prepackaged tests, ignoring the warning seems highly reasonable. If we got a similar warning with our own samples, rerunning the analysis and comparing the 2 results would be a good use of our time.
If the end of the SPAdes test gives different output, do not continue.
Get my attention on Zoom and we'll figure out what is going on.
Set 1: Plasmid SPAdes
At other times in the class, our concern about the head node has been being good TACC citizens and not hurting other people with the programs we run. Assembly adds a second reason: assembly programs are exceptionally memory intensive, and attempting to run them on the head node may result in the program returning a memory error rather than usable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node. If you still run into memory problems, consider moving to the 'large-mem' queue, which has more memory than the 'normal' queue, and also downsampling your data.
Assembling even small bacterial genomes can be incredibly time intensive (as well as memory intensive, as highlighted above). Fortunately for this class, we can make use of the plasmidSPAdes option to assemble an even smaller plasmid genome that is ~2000 bp long in only a few minutes. I suggest analyzing this data on an idev node and then submitting the other data analysis for the bacterial genomes as a job to run overnight.
Download the paired end fastq files which have had their adapters trimmed from the $BI/gva_course/Assembly/ directory.
Now let's use SPAdes to assemble the reads. As always, it's a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in Python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py, or these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively. For this tutorial let's use plasmidspades.py.
The first option listed under the basic options is:
-o <output_dir> directory to store all the resulting files (required)
And we will need to supply the read files to the program. In this case we are looking for the following options:
-1 <filename> file with forward paired-end reads
-2 <filename> file with reverse paired-end reads
Once you have figured out what options you need to use, see if you can come up with a command to run on the paired-end reads and have the output go into a new directory called plasmid using all 68 cores that are available on your idev node (-t 68). The following command is expected to take less than 2 minutes.
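One possible shape for the command is sketched below. The two read file names are hypothetical placeholders, since the real names are whatever you copied from $BI/gva_course/Assembly/. The sketch only builds and prints the command so that you can check it before running it on an idev node.

```shell
# Hypothetical file names: replace them with the fastq files you downloaded.
r1="plasmid_R1.fastq"
r2="plasmid_R2.fastq"

# -1/-2 give the forward and reverse reads, -o the output directory,
# and -t the number of threads.
plasmid_cmd="plasmidspades.py -1 $r1 -2 $r2 -o plasmid -t 68"

# Print the command for review; paste it at the prompt on an idev node to run it.
echo "$plasmid_cmd"
```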
Remember to make sure you are on an idev node
For reasons discussed numerous times throughout the course already, please be sure you are on an idev node. Remember that the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.
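A quick way to act on the hostname check is sketched below. It assumes that login node names start with "login" and treats everything else as a compute node; that naming pattern is an assumption for illustration, so adjust it to match your cluster.

```shell
# Assumption: login node hostnames start with "login"; any other name is
# treated as a compute node. Adjust the pattern for your cluster.
node_name=$(hostname 2>/dev/null || uname -n)
case "$node_name" in
    login*) node_type="login" ;;
    *)      node_type="compute" ;;
esac
echo "$node_name looks like a $node_type node"
```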
Evaluating the output
As you can see from listing the contents of the output 'plasmid' directory, several new files have been generated. There are two files that I consider to be the most important:
1. contigs.fasta, as this is the actual result: all the different contigs that were created. For circular chromosomes (such as plasmids) the goal is a single contig, meaning that the reads were able to close the circle.
2. spades.log, as it has the information about the completed run that you can use to compare different samples or conditions in the event that you are interested in trying to optimize the command options, as would likely be the case if you were trying to assemble the best reference possible.
Interestingly, the spades.log file is equivalent to what you would get if you had redirected the error and screen printing to a log file yourself (i.e. using &> as was done in the fastp tutorial).
Looking at the contigs.fasta file can you answer the following questions? (it is small enough to interrogate with cat or any other program)
- How many contigs were generated?
Just 1 (it's a FASTA file, so you focus on the > symbols to identify each different contig that is present).
- How long is the contig?
- How deep is the coverage of this plasmid?
In this case the answer is ~57. This value can be particularly useful when you are trying to determine if novel DNA is present as a multi-copy plasmid or as something that has inserted into the chromosome. If it is inserted, you would expect the coverage to be similar to that of the chromosome; if it is a plasmid, the coverage could be significantly higher.
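The three questions above can also be answered straight from the header lines. The sketch below builds a toy contigs.fasta with made-up values so that it runs anywhere; the only thing it relies on is the NODE_<number>_length_<length>_cov_<coverage> header naming that SPAdes uses.

```shell
# Toy contigs.fasta with made-up values so the sketch runs anywhere;
# point "contigs" at plasmid/contigs.fasta to inspect your real result.
workdir=$(mktemp -d)
contigs="$workdir/contigs.fasta"
cat > "$contigs" <<'EOF'
>NODE_1_length_24_cov_57.250000
ACGTACGTACGTACGTACGTACGT
EOF

# Each contig starts with one ">" header line, so counting headers counts contigs.
n_contigs=$(grep -c '^>' "$contigs")
echo "contigs: $n_contigs"

# Split the header on "_": field 4 is the length, field 6 the coverage.
summary=$(grep '^>' "$contigs" | awk -F_ '{print "length=" $4, "coverage=" $6}')
echo "$summary"
```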
Visualizing the assembly
Another file that may be of interest, especially if you are going to try to improve the assembly manually or with a targeted approach, is assembly_graph.fastg. I would recommend opening this file with the Bandage program (https://rrwick.github.io/Bandage/). It is lightweight and easily installed on all systems, and while it is pretty intuitive it also has robust documentation (https://github.com/rrwick/Bandage/wiki). Viewing this plasmid in Bandage will effectively just show you a circle, as it is completely closed. The good news is that Bandage is powerful enough to support larger genomes, which may be of help or interest in the simulated data set.
Set 2: Whole Genome Simulated Data
Here we will look at 4 sets of data with different library preparation conditions to evaluate how wet lab decisions influence outcomes on the computer. Some of the text here is very similar or identical to that in set 1 in case people choose to skip directly to it.
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
There are 4 sets of simulated reads:
[Table of the 4 read sets; surviving entries: 400, 3000, 1500 | 25 for each subset | 20 for each subset | Number of Subsets]
Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs. This is not something you will encounter very often if at all.
And your expected output is:
Notice how the pairs of reads are denoted by the /1 and /2 at the end of the first line in the 4 line fastq block. More often (and everywhere else in this course) your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
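If a tool only accepts separate files, an interleaved file can be split by sending the first four lines of every eight-line block to one file and the last four to another. The sketch below uses a two-pair toy file with made-up read names and sequences, and assumes every record is exactly four lines long (no wrapped sequences).

```shell
workdir=$(mktemp -d)
# Toy interleaved file: two read pairs, each read stored as a 4-line record.
cat > "$workdir/interleaved.fastq" <<'EOF'
@read1/1
ACGT
+
IIII
@read1/2
TGCA
+
IIII
@read2/1
GGCC
+
IIII
@read2/2
CCGG
+
IIII
EOF

# Lines 1-4 of every 8-line block are the /1 read, lines 5-8 the /2 read.
awk '(NR - 1) % 8 < 4'  "$workdir/interleaved.fastq" > "$workdir/reads_1.fastq"
awk '(NR - 1) % 8 >= 4' "$workdir/interleaved.fastq" > "$workdir/reads_2.fastq"

# The first header of each file should end in /1 and /2 respectively.
first_r1=$(head -n 1 "$workdir/reads_1.fastq")
first_r2=$(head -n 1 "$workdir/reads_2.fastq")
echo "$first_r1 $first_r2"
```

Both output files end up with the same number of reads in the same order, which is the "separate" layout described above.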
Now let's use SPAdes to assemble the reads. As always, it's a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in Python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py, or these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively.
The first option listed under the basic options is:
-o <output_dir> directory to store all the resulting files (required)
And we will need to supply the read files to the program. In our case we are looking for the following options:
--12 <filename> file with interlaced forward and reverse paired-end reads
-s <filename> file with unpaired reads
It would be more common for us to be using -1 and -2 for each of the paired-end read files in normal situations rather than the --12 option, but as mentioned above this data is supplied to you as interleaved, which many/most programs will accept but require you to specify differently.
Once you have figured out what options you need to use, see if you can come up with a command to run on the single-end reads and have the output go into a new directory called single_end using all 68 threads that are available (-t 68).
Consider adding a few more commands to show the effect of increasing fragment size, and be sure to give them their own output name:
Put all 4 of the commands into a file named spades_commands. Be sure to ask for help if you are unsure how to use nano to do this.
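As an alternative to typing the commands in nano, the file can be written in one step, as sketched below. All four read file names are hypothetical, so substitute the simulated files you actually downloaded. Treating every paired library as ordinary interleaved paired-end input (--12) is also an assumption; mate-paired libraries may need to be passed to SPAdes differently.

```shell
# Hypothetical read file names; substitute the simulated files you downloaded.
cat > spades_commands <<'EOF'
spades.py -t 68 -o single_end -s reads_single_end.fastq
spades.py -t 68 -o paired_end_400 --12 reads_insert_400.fastq
spades.py -t 68 -o paired_end_1500 --12 reads_insert_1500.fastq
spades.py -t 68 -o paired_end_3000 --12 reads_insert_3000.fastq
EOF

# One command per line, so the line count should equal the number of commands.
n_commands=$(wc -l < spades_commands)
echo "commands in file: $n_commands"
```

Each command writes to its own output directory, so the four assemblies cannot overwrite one another.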
A warning on memory usage
SPAdes (and most/all other assemblers) usually take large amounts of RAM to complete. Running these 4 commands on a single node at the same time will likely use more RAM than is available on a single node, so it's necessary to run them sequentially or each on its own node. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the "normal" queue and may need to downsample your data.
Submitting the job
Once you have decided on the combinations you want to evaluate, use the 'wc -l' command to verify that your spades_commands file has 4 commands as you expect.
As we have seen in other tutorials involving the job queue system, we need a slurm file and need to modify it according to what we are actually trying to run.
Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is present to help remind you what goes on each line; leaving it alone will not hurt anything, while removing it may make it more difficult for you to remember what the purpose of the line is.
|Line number|As is|To be|
| |#SBATCH -J jobName|#SBATCH -J spades|
| |#SBATCH -n 1|#SBATCH -n 4|
| |#SBATCH -N 1|#SBATCH -N 4|
| |#SBATCH -t 12:00:00|#SBATCH -t 4:30:00|
The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.
Again, use ctrl-o and ctrl-x to save the file and exit.
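The same four changes can also be made without opening an editor. The sketch below applies them with sed to a toy stand-in for the template; the file names template.slurm and spades.slurm and the comment text on each line are hypothetical. Only the start of each line is rewritten, so the reminder comments to the right are preserved.

```shell
workdir=$(mktemp -d)
# Toy stand-in for the four template lines that need changing;
# the trailing comments mimic the reminders in the real file.
cat > "$workdir/template.slurm" <<'EOF'
#SBATCH -J jobName       # job name
#SBATCH -n 1             # total number of tasks
#SBATCH -N 1             # number of nodes
#SBATCH -t 12:00:00      # run time (hh:mm:ss)
EOF

# Rewrite only the value on each line; the comment text to the right is kept.
sed -e 's/^#SBATCH -J jobName/#SBATCH -J spades/' \
    -e 's/^#SBATCH -n 1 /#SBATCH -n 4 /' \
    -e 's/^#SBATCH -N 1 /#SBATCH -N 4 /' \
    -e 's/^#SBATCH -t 12:00:00/#SBATCH -t 4:30:00/' \
    "$workdir/template.slurm" > "$workdir/spades.slurm"

cat "$workdir/spades.slurm"
```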
Evaluating the output
Explore the output directory that was created for each set of reads you interrogated. The actionable information is in the contigs.fasta file. The contig file is a FASTA file ordered by contig length, in decreasing order. The name of each contig lists the number of the contig (the largest contig is named NODE_1, the next largest NODE_2, and so on) followed by the length of the contig and the coverage (labeled as cov on the line). Generally, assemblies with fewer total contigs and longer individual contigs are regarded as better, but the number of chromosomes present in the organism is an important factor as well.
The grep command can be quite useful for isolating the names of the contigs with the information, especially when combined with the -c option to count the total number of contigs, or piping the results to head/tail or both head and tail to isolate the top/bottom contigs.
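A minimal sketch of those grep combinations is shown below on a toy three-contig file with made-up values; replace the path with the contigs.fasta from one of your output directories. Summing the length field of the headers is an extra step added here for illustration, since it gives the total assembled length.

```shell
workdir=$(mktemp -d)
contigs="$workdir/contigs.fasta"
# Toy assembly with made-up contigs, ordered longest first as SPAdes does.
cat > "$contigs" <<'EOF'
>NODE_1_length_12_cov_21.500000
ACGTACGTACGT
>NODE_2_length_8_cov_19.000000
ACGTACGT
>NODE_3_length_4_cov_20.250000
ACGT
EOF

n_contigs=$(grep -c '^>' "$contigs")           # total number of contigs
largest=$(grep '^>' "$contigs" | head -n 1)    # first header = longest contig
smallest=$(grep '^>' "$contigs" | tail -n 1)   # last header = shortest contig
# Sum the length field (4th "_"-separated field) over all headers.
total_length=$(grep '^>' "$contigs" | awk -F_ '{sum += $4} END {print sum}')

echo "contigs=$n_contigs total_length=$total_length"
echo "largest:  $largest"
echo "smallest: $smallest"
```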
Since you ran multiple different combinations of reads for the simulated data, how did the insert size affect the number of contigs? The length of the largest contigs? Why might larger insert sizes not help things very much?
The length of repetitive elements in the genome plays a large role in how easily it can be assembled as large repeats need even larger insert sizes to be spanned by single read pairs.
The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with the "perfect" simulated data?
There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data. For bacteria there is an interesting observation that the majority of chromosomes require fragments of ~7kb to be fully closed.
Visualizing the assembly
What comes next when working with your own data?
- Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer.
- You can turn the contigs.fasta into a BLAST database (makeblastdb, depending on which version of BLAST you have) or try multiple sequence alignments through NCBI's BLAST.
- If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
- If you don't think the contigs you have are "good enough":
  - Verify you have trimmed your reads to the best they can be using FastQC, MultiQC, and fastp.
  - Try using SPAdes MismatchCorrector to see if you can improve the contigs you already have.
  - Add additional sequencing libraries to try to connect some more contigs. Especially think about PacBio and Oxford Nanopore sequencing.
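For the BLAST database suggestion in the list above, a minimal sketch is shown below. It assumes the BLAST+ makeblastdb program is installed; the toy contig and the database name contigs_db are made up for illustration, and the guard only reports a missing installation.

```shell
workdir=$(mktemp -d)
# Toy contig so the sketch is self-contained; use your real contigs.fasta instead.
cat > "$workdir/contigs.fasta" <<'EOF'
>NODE_1_length_24_cov_57.250000
ACGTACGTACGTACGTACGTACGT
EOF

if command -v makeblastdb >/dev/null 2>&1; then
    # -dbtype nucl because contigs are nucleotide sequences
    if makeblastdb -in "$workdir/contigs.fasta" -dbtype nucl \
                   -out "$workdir/contigs_db" >/dev/null; then
        blast_db="built"
    else
        blast_db="failed"
    fi
else
    echo "makeblastdb not found on PATH; install BLAST+ first" >&2
    blast_db="skipped"
fi
echo "BLAST database: $blast_db"
```

A protein of interest could then be searched against the database with tblastn, for example tblastn -query my_protein.fasta -db contigs_db -outfmt 6, where my_protein.fasta is a hypothetical file holding your query.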