Velvet is a De Bruijn graph assembler works fairly rapidly on short (microbial) genomes. In this tutorial we will use velvet to assemble an E. coli genome from simulated Illumina reads.
- Run velvet to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Table of Contents
First, let's copy the fastq read files.
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
There are 4 sets of simulated reads:
400, 3000, 1500
25 for each subset
20 for each subset
Number of Subsets
Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.
Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
Now let's use Velvet to assemble the reads.
First, you need to load Velvet via
Using velvet consists of a sequence of two commands:
velveth- analyzes kmers in the reads in preparation for assembly
velvetg- constructs the assembly and filters contigs from the graph
Look at the help for each program.
The <hash_length> parameter of
velveth is the kmer value that is key to the assembly process. Choosing it controls the tradeoff between sensitivity (lower hash_length, more reads included, longer contigs) and specificity (higher hash length, less chance of misassembly, more reads ignored, shorter contigs)). There is more discussion about choosing an appropriate kmer value in the Velvet manual and in this blog post.
Velvet has an option of specifying the insertion size of a paired read file (-ins_length). If no size is given, Velvet will guess the insertion size. We'll just have Velvet guess the size.
Velvet also has an option to specify the expected coverage of the genome, which helps it choose how to resolve repeated sequences (-exp_cov). We set this parameter to estimate this from the data. A common problem with using Velvet is that you have many very short contigs and the last line of output tells you that it used 0 of your reads. This is caused by not setting this parameter. The default is NOT auto.
We'll need to create a commands file and submit it to TACC. Let's make the commands file say:
launcher_creator.py to make a *.sge for the commands file and qsub it.
Use the correct "wayness"
Velvet and other assemblers usually take large amounts of RAM to complete. Running these 4 commands on a single node will use more RAM than is available on a single node so it's necessary to change the number of commands per node (wayness) from the default of 12 to 1. When you use
launcher_creator.py, you set the "wayness" (number of commands per node) using the
You can set the allotted time for this job to just 10 minutes.
If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue.
If you find yourself waiting a long time for the assembly process to run, you can also start an idev session and try running some of the
velvetg commands interactively. Each one takes a few minutes to complete.
Explore each output directory that was created by Velvet.
Checking the tail of the Log files in each of the output folders, we see lines like the following:
With better read pairs that link more distant locations in the genome, there are fewer contigs, and contigs are are longer, giving us a more complete picture of linkage across the genome.
The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with this "perfect" data?
- Sometimes errors in reads lead to dead-ends in the graphs that are trimmed when they should not be.
- There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data.
More assembly statistics: contig_stats.pl
The output file stats.txt contains information about every contig in the assembly, but it isn't sorted and can be a bit overwhelming.
You can generate some summary stats and graphs about each assembly using the
contig_stats.pl script that we have installed under
$BI/bin. Try figuring out how to do this on your own. You probably want to change into the directory of results for a specific assembly before running this command, since it generates several output files in the current working directory. You will need to copy PNG output back to your computer to view it.
What do I do now?
- Get a better assembly: maybe add a different library size, or go into a detailed genome completion project (commonly called "finishing") using a sequence assembly editor like
AMOS. (Be careful though, the amount of data in NGS data sets can be very difficult for these programs to deal with, since many were designed for Sanger sequencing reads.) You may have a lot of PCR products to make to close gaps and/or to order and orient scaffolds.
consedin particular makes this pretty easy, but it may still consume a lot more time and money than the initial shotgun assembly. You can identify some misassemblies by mapping the original reads to the assembly and then viewing them in IGV to look for discordant mate pairs, for example.
- Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Most assemblers will be able to take 2x100 data and give you full gene sequences since these are non-repetitive and so assemble well. You can turn the contigs.fa into a blast database (
makeblastdbdepending on which version of blast you have) and start blasting away.
Exercise: Finding proteins in the assembly
Change into one of the Velvet result directories for the set of contigs that you want to analyze and load the blast module.
Find your favorite bacterial protein and see if it exists within the assembly.
Try NCBI's "Protein" database - search for Escherichia.
Once you have some query sequences you'd like to find in your assembly, turn your assembly into a local blast database and search for them:
Other paths from here:
- Predict genes/annotate the genome with de novo gene prediction tool like
maker, or one of the online gene prediction tools available at NCBI or JCVI.
- Use a variety of assembly evaluation tools (a rapidly growing field) - we'll talk about that more on the next page.
Advanced Exercise: A5 Haploid Microbial Genome Assembler
Another approach to try for haploid microbial genomes is the A5 pipeline.
It includes several additional quality control steps (like filtering adaptor contamination and low quality score portions of reads) that make sense when using a De Bruijn graph assembler and is optimized for haploid genomes. They're described in this paper.
Install the a5 pipeline by downloading from here and either 1) moving all of the items in the
bin directory into your local bin directory or adding the location of the
a5/bin directory to your
For more information, read the directions.