Genome Assembly

First, some background

De novo assembly is the construction of a genome sequence from reads without using a reference genome. Assembling reads with the help of a reference genome is called mapping assembly.

This paper is an excellent review of the theory and practice of NGS assemblers as of 2010. Read lengths will continue to get longer, error rates lower, coverage higher, but the basic concepts embodied in that paper will probably remain useful for several more years.

The figures embedded in this wiki page for educational purposes are from that paper.

First, we need to discuss the two basic assembler types: overlap graph and de Bruijn graph.

In either case, more and longer reads are better. With an overlap graph (also called the overlap-layout-consensus or overlap-layout algorithm), your assembly improves substantially with longer reads, and there are few parameters you can tweak. With a de Bruijn approach, your choice of k can have a strong impact on your assembly.
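To make the role of k concrete, here is a minimal sketch of how a de Bruijn graph can be built from reads. The function names and the toy reads are made up for illustration, and the sketch ignores sequencing errors and reverse complements, so it is not how Velvet or any other real assembler is implemented. Each k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases; nodes with more than one successor are places where the assembler cannot tell which path to follow.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_branching_nodes(graph):
    """Count nodes with more than one successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Hypothetical toy reads, chosen only to have something to run.
reads = ["ATGGCGTGCA", "GCGTGCAATG", "GTGCAATGGC"]
for k in (3, 5, 7):
    graph = de_bruijn_edges(reads, k)
    print(f"k={k}: {len(graph)} nodes, {count_branching_nodes(graph)} branching")
```

Running this with different values of k on your own reads shows the trade-off: a small k makes short repeats collapse into branching nodes, while a large k needs longer overlaps between reads to keep the graph connected.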

Effect of trade-off in read length and coverage
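When thinking about this trade-off, it helps to be able to convert between read count, read length, and average coverage. The sketch below uses the usual definition of average coverage (total sequenced bases divided by genome size). The function names and the example numbers are hypothetical and are not taken from the review paper.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 5 Mb bacterial genome sequenced with 100 bp reads.
genome_size = 5_000_000
print(expected_coverage(1_000_000, 100, genome_size))
print(reads_needed(20, 50, genome_size))
```

Halving the read length doubles the number of reads needed for the same average coverage, but as the figure shows, equal coverage does not mean an equally good assembly.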

k-mer distributions inherent in select genomes
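A k-mer distribution of this kind can be computed by counting how often each k-mer occurs and then tallying how many distinct k-mers share each count. The following is a minimal sketch with hypothetical function names. It works on a single in-memory string, which is fine for a toy example but not for a real genome, where a dedicated k-mer counter would be used.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every k-mer in the sequence (overlapping windows)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_spectrum(sequence, k):
    """Map each occurrence count to the number of distinct k-mers with that count."""
    return Counter(kmer_counts(sequence, k).values())


# Hypothetical toy sequence containing a short tandem repeat.
sequence = "ACGTACGTACGTTTGACCA"
for k in (2, 4, 6):
    spectrum = kmer_spectrum(sequence, k)
    print(f"k={k}: {dict(sorted(spectrum.items()))}")
```

k-mers that occur many times point to repeats, which is what makes a genome harder to assemble at a given k.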

Some example assembly statistics
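One statistic that is very commonly reported for assemblies is N50: the contig length such that contigs of that length or longer contain at least half of all assembled bases. Whether the examples here report N50 is an assumption, and the function name and contig lengths below are hypothetical. This is only a sketch of how the number is defined.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    return 0  # no contigs at all


# Hypothetical contig lengths (in bp) from an imaginary assembly.
contigs = [100, 80, 50, 30, 20, 10]
print(n50(contigs))
```

A higher N50 means the assembly is in fewer, longer pieces, but it says nothing about whether those pieces are correct.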

Many (many) assemblers are available. A list of assemblers can be found here.

We'll take a look at Velvet. It's a fast and easy-to-use de Bruijn assembler.

OK, let's try an exercise on the next wiki page.