Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

After raw sequence files are generated (in FASTQ format), quality-checked, and pre-processed in some way, the next step in most NGS pipelines is mapping to a reference genome. For individual sequences, it is common to use a tool like BLAST to identify genes or species of origin. However, a normal NGS dataset will have tens to hundreds of millions of sequences, which BLAST and similar tools are not designed to handle.

...

In this section, we will primarily focus on two of the most versatile mappers: BWA and Bowtie2, the latter being part of the Tuxedo suite (e.g. transcriptome-aware Tophat2) which also includes tools for manipulating NGS data after alignment.

Connect to

...

login5.

...

ls5.tacc.utexas.edu

This should be second nature by now (smile)

...

We first moved ourselves into the fastq directory, then copied over all files that end in "fastq.gz" in the corral directory specified in the second line.  These are descriptions of the five files we copied:

File NameDescriptionSample
Sample_Yeast_L005_R1.cat.fastq.gzPaired-end Illumina, First of pair, FASTQYeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gzPaired-end Illumina, Second of pair, FASTQYeast ChIP-seq
human_rnaseq.fastq.gzPaired-end Illumina, First of pair only, FASTQHuman RNA-seq
human_mirnaseq.fastq.gzSingle-end Illumina, FASTQHuman microRNA-seq
cholera_rnaseq.fastq.gzSingle-end Illumina, FASTQV. cholerae RNA-seq

Reference Genomes

Before we get to alignment, we need a genome to align to.  We will use four different references here: 

...

These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes):

ReferenceSpeciesBase LengthContig NumberSourceDownload
hg19Human3.1 Gbp25 (really 93)UCSCUCSC GoldenPath
sacCer3Yeast12.2 Mbp17UCSCUCSC GoldenPath
mirbase V20Human160 Kbp1908MirbaseMirbase Downloads
vibCho (O395)V. cholerae~4 Mbp2GenBankGenBank Downloads

Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence.  So aligners require that references first be indexed to accelerate later retrieval.  The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the sequence alignment, and subsequent alignments are reported using coordinates referencing the original FASTA reference files.

...

There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things. 

OptionEffect
-lControls the length of the seed (default = 32)
-kControls the number of mismatches allowable in the seed of each alignment (default = 2)
-nControls the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04)
-tControls the number of threads

The rest of the options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on.  Much more information can be accessed at the BWA manual page.

...

These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes): 


Building the bowtie2 mirbase index

...

Now, we're ready to actually try to do the alignment.  Remember, unlike BWA, we actually need to set some options depending on what we're after. Some of the important options for bowtie2 are:

OptionEffect
--end-to-end or --localControls whether the entire read must align to the reference, or whether soft-clipping the ends is allowed to find internal alignments. Default --end-to-end
-LControls the length of seed substrings generated from each read (default = 22)
-NControls the number of mismatches allowable in the seed of each alignment (default = 0)
-iInterval between extracted seeds. Default is a function of read length and alignment mode.
--score-minMinimum alignment score for reporting alignments. Default is a function of read length and alignment mode.

To decide how we want to go about doing our alignment, check out the file we're aligning with less:

...

Expand
titleIf our reads were longer

Because these are short reads we do not have to adjust parameters like inter-seed distance (-i) or minimum alignment score (--min-score) that are a function of read length. If we were processing longer reads, we might need to use parameters like this, to force bowtie2 to "pretend" the read is a short, constant length:

-i C,1,0
--score-min=C,32,0

Yes, that looks complicated, and it kind of is. It's basically saying "slide the seed down the read one base at a time", and "report alignments as long as they have a minimum alignment score of 32 (16 matching bases x 2 points per match, minimum).

See the bowtie2 manual (after you have had a good stiff drink) for a full explanation.

...


Let's make a link to the mirbase index directory to make our command line simpler:

...

bwa mem was made available when we loaded the bwa module, so take a look at its usage information. The most important parameters, similar to those we've manipulated in the past two sections, are the following:

OptionEffect
-kControls the minimum seed length (default = 19)
-wControls the "gap bandwidth", or the length of a maximum gap. This is particularly relevant for MEM, since it can determine whether a read is split into two separate alignments or is reported as one long alignment with a long gap in the middle (default = 100)
-rControls how long an alignment must be relative to its seed before it is re-seeded to try to find a best-fit local match (default = 1.5, e.g. the value of -k multiplied by 1.5)
-cControls how many matches a MEM must have in the genome before it is discarded (default = 10000)
-tControls the number of threads to use

There are many more parameters to control the scoring scheme and other details, but these are the most essential ones to use to get anything of value at all.

...