Page History

...

After raw sequence files are generated (in FASTQ format), quality-checked, and pre-processed in some way, the next step in most NGS pipelines is mapping to a reference genome. For individual sequences, it is common to use a tool like BLAST to identify genes or species of origin. However, a normal NGS dataset will have tens to hundreds of millions of sequences, which BLAST and similar tools are not designed to handle.

...

In this section, we will primarily focus on two of the most versatile mappers: BWA and Bowtie2, the latter being part of the Tuxedo suite (e.g. transcriptome-aware Tophat2) which also includes tools for manipulating NGS data after alignment.

Connect to

...

login5.

...

ls5.tacc.utexas.edu

This should be second nature by now

...

We first moved ourselves into the fastq directory, then copied over all files that end in "fastq.gz" in the corral directory specified in the second line. These are descriptions of the five files we copied:

File Name	Description	Sample
Sample_Yeast_L005_R1.cat.fastq.gz	Paired-end Illumina, First of pair, FASTQ	Yeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gz	Paired-end Illumina, Second of pair, FASTQ	Yeast ChIP-seq
human_rnaseq.fastq.gz	Paired-end Illumina, First of pair only, FASTQ	Human RNA-seq
human_mirnaseq.fastq.gz	Single-end Illumina, FASTQ	Human microRNA-seq
cholera_rnaseq.fastq.gz	Single-end Illumina, FASTQ	V. cholerae RNA-seq

Reference Genomes

Before we get to alignment, we need a genome to align to. We will use four different references here:

...

These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes):

Reference	Species	Base Length	Contig Number	Source	Download
hg19	Human	3.1 Gbp	25 (really 93)	UCSC	UCSC GoldenPath
sacCer3	Yeast	12.2 Mbp	17	UCSC	UCSC GoldenPath
mirbase V20	Human	160 Kbp	1908	Mirbase	Mirbase Downloads
vibCho (O395)	V. cholerae	~4 Mbp	2	GenBank	GenBank Downloads

Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So aligners require that references first be indexed to accelerate later retrieval. The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the sequence alignment, and subsequent alignments are reported using coordinates referencing the original FASTA reference files.

...

There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things.

Option	Effect
-l	Controls the length of the seed (default = 32)
-k	Controls the number of mismatches allowable in the seed of each alignment (default = 2)
-n	Controls the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04)
-t	Controls the number of threads

The rest of the options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on. Much more information can be accessed at the BWA manual page.

...

These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes):

Building the bowtie2 mirbase index

...

Now, we're ready to actually try to do the alignment. Remember, unlike BWA, we actually need to set some options depending on what we're after. Some of the important options for bowtie2 are:

Option	Effect
--end-to-end or --local	Controls whether the entire read must align to the reference, or whether soft-clipping the ends is allowed to find internal alignments. Default --end-to-end
-L	Controls the length of seed substrings generated from each read (default = 22)
-N	Controls the number of mismatches allowable in the seed of each alignment (default = 0)
-i	Interval between extracted seeds. Default is a function of read length and alignment mode.
--score-min	Minimum alignment score for reporting alignments. Default is a function of read length and alignment mode.

To decide how we want to go about doing our alignment, check out the file we're aligning with less:

...

Expand

title	If our reads were longer

Because these are short reads we do not have to adjust parameters like inter-seed distance (-i) or minimum alignment score (--min-score) that are a function of read length. If we were processing longer reads, we might need to use parameters like this, to force bowtie2 to "pretend" the read is a short, constant length:

-i C,1,0
--score-min=C,32,0

Yes, that looks complicated, and it kind of is. It's basically saying "slide the seed down the read one base at a time", and "report alignments as long as they have a minimum alignment score of 32 (16 matching bases x 2 points per match, minimum).

See the bowtie2 manual (after you have had a good stiff drink) for a full explanation.

...

Let's make a link to the mirbase index directory to make our command line simpler:

...

bwa mem was made available when we loaded the bwa module, so take a look at its usage information. The most important parameters, similar to those we've manipulated in the past two sections, are the following:

Option	Effect
-k	Controls the minimum seed length (default = 19)
-w	Controls the "gap bandwidth", or the length of a maximum gap. This is particularly relevant for MEM, since it can determine whether a read is split into two separate alignments or is reported as one long alignment with a long gap in the middle (default = 100)
-r	Controls how long an alignment must be relative to its seed before it is re-seeded to try to find a best-fit local match (default = 1.5, e.g. the value of -k multiplied by 1.5)
-c	Controls how many matches a MEM must have in the genome before it is discarded (default = 10000)
-t	Controls the number of threads to use

There are many more parameters to control the scoring scheme and other details, but these are the most essential ones to use to get anything of value at all.

...

Page tree

Versions Compared

Old Version 125

New Version 126

Key

Connect to

login5.

ls5.tacc.utexas.edu

Reference Genomes

Building the bowtie2 mirbase index