...
After raw sequence files are generated (in FASTQ format), quality-checked, and pre-processed in some way, the next step in most NGS pipelines is mapping to a reference genome. For individual sequences, it is common to use a tool like BLAST to identify genes or species of origin. However, a normal NGS dataset will have tens to hundreds of millions of sequences, which BLAST and similar tools are not designed to handle.
...
In this section, we will primarily focus on two of the most versatile mappers: BWA and Bowtie2, the latter being part of the Tuxedo suite (e.g. transcriptome-aware Tophat2) which also includes tools for manipulating NGS data after alignment.
Connect to
...
login5.
...
ls5.tacc.utexas.edu
This should be second nature by now
...
We first moved ourselves into the fastq directory, then copied over all files that end in "fastq.gz" in the corral directory specified in the second line. These are descriptions of the five files we copied:
File Name | Description | Sample |
---|---|---|
Sample_Yeast_L005_R1.cat.fastq.gz | Paired-end Illumina, First of pair, FASTQ | Yeast ChIP-seq |
Sample_Yeast_L005_R2.cat.fastq.gz | Paired-end Illumina, Second of pair, FASTQ | Yeast ChIP-seq |
human_rnaseq.fastq.gz | Paired-end Illumina, First of pair only, FASTQ | Human RNA-seq |
human_mirnaseq.fastq.gz | Single-end Illumina, FASTQ | Human microRNA-seq |
cholera_rnaseq.fastq.gz | Single-end Illumina, FASTQ | V. cholerae RNA-seq |
Reference Genomes
Before we get to alignment, we need a genome to align to. We will use four different references here:
...
These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes):
Reference | Species | Base Length | Contig Number | Source | Download |
---|---|---|---|---|---|
hg19 | Human | 3.1 Gbp | 25 (really 93) | UCSC | UCSC GoldenPath |
sacCer3 | Yeast | 12.2 Mbp | 17 | UCSC | UCSC GoldenPath |
mirbase V20 | Human | 160 Kbp | 1908 | Mirbase | Mirbase Downloads |
vibCho (O395) | V. cholerae | ~4 Mbp | 2 | GenBank | GenBank Downloads |
Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So aligners require that references first be indexed to accelerate later retrieval. The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the sequence alignment, and subsequent alignments are reported using coordinates referencing the original FASTA reference files.
...
There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things.
Option | Effect |
---|---|
-l | Controls the length of the seed (default = 32) |
-k | Controls the number of mismatches allowable in the seed of each alignment (default = 2) |
-n | Controls the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04) |
-t | Controls the number of threads |
The rest of the options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on. Much more information can be accessed at the BWA manual page.
...
These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes):
Building the bowtie2 mirbase index
...
Now, we're ready to actually try to do the alignment. Remember, unlike BWA, we actually need to set some options depending on what we're after. Some of the important options for bowtie2 are:
Option | Effect |
---|---|
--end-to-end or --local | Controls whether the entire read must align to the reference, or whether soft-clipping the ends is allowed to find internal alignments. Default --end-to-end |
-L | Controls the length of seed substrings generated from each read (default = 22) |
-N | Controls the number of mismatches allowable in the seed of each alignment (default = 0) |
-i | Interval between extracted seeds. Default is a function of read length and alignment mode. |
--score-min | Minimum alignment score for reporting alignments. Default is a function of read length and alignment mode. |
To decide how we want to go about doing our alignment, check out the file we're aligning with less:
...
Expand | ||
---|---|---|
| ||
Because these are short reads we do not have to adjust parameters like inter-seed distance (-i) or minimum alignment score (--min-score) that are a function of read length. If we were processing longer reads, we might need to use parameters like this, to force bowtie2 to "pretend" the read is a short, constant length: -i C,1,0 Yes, that looks complicated, and it kind of is. It's basically saying "slide the seed down the read one base at a time", and "report alignments as long as they have a minimum alignment score of 32 (16 matching bases x 2 points per match, minimum). See the bowtie2 manual (after you have had a good stiff drink) for a full explanation. |
...
Let's make a link to the mirbase index directory to make our command line simpler:
...
bwa mem was made available when we loaded the bwa module, so take a look at its usage information. The most important parameters, similar to those we've manipulated in the past two sections, are the following:
Option | Effect |
---|---|
-k | Controls the minimum seed length (default = 19) |
-w | Controls the "gap bandwidth", or the length of a maximum gap. This is particularly relevant for MEM, since it can determine whether a read is split into two separate alignments or is reported as one long alignment with a long gap in the middle (default = 100) |
-r | Controls how long an alignment must be relative to its seed before it is re-seeded to try to find a best-fit local match (default = 1.5, e.g. the value of -k multiplied by 1.5) |
-c | Controls how many matches a MEM must have in the genome before it is discarded (default = 10000) |
-t | Controls the number of threads to use |
There are many more parameters to control the scoring scheme and other details, but these are the most essential ones to use to get anything of value at all.
...