Even though many mapping tools exist, a few individual programs have a dominant "market share" of the NGS world. In this section, we will primarily focus on two of the most versatile general-purpose aligners: BWA and Bowtie2 (the latter being part of the Tuxedo suite, which includes the transcriptome-aware RNA-seq aligner Tophat2 as well as other downstream quantification tools).
These are descriptions of the FASTQ files we copied:
|Sample_Yeast_L005_R1.cat.fastq.gz||Paired-end Illumina, First of pair, FASTQ||Yeast ChIP-seq|
|Sample_Yeast_L005_R2.cat.fastq.gz||Paired-end Illumina, Second of pair, FASTQ||Yeast ChIP-seq|
|human_rnaseq.fastq.gz||Paired-end Illumina, First of pair only, FASTQ||Human RNA-seq|
|human_mirnaseq.fastq.gz||Single-end Illumina, FASTQ||Human microRNA-seq|
|cholera_rnaseq.fastq.gz||Single-end Illumina, FASTQ||V. cholerae RNA-seq|
Before we get to alignment, we need a reference to align to. This is usually an organism's genome, but can also be any set of named sequences, such as a transcriptome or other set of genes.
Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So aligners require that references first be indexed to accelerate lookup. The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done.
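To give a feel for why a Burrows-Wheeler index speeds up lookup, here is a minimal Python sketch. It is a toy illustration only, not how BWA or Bowtie2 are actually implemented: real aligners build the transform with far more memory-efficient algorithms and store sampled lookup tables. The function names `bwt` and `count_occurrences` and the example sequence are hypothetical, chosen for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations, keep the last column.

    Assumes the sentinel does not occur in the text and sorts before
    every other character (true for "$" versus A/C/G/T in ASCII).
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact matches of pattern using backward search on the BWT.

    The pattern is consumed right to left; each step narrows a range of
    sorted rotations, so the work depends on the pattern length rather
    than on scanning the whole reference. (Here occ() is a slow linear
    count for clarity; a real index would precompute it.)
    """
    # first_row[c]: index of the first sorted rotation starting with c
    first_row = {}
    for i, c in enumerate(sorted(bwt_str)):
        first_row.setdefault(c, i)

    def occ(c: str, end: int) -> int:
        # number of times c appears in bwt_str[:end]
        return bwt_str.count(c, 0, end)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("ACAACG")          # toy "reference genome"
    print(index)
    print(count_occurrences(index, "AC"))
```

The key point is that the "read" is matched against the transformed string one base at a time, without ever scanning the original sequence linearly.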
Building a reference index involves taking a FASTA file as input, with each contig (contiguous string of bases, e.g. a chromosome) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used to perform the sequence alignment, and alignments are reported using coordinates referencing names and offset positions based on the original FASTA file contig entries.
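Because alignments are reported against the contig names in the reference FASTA, it can help to check what those names and lengths are before building an index. The following is a small sketch under stated assumptions: the helper `contig_lengths` is hypothetical, the example contig names (`chrI`, `chrII`) are made up, and it assumes the contig name is the first whitespace-delimited token on each `>` header line.

```python
import io


def contig_lengths(handle):
    """Return {contig_name: sequence_length} for a FASTA file-like object.

    Assumption: the contig name is the first whitespace-delimited token
    after ">" on each header line; the rest of the header is ignored.
    """
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    # Made-up two-contig FASTA used only for illustration
    example = io.StringIO(">chrI example contig\nACGT\nAC\n>chrII\nGG\n")
    for contig, length in contig_lengths(example).items():
        print(contig, length)
```

To run this on a real reference, pass an open file handle instead of the `io.StringIO` example, e.g. `contig_lengths(open(path))` where `path` is your FASTA file.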
We can quickly index the references for the yeast genome, the human miRNAs, and the V. cholerae genome because they are all small, so we'll grab the appropriate FASTA files for the yeast and human miRNA references and build each index right before we use it. We will also obtain the special GenBank file that contains both the V. cholerae genome sequence and annotations (a .gbk file). These FASTA files, which you staged above, are: