Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Even though many mapping tools exist, a few individual programs have a dominant "market share" of the NGS world. In this section, we will primarily focus on two of the most versatile general-purpose alignersones: BWA and Bowtie2 (the latter being part of the Tuxedo suite which includes the transcriptome-aware RNA-seq aligner Tophat2 as well as other downstream quantifiaction tools).

...

These are descriptions of the FASTQ files we copied:

File NameDescriptionSample
Sample_Yeast_L005_R1.cat.fastq.gzPaired-end Illumina, First of pair, FASTQYeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gzPaired-end Illumina, Second of pair, FASTQYeast ChIP-seq
human_rnaseq.fastq.gzPaired-end Illumina, First of pair only, FASTQHuman RNA-seq
human_mirnaseq.fastq.gzSingle-end Illumina, FASTQHuman microRNA-seq
cholera_rnaseq.fastq.gzSingle-end Illumina, FASTQV. cholerae RNA-seq

...

Before we get to alignment, we need a reference to align to. This is usually an organism's genome, but can also be any set of names sequences, such as a transcriptome or other set of genes.

...

Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So aligners require that references first be indexed to accelerate lookup. The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done.

Building a reference index involves taking a FASTA file as input, with each chromosome (or contig contig (contiguous string of bases, e.g. a chromosome) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used to perform the sequence alignment, and alignments are reported using coordinates referencing names and offset positions based on the original FASTA file contig entries.

We can quickly index the references for the yeast genome, the human miRNAs, and the V. cholerae genome, because they are all small, so we'll grab build each index from the appropriate FASTA files for yeast and human miRNAs references and build each index right before we use them. We will also obtain the special GenBank file that contains both the V. cholerae genome sequence and annotations (a .gbk file). These FASTA files, which you staged above, are:

...