While we have focused on aligning eukaryotic data, the same tools can be used to perform identical functions with prokaryotic data. The major differences are less about the underlying data and much more about the external/public databases established to store and distribute reference data. For example, the Illumina iGenome resource provides pre-processed and uniform reference data, designed to be out-of-the-box compatible with aligners like bowtie2 and bwa. However, the limited number of available species are heavily biased towards model eukaryotes. If we wanted to study a prokaryote, the reference data must be downloaded from a resource like GenBank, and processed/indexed similarly to the procedure for mirbase. However, while the alignment procedure is broadly analogous, the reference preparation process is somewhat different. Here, and will involve use of a biologically-oriented scripting library called BioPerl. In this exercise, we will use some RNA-seq data from Vibrio cholerae, published last year on GEO here, and align it to a reference genome.
Overview of Vibrio cholerae alignment workflow with Bowtie2 (essentially
- Prepare the vibCho reference index for bowtie2 from GenBank files using BioPerl
- Align reads using bowtie2, producing a SAM file
- Convert the SAM file to a BAM file (samtools view)
- Sort the BAM file by genomic location (samtools sort)
- Index the BAM file (samtools index)
- Gather simple alignment statistics (samtools flagstat and samtools idxstat)
Exercise #4: BWA-MEM - Human mRNA-seq
Obtaining GenBank sequence files for Vibrio Cholerae
Converting GenBank files into usable reference FASTA files
BioPerl is available through the module system, so load it up like we have done before, with the caveat that we load regular Perl before we load BioPerl.
module load perl module load bioperl
This makes several scripts directly available to you. The one we will use is called "bp_seqconvert.pl," and it is used to interconvert file formats between FASTA, GBK, and others. While we will use it to obtain a FASTA sequence, this same program can also be used to get gene annotations, among other genomic information, from the same GenBank file.
Building the bowtie2 vibCho index
Performing the bowtie2 alignment
(this will stop here because samtools index/idxstats/flagstat will be Exercise #5)
Exercise #4: BWA-MEM - Human mRNA-seq
After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:
Note that Tophat also reports secondary alignments, but they have a different meaning. Tophat always reports spliced alignments as one alignment records with the N CIGAR string operator indicating the gaps. Secondary alignments for Tophat (marked with the 0x100 BAM flag) represent alternate places in the genome where a read (spliced or not) may have mapped.As you can imagine from this series of steps, Tophat is very computationally intensive and takes much longer than bwa mem – very large alignments (hundreds of millions of reads) may not complete in stampede's 48 hour maximum job time!alignments as one alignment records with the N CIGAR string operator indicating the gaps. Secondary alignments for Tophat (marked with the 0x100 BAM flag) represent alternate places in the genome where a read (spliced or not) may have mapped.
As you can imagine from this series of steps, Tophat is very computationally intensive and takes much longer than bwa mem – very large alignments (hundreds of millions of reads) may not complete in stampede's 48 hour maximum job time!
Exercise #5: Simple samtools utilities
We have used several alignment methods that all generate results in the form of the near-universal SAM/BAM file format. The samtools program is an extremely widespread set of tools that allow a user to manipulate SAM/BAM files in many different ways, ranging from simple tasks like converting SAM to BAM (and vice versa) to more complex functions like the removal of PCR duplicates or the filtering of alignments by various metrics. In this exercise, we will explore three very simple utilities provided by samtools: index, flagstat, and idxstats. Each of these is executed in one line for a given SAM/BAM file. In the next section, we will explore samtools much more in depth, and show how it can be connected to other post-alignment tools.