Exercise #3: BWA-MEM (and Tophat2) - Human mRNA-seq
Obviously, bioinformatics is a viciously cutthroat world. Consequently, after After Bowtie2 came out with a local alignment option, it wasn't long before BWA generated their own local-aligner called BWA-MEM (for Maximal Exact Matches). This aligner is very, very nice because it incorporates a lot of the simplicity of using BWA with the complexities of local alignment. This functionality, while enabling the alignment of datasets like the mirbase data we just examined, also permits more complex alignments, such as that of spliced mRNAs. In a long RNA-seq experiment, reads will (at some frequency) span a splice junction themselves, or a pair of reads in a paired-end library will fall on either side of a splice junction. We want to be able to align reads that do this for many reasons, from accurate transcript quantification to novel fusion transcript discovery. Thus, our last exercise will be the alignment of a human LONG RNA-seq dataset composed (by design) almost exclusively of reads that cross splice junctions.
BWA-MEM should have been loaded when we loaded the BWA module, so to look at the details of MEM alignment, just enter "bwa mem" to get the help menu with the options list. The most important parameters, similar to those we've manipulated in the past two sections, are the following:
|-k||Controls the minimum seed length (default = 19)|
|-w||Controls the "gap bandwidth", or the length of a maximum gap. This is particularly relevant for MEM, since it can determine whether a read is split into two separate alignments, or one long alignment with a long gap in the middle (default = 100)|
|-r||Controls how long an alignment must be relative to its seed before it is re-seeded to try to find a best-fit local match (default = 1.5, e.g. the value of -k multiplied by 1.5)|
|-c||Controls how many matches a MEM must have in the genome before it is discarded (default = 10000)|
|-t||Controls the number of threads to use|
There are many more parameters to control the scoring scheme and other details, but these are the most essential parameters to use to get anything of value at all.
The test file we will be working with is JUST the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end.