Page History

...

Expand

title	Answer

Code Block

language	bash

module load samtools
cd $SCRATCH/core_ngs/alignment/vibCho
bowtie2 --local -x vibCho/vibCho.O395 -U fq/cholera_rnaseq.fastq.gz 2>aln_local.log | \
  samtools view -b > cholera_rnaseq.local.bam

Reports these alignment statistics:

Code Block
89006 reads; of these: 89006 (100.00%) were unpaired; of these: 27061 (30.40%) aligned 0 times 33828 (38.01%) aligned exactly 1 time 28117 (31.59%) aligned >1 times 69.60% overall alignment rate

Interestingly, the local alignment rate here is lower than we saw with the gloabl alignment.

Exercise #5: BWA-MEM - Human mRNA-seq

After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:

It incorporates a lot of the simplicity of using bwa with the complexities of local alignment, enabling straightforward alignment of datasets like the mirbase data we just examined
It can align different portions of a read to different locations on the genome
- In a total RNA-seq experiment, reads will (at some frequency) span a splice junction themselves
  - or a pair of reads in a paired-end library will fall on either side of a splice junction.
- We want to be able to align these splice-adjacent reads for many reasons, from accurate transcript quantification to novel fusion transcript discovery.

This exercise will align a human total RNA-seq dataset composed (by design) almost exclusively of reads that cross splice junctions.

A word about real splice-aware aligners

Using BWA mem for RNA-seq alignment is sort of a "poor man's" RNA-seq alignment method. Real splice-aware aligners like tophat2, hisat2 or STAR have more complex algorithms (as shown below) – and take a lot more time!

Image Added

In the transcriptome-aware alignment above, reads that span splice junctions are reported in the SAM file with genomic coordinates that start in the first exon and end in the second exon (the CIGAR string uses the N operator, e.g. 30M1000N60M).

BWA MEM does not know about the exon structore of the genome. But it can align different sub-sections of a read to two different locations, producing two alignment records from one input read (one of the two will be marked as secondary (0x100 flag).

BWA MEM splits junction-spanning reads into two alignment records
Image Added

Setup for BWA mem

First set up our working directory for this alignment. Since it takes a long time to build a bwa index for a large genome (here human hg38/GRCh38), we'll use one that the BioITeam maintains in its /work2/projects/BioITeam/ref_genome area.

Code Block

language	bash
title	Run multiple alignments using the TACC batch system

# Make sure you're in an idev session
idev -m 120 -p normal -A UT-2015-05-18 -N 1 -n 68

# Load the modules we'll need
module load biocontainers
module load bwa
module load samtools

# Copy over the FASTQ data if needed
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*.gz $SCRATCH/core_ngs/alignment/fastq/

# Make a new alignment directory for running these scripts
mkdir -p $SCRATCH/core_ngs/alignment/bwamem
cd $SCRATCH/core_ngs/alignment/bwamem
ln -sf ../fastq
ln -sf /work2/projects/BioITeam/ref_genome/bwa/bwtsw/hg38

Now take a look at bwa mem usage (type bwa mem with no arguments). The most important parameters are the following:

Option	Effect
-k	Controls the minimum seed length (default = 19)
-w	Controls the "gap bandwidth", or the length of a maximum gap. This is particularly relevant for MEM, since it can determine whether a read is split into two separate alignments or is reported as one long alignment with a long gap in the middle (default = 100)
-M	For split reads, mark the shorter read as secondary
-r	Controls how long an alignment must be relative to its seed before it is re-seeded to try to find a best-fit local match (default = 1.5, e.g. the value of -k multiplied by 1.5)
-c	Controls how many matches a MEM must have in the genome before it is discarded (default = 10000)
-t	Controls the number of threads to use

RNA-seq alignment with bwa mem

Based on its help info, this is the structure of the bwa mem command we will use:

Code Block
bwa mem -M <ref.fa> <reads.fq> > outfile.sam

After performing the setup above, execute the following command in your idev session:

Code Block

language	bash

cd $SCRATCH/core_ngs/alignment/bwamem
bwa mem -M hg38/hg38.fa fastq/human_rnaseq.fastq.gz 2>hs_rna.bwamem.log |
  samtools view -b | \
  samtools sort -O BAM -T human_rnaseq.tmp > human_rnaseq.sort.bam

This multi-pipe command performs three steps:

The bwa mem alignment
- the program's progress output (on standard error) is redirected to a log file (2>hs_rna.bwamem.log)
- its alignment records (on standard output) is piped to the next step (conversion to BAM)
Conversion of bwa mem's SAM output to BAM format
- recall that the -b option to samtools view says to output in BAM format
Sorting the BAM file

Because the progress output is being redirected to a log file, we won't see anything until the command completes. Then you should have a human_rnaseq.sort.bam file and an hs_rna.bwamem.log logfile.

Exercise: Compare the number of original FASTQ reads to the number of alignment records.

Expand

title	Hint:

Use the zcat | wc -l | awk idiom to count FASTQ reads.

Use samtools flagstat to report alignment statistics.

Expand

title	Answer:

Count the FASTQ file reads:

Code Block

language	bash

cd $SCRATCH/core_ngs/alignment/bwamem
zcat ./fastq/human_rnaseq.fastq.gz | wc -l | awk '{print $1/4}'

The file has 100,000 reads.

Generate alignment statistics from the sorted BAM file:

Code Block

language	bash

cd $SCRATCH/core_ngs/alignment/bwamem
samtools flagstat human_rnaseq.sort.bam | tee hs_rnaseq.flagstat.txt

Output will look like this:

Code Block

133570 + 0 in total (QC-passed reads + QC-failed reads)
33570 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
133450 + 0 mapped (99.91% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There were 133,570 alignment records reported for the 100,000 input reads.

Because bwa mem can split reads and report two alignment records for the same read, there are 33,570 secondary reads reported here.

Tip

Be aware that some downstream tools (for example the Picard suite, often used before SNP calling) do not like it when a read name appears more than once in the SAM file. Such reads can be filtered, but only if they can be identified as secondary by specifying the bwa mem -M option as we did above. This option leaves the longest alignmen normally but marks additional alignments for the read as secondary (the 0x100 BAM flag). This designation also allows you to easily filter the secondary reads with samtools view -F 0x104 if desired.

Page tree

Versions Compared

Old Version 13

New Version 14

Key

Exercise #5: BWA-MEM - Human mRNA-seq

A word about real splice-aware aligners

Setup for BWA mem

RNA-seq alignment with bwa mem