Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleAnswer
Code Block
languagebash
module load samtools
cd $SCRATCH/core_ngs/alignment/vibCho
bowtie2 --local -x vibCho/vibCho.O395 -U fq/cholera_rnaseq.fastq.gz 2>aln_local.log | \
  samtools view -b > cholera_rnaseq.local.bam

Reports these alignment statistics:

Code Block
89006 reads; of these:
  89006 (100.00%) were unpaired; of these:
    27061 (30.40%) aligned 0 times
    33828 (38.01%) aligned exactly 1 time
    28117 (31.59%) aligned >1 times
69.60% overall alignment rate

Interestingly, the local alignment rate here is lower than we saw with the gloabl alignment.

Exercise #5: BWA-MEM - Human mRNA-seq

After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:

  • It incorporates a lot of the simplicity of using bwa with the complexities of local alignment, enabling straightforward alignment of datasets like the mirbase data we just examined
  • It can align different portions of a read to different locations on the genome
    • In a total RNA-seq experiment, reads will (at some frequency) span a splice junction themselves
      • or a pair of reads in a paired-end library will fall on either side of a splice junction.
    • We want to be able to align these splice-adjacent reads for many reasons, from accurate transcript quantification to novel fusion transcript discovery.

This exercise will align a human total RNA-seq dataset composed (by design) almost exclusively of reads that cross splice junctions.

A word about real splice-aware aligners

Using BWA mem for RNA-seq alignment is sort of a "poor man's" RNA-seq alignment method. Real splice-aware aligners like tophat2, hisat2 or STAR have more complex algorithms (as shown below) – and take a lot more time!

Image Added

In the transcriptome-aware alignment above, reads that span splice junctions are reported in the SAM file with genomic coordinates that start in the first exon and end in the second exon (the CIGAR string uses the N operator, e.g. 30M1000N60M).

BWA MEM does not know about the exon structore of the genome. But it can align different sub-sections of a read to two different locations, producing two alignment records from one input read (one of the two will be marked as secondary (0x100 flag).

BWA MEM splits junction-spanning reads into two alignment records

Image Added

Setup for BWA mem

First set up our working directory for this alignment. Since it takes a long time to build a bwa index for a large genome (here human hg38/GRCh38), we'll use one that the BioITeam maintains in its /work2/projects/BioITeam/ref_genome area.

Code Block
languagebash
titleRun multiple alignments using the TACC batch system
# Make sure you're in an idev session
idev -m 120 -p normal -A UT-2015-05-18 -N 1 -n 68

# Load the modules we'll need
module load biocontainers
module load bwa
module load samtools

# Copy over the FASTQ data if needed
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*.gz $SCRATCH/core_ngs/alignment/fastq/

# Make a new alignment directory for running these scripts
mkdir -p $SCRATCH/core_ngs/alignment/bwamem
cd $SCRATCH/core_ngs/alignment/bwamem
ln -sf ../fastq
ln -sf /work2/projects/BioITeam/ref_genome/bwa/bwtsw/hg38

Now take a look at bwa mem usage (type bwa mem with no arguments). The most important parameters are the following:

OptionEffect
-kControls the minimum seed length (default = 19)
-wControls the "gap bandwidth", or the length of a maximum gap. This is particularly relevant for MEM, since it can determine whether a read is split into two separate alignments or is reported as one long alignment with a long gap in the middle (default = 100)
-MFor split reads, mark the shorter read as secondary
-rControls how long an alignment must be relative to its seed before it is re-seeded to try to find a best-fit local match (default = 1.5, e.g. the value of -k multiplied by 1.5)
-cControls how many matches a MEM must have in the genome before it is discarded (default = 10000)
-tControls the number of threads to use

RNA-seq alignment with bwa mem

Based on its help info, this is the structure of the bwa mem command we will use:

Code Block
bwa mem -M <ref.fa> <reads.fq> > outfile.sam

After performing the setup above, execute the following command in your idev session:

Code Block
languagebash
cd $SCRATCH/core_ngs/alignment/bwamem
bwa mem -M hg38/hg38.fa fastq/human_rnaseq.fastq.gz 2>hs_rna.bwamem.log |
  samtools view -b | \
  samtools sort -O BAM -T human_rnaseq.tmp > human_rnaseq.sort.bam

This multi-pipe command performs three steps:

  • The bwa mem alignment
    • the program's progress output (on standard error) is redirected to a log file (2>hs_rna.bwamem.log)
    • its alignment records (on standard output) is piped to the next step (conversion to BAM)
  • Conversion of bwa mem's SAM output to BAM format
    • recall that the -b option to samtools view says to output in BAM format
  • Sorting the BAM file

Because the progress output is being redirected to a log file, we won't see anything until the command completes. Then you should have a human_rnaseq.sort.bam file and an hs_rna.bwamem.log logfile.

Exercise: Compare the number of original FASTQ reads to the number of alignment records.

Expand
titleHint:

Use the zcat | wc -l  | awk idiom to count FASTQ reads.

Use samtools flagstat to report alignment statistics.

Expand
titleAnswer:

Count the FASTQ file reads:

Code Block
languagebash
cd $SCRATCH/core_ngs/alignment/bwamem
zcat ./fastq/human_rnaseq.fastq.gz | wc -l | awk '{print $1/4}'

The file has 100,000 reads.

Generate alignment statistics from the sorted BAM file:

Code Block
languagebash
cd $SCRATCH/core_ngs/alignment/bwamem
samtools flagstat human_rnaseq.sort.bam | tee hs_rnaseq.flagstat.txt

Output will look like this:

Code Block
133570 + 0 in total (QC-passed reads + QC-failed reads)
33570 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
133450 + 0 mapped (99.91% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There were 133,570 alignment records reported for the 100,000 input reads.

Because bwa mem can split reads and report two alignment records for the same read, there are 33,570 secondary reads reported here.

Tip

Be aware that some downstream tools (for example the Picard suite, often used before SNP calling) do not like it when a read name appears more than once in the SAM file. Such reads can be filtered, but only if they can be identified as secondary by specifying the bwa mem -M option as we did above. This option leaves the longest alignmen normally but marks additional alignments for the read as secondary (the 0x100 BAM flag). This designation also allows you to easily filter the secondary reads with samtools view -F 0x104 if desired.