Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Mapping qualities are a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit or multi-mapping reads), and we can't be sure whether the reported location is the correct one.

...

  • Alternate locations for a mapped read are are will be flagged as secondary (flag 0x100)
  • While they often provide valuable information, secondary reads must be filtered for some downstream applications
    • e.g., ChIP-seq peak calling and variant analysis with GATK.
  • When secondary reads are reported, the total number of alignment records in the BAM file is greater than the number of reads in the input FASTQ files!
    • this affects how the true mapping rate must be calculated
    • true mapping rate = (pirmary mapped reads) / (total BAM file sequences - secondary mapped reads)

...

  • bwa aln (global alignment) and bowtie2 with default parameters (both --local default end-to-end mode) report at most one location for a read that maps
    • this will be the location with the best mapping quality and alignment
    • if a given read maps equally well to multiple locations, these aligners pick one location at random
      • bwa aln will always report a 0 mapping quality for these multi-hit reads
      • bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
  • bwa mem (local alignment) can always report more than one location for a mapped read
    • its definition of a secondary alignment is different (and a bit non-standard)
      • if one part of a read maps to one location and another part maps somewhere else (e.g. because of RNA splicing), the longer alignment is marked as primary and the shorter as secondary.
    • there is no way to disable reporting of secondary alignments with bwa mem.
      • but they can be filtered from the sorted BAM with -F 0x100 (secondary alignment flag = 0).
  • bowtie2 can be configured to report more than one location for a mapped read
    • the -k N option says to report up to N alignments for a read
  • most transcriptome-aware RNA-seq aligners by default report more than one location for a mapped read
    • e.g. hisat2, star, tophat2.
    • when reads are quantified (counted with respect to genes), multiply-mapped reads can be counted fractionally
      • e.g. if a read maps to 5 genes, it can be counted as 1/5 for each of the genes

Filtering for high-quality reads

Using our yeast_pe.sort.bam file, let's do some some quality filtering.

...

Expand
titleSolution

Note that we use the -b flag to tell samtools view to output BAM format (not the SAM format text we've been looking at).

Code Block
languagebash
titlesolution
samtools view -b -F 0x040x4 -f 0x2 -q 20 yeast_pe.sort.bam > yeast_pe.sort.filt.bam 
# or
samtools view -b -F 0x040x4 -f 0x2 -q 20 -o yeast_pe.sort.filt.bam  yeast_pe.sort.bam 


...

Expand
titleAnswer


Code Block
languagebash
titlesolution
samtools view -c yeast_pe.sort.bam
samtools view -c yeast_pe.sort.filt.bam

# or to be really fancy...
echo "`samtools view -c yeast_pe.sort.filt.bam` \
      `samtools view -c yeast_pe.sort.filt.bam`" | \
       awk '{printf "%d original reads\n", $1;
             printf "%d Q20 reads (%d read pairs)\n", $2, $2/2;
             printf "%0.2f%% high-quality reads\n", 100*$2/$1/$2}'

There were 1,184,360 alignment records in the original BAM, and only 456,890 in the quality-filtered BAM, around 38% of our starting reads.

Since we have only properly paired reads, the filtered BAM will contain equal numbers of both R1s and R2s. So the number of read pairs is 456890/2 or 228451.

...