Page History

...

Tip

title	Reservations

Use our summer school reservation (CoreNGSday5CoreNGS-Fri) when submitting batch jobs to get higher priority on the ls6 normal queue today:

sbatch --reservation=CoreNGSday5 CoreNGS-Fri <batch_file>.slurm idev -m 180 -N 1 -A OTH21164 -r CoreNGSday5CoreNGS-Fri

Table of Contents

Overview

...

Code Block

language	bash
title	Start an idev session

idev -m 180 -N 1 -A OTH21164 -r CoreNGSday5 CoreNGS-Fri
# or
idev -m 90 -N 1 -A OTH21164 -p development

Next set up a directory for these exercises, and copy an indexed BAM file there. This is a renamed version of the yeast_pairedendpe.sort.bam file from our The Basic Alignment workflow Workflow exercises.

Code Block

language	bash
title	Setup for samtools exercises

mkdir -p $SCRATCH/core_ngs/samtools
cd $SCRATCH/core_ngs/samtools
cp $CORENGS/catchup/yeastfor_stuffsamtools/*.bam* .

References

Handy links

The SAM format specification
- especially section 1.4 - alignment section fields
Manual for SAMTools
- especially the 1st section on samtools view.

...

Code Block

language	bash

module load biocontainers  # takes a while
module load samtools

cd $SCRATCH/core_ngs/samtools
samtools view yeast_chip_pe.sort.bam | head

With all the line wrapping, it looks pretty ugly. Still, you can pick out the CIGAR string in column 6. Let's select just that column with cut:

Code Block

language	bash

samtools view yeast_chip_pe.sort.bam | cut -f 6 | head -20

...

Code Block

language	bash

samtools view -F 0x4 yeast_chip_pe.sort.bam | cut -f 6 | head -20

Now we use grep with the pattern '[ID]' to select lines (which are now just CIGAR strings) that have Insert or Deletion operators. The brackets ( [ ] ) denote a character class pattern, which matches any of the characters inside the brackets. Be sure to ask for Perl regular expressions (-P) to be sure so that this character class syntax is recognized.

Code Block

language	bash

samtools view -F 0x4 yeast_chip_pe.sort.bam | cut -f 6 | grep -P '[ID]' | head

...

Code Block

language	bash
title	Count reads that mappedi with indels

samtools view -F 0x4 yeast_chip_pe.sort.bam | cut -f 6 | grep -P '[ID]' | wc -l

...

Code Block

language	bash
title	Count all mapped reads

samtools view -c -F 0x4 yeast_chip_pe.sort.bam

There should be 547664 mapped alignments.

...

In addition to the rate, converted to a percentage, we also output the literal percent sign ( % ). The double quotes ( " ) denote a literal string in awk, and the comma between the number and the string says to separate the two fields with whatever the default Output Field Separator (OFS) is. By default, both awk's Input Field Separator (FS) is whitespace (any number of space and Tab characters) and its Output Field Separator is a single space.

So what if we want to get fancy and do all this in a one-liner command? We can use echo and backtick evaluation to put both numbers in a string, then pipe that string to awk. This time we use awk body code, and refer to the two whitespace-separated fields being passed in: total count ($1) and indel count ($2).

Code Block

language	bash
title	One-liner for calculating BAM alignment indel rate

echo "`samtools view -F 0x4 -c yeast_chip_pe.sort.bam` `samtools \
    $( samtools view -F 0x4 yeast_chip_pe.sort.bam | cut -f 6 | grep -P '[ID]' | wc -l`l )" \
    | awk '{ print 100 * $2/$1,"%" }'

...

Tip

awk also has a printf function, which can take the standard formatting commands (see https://en.wikipedia.org/wiki/Printf_format_string#Type_field).

Code Block

language	bash

awk 'BEGIN{ printf("%.2f %%\n", 100*6697/547664) }'

Notes:

The printf arguments are enclosed in parentheses since it is a true function.
The 1st argument is the format string, enclosed in double quotes.
- The %.2f format specifier says to output a floating point number with 2 digits after the decimal place.
- The %% format specifier is used to output a single, literal percent sign.
- Unlike the standard print statement, the printf function does not automatically append a newline to the output, so \n is added to the format string here.
Remaining arguments to printf are the values to be substituted for the format specifiers (here our percentage computation).

...

Code Block

language	bash

# look at a subset of field for chrI:1000-2000 reads
#   2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9 | \
  awk 'BEGIN{FS=OFS="\t"}
       {$1 = sprintf("0x%x", $1); print}'

...

Mapping qualities are a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit or multi-mapping reads), and we can't be sure whether the reported location is the correct one.

...

Alternate locations for a mapped read are are will be flagged as secondary (flag 0x100)
While they often provide valuable information, secondary reads must be filtered for some downstream applications
- e.g., ChIP-seq peak calling and variant analysis with GATK.
When secondary reads are reported, the total number of alignment records in the BAM file is greater than the number of reads in the input FASTQ files!
- this affects how the true mapping rate must be calculated
- true mapping rate = (pirmary mapped reads) / (total BAM file sequences - secondary mapped reads)

...

bwa aln (global alignment) and bowtie2 with default parameters (both --local default end-to-end mode) report at most one location for a read that maps
- this will be the location with the best mapping quality and alignment
- if a given read maps equally well to multiple locations, these aligners pick one location at random
  - bwa aln will always report a 0 mapping quality for these multi-hit reads
  - bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
bwa mem (local alignment) can always report more than one location for a mapped read
- its definition of a secondary alignment is different (and a bit non-standard)
  - if one part of a read maps to one location and another part maps somewhere else (e.g. because of RNA splicing), the longer alignment is marked as primary and the shorter as secondary.
- there is no way to disable reporting of secondary alignments with bwa mem.
  - but they can be filtered from the sorted BAM with -F 0x100 (secondary alignment flag = 0).
bowtie2 can be configured to report more than one location for a mapped read
- the -k N option says to report up to N alignments for a read
most transcriptome-aware RNA-seq aligners by default report more than one location for a mapped read
- e.g. hisat2, star, tophat2.
- when reads are quantified (counted with respect to genes), multiply-mapped reads can be counted fractionally
  - e.g. if a read maps to 5 genes, it can be counted as 1/5 for each of the genes

Filtering for high-quality reads

Using our yeast_pe.sort.bam file, let's do some some quality filtering.

...

Expand

title	Solution

Note that we use the -b flag to tell samtools view to output BAM format (not the SAM format text we've been looking at).

Code Block

language	bash
title	solution

samtools view -b -F 0x040x4 -f 0x2 -q 20 yeast_pe.sort.bam > yeast_pe.sort.filt.bam 
# or
samtools view -b -F 0x040x4 -f 0x2 -q 20 -o yeast_pe.sort.filt.bam  yeast_pe.sort.bam

...

Expand

title	Answer

Code Block

language	bash
title	solution

samtools view -c yeast_pe.sort.bam
samtools view -c yeast_pe.sort.filt.bam

# or to be really fancy...
echo "`samtools view -c yeast_pe.sort.filt.bam`bam` \
      `samtools view -c yeast_pe.sort.filt.bam`" | \
       awk '{printf "%d original reads\n", $1;
             printf "%d Q20 reads (%d read pairs)\n", $2, $2/2;
             printf "%0.2f%% high-quality reads\n", 100*$2/$1/$2}'

There were 1,184,360 alignment records in the original BAM, and only 456,890 in the quality-filtered BAM, around 38% of our starting reads.

Since we have only properly paired reads, the filtered BAM will contain equal numbers of both R1s and R2s. So the number of read pairs is 456890/2 or 228451.

...

Page tree

Versions Compared

Old Version 46

New Version Current

Key

Overview

References

Filtering for high-quality reads