Overview

As we have seen, the SAMTools suite allows you to manipulate the SAM/BAM files produced by most aligners. There are many sub-commands in this suite, but the most common and useful are:

  1. Convert text-format SAM files into binary BAM files (samtools view) and vice versa
  2. Sort BAM files by reference coordinates (samtools sort)
  3. Index BAM files that have been sorted (samtools index)
  4. Filter alignment records based on BAM flags, mapping quality or location (samtools view)

Since BAM files are binary, they can't be viewed directly using standard Unix file viewers such as more, less and head. We have seen how samtools view can be used to binary-format BAM files into text format for viewing. But samtools view also has options that let you do powerful filtering of the output. We focus on this filtering capability in this set of exercises.

The most common samtools view filtering options are:

Setup

Login to stampede2.tacc.utexas.edu and start an idev session.

idev -p normal -m 180 -A UT-2015-05-18 -N 1 -n 68

Next set up a directory for these exercises, and copy an indexed BAM file there. This is a renamed version of the yeast_pairedend.sort.bam file from our Alignment workflow exercises.

mkdir -p $SCRATCH/core_ngs/samtools
cd $SCRATCH/core_ngs/samtools
cp $CORENGS/catchup/for_samtools/* .

References

Handy links

SAM header fields

The 11 SAM alignment record required fields (Tab-delimited).

SAM flags field

Meaning of each bit (flag) in the SAM alignment records flags field (column 2). The most commonly flags are denoted in red.

SAM CIGAR string

Format of the CIGAR string in column 6 of SAM alignment records.

Exercises

Analyzing the CIGAR string for indels

Suppose we want to know how many alignments included insertions or deletions (indels) versus the reference. Looking at the CIGAR field definition table above, we see that insertions are denoted with the character I and deletions with the character D. We'll use this information, along with samtools view, cut and grep, to count the number of indels.

mkdir -p $SCRATCH/core_ngs/samtools
cd $SCRATCH/core_ngs/samtools
cp $CORENGS/catchup/for_samtools/* .

We'll do this first exercise one step at a time to get comfortable with piping commands. Start by just looking at the first few alignment records of the BAM file:

module load biocontainers  # takes a while
module load samtools

cd $SCRATCH/core_ngs/samtools
samtools view yeast_pe.sort.bam | head

With all the line wrapping, it looks pretty ugly. Still, you can pick out the CIGAR string in colum 6. Let's select just that column with cut:

samtools view yeast_pe.sort.bam | cut -f 6 | head -20

Next, make sure we're only looking at alignment records that represent mapped reads. The -F 0x4 option says to filter records where the 0x4 flag (read unmapped) is 0, resulting it only mapped reads being output.

samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | head -20

Now we use grep with the pattern '[ID]' to select lines (which are now just CIGAR strings) that have Insert or Deletion operators. The brackets ( [ ] ) denote a character class pattern, which matches any of the characters inside the brackets. Be sure to ask for Perl regular expressions (-P) to be sure that this character class syntax is recognized.

samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | head 

Ok, this looks good. We're ready to run this against all the alignment records, and count the results:

samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | wc -l 

There are 6697 such records.

What is that in terms of the rate of indels? For that we need to count the total number of mapped reads. Here we can just use the -c (count only) option to samtools view.

samtools view -c -F 0x4 yeast_pe.sort.bam

There should be 547664 mapped alignments.

Knowing these two numbers we can just divide them, using awk (remember, bash only does integer arithmetic). Because we're not piping anything in to awk, any body we specify won't be executed. So we do the math in the BEGIN section:

awk 'BEGIN{ print 100*6697/547664,"%" }'

The result should be 1.22283 %.

In addition to the rate, converted to a percentage, we also output the literal percent sign ( % ). The double quotes ( " ) denote a literal string in awk, and the comma between the number and the string says to separate the two fields with whatever the default Output Field Separator (OFS) is. By default, both awk's Input Field Separator (FS) is whitespace (any number of space and Tab characters) and its Output Field Separator is a single space.

So what if we want to get fancy and do all this in a one-liner command? We can use echo and backtick evaluation to put both numbers in a string, then pipe that string to awk. This time we use awk body code, and refer to the two whitespace-separated fields being passed in: total count ($1) and indel count ($2).

echo "`samtools view -F 0x4 -c yeast_pe.sort.bam` `samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | wc -l`" | awk '{ print 100 * $2/$1,"%" }'

awk also has a printf function, which can take the standard formatting commands (see https://en.wikipedia.org/wiki/Printf_format_string#Type_field).

awk 'BEGIN{ printf("%.2f %%\n", 100*6697/547664) }'

Notes:

  • The printf arguments are enclosed in parentheses since it is a true function.
  • The 1st argument is the format string, enclosed in double quotes.
    • The %.2f format specifier says to output a floating point number with 2 digits after the decimal place.
    • The %% format specifier is used to output a single, literal percent sign.
    • Unlike the standard print statement, the printf function does not automaticlly append a newline to the output, so \n is added to the format string here.
  • Remaining arguments to printf are the values to be substituted for the format specifiers.
    • Here just our percentage computation

Filtering by location range

Sometimes you just want to examine a subset of reads in detail. Once you have a sorted and indexed BAM, you can use the coordinate filtering options of samtools view to do this. Here are some examples:

When you're interested in mapped reads (which is true most of the time) be sure to specify the -F 0x4 option, which says to filter records where the 0x4 flag (read unmapped) is 0, resulting it only mapped reads being output.

mkdir -p $SCRATCH/core_ngs/samtools
cd $SCRATCH/core_ngs/samtools
cp $CORENGS/catchup/for_samtools/* .
module load biocontainers
module load samtools
cd $SCRATCH/core_ngs/samtools

# count the number of reads mapped to chromosome 2 (chrII)
samtools view -c -F 0x4 yeast_pe.sort.bam chrII

# count the number of reads mapped to chromosomes 1 or M (chrI, chrM)
samtools view -c -F 0x4 yeast_pe.sort.bam chrI chrM

# count the number of reads mapped to chromosomes 1 that overlap coordinates 1000-2000
samtools view -c -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# since there are only 20 reads in the chrI:1000-2000 region, examine them individually
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# look at a subset of field for chrI:1000-2000 reads
#   2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9

You can also provide a BED-format file with one record for each desired overlap region: samtools view -L <bed_file>. This is a quick way to perform one of the functions of bedtools intersect (in BAM to BAM mode).

Since you will find yourself wanting to interpret the flags field, and it's easier to do that when it is represented as hexadecimal, let's use awk to help do that. 

# look at a subset of field for chrI:1000-2000 reads
#   2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9 | \
  awk 'BEGIN{FS=OFS="\t"}{$1 = sprintf("0x%x", $1); print}'

Notes:

Exercise: How many of the chrI:1000-2000 alignments are from properly paired mapped reads?

Properly paired reads have the 0x2 flag set (1). All our reads also have the 0x1 flag set because they are paired-end reads. Mapped reads will have the 0x4 flag cleared (0), and properly paired mapped reads will have the 0x8 flag cleared (0) as well (mate not unmapped = mate mapped). So all mapped proper pairs will end in 0x3 = binary 0011.

Also, using '$' in a grep pattern anchors the pattern to the end of the line (only patterns matching before a linefeed are printed).

Here's one way to do it, building on our last command line:

samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | awk '
  BEGIN{FS=OFS="\t"}
  {$2 = sprintf("0x%x", $2); print $2}' | grep -c '3$'

Note that here we don't need the line continuation character because the newline after the first single quote is part of the script command string.

Use the 0x2 flag (1 = properly paired)

Here's a simpler way of doing it:

samtools view -c -F 0x4 -f 0x2 yeast_pe.sort.bam chrI:1000-2000

About mapping quality

Mapping qualities are a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit or multi-mapping reads), and we can't be sure whether the reported location is the correct one.

Aligners also differ in whether they report alternate alignments for multi-hit reads. Some things to keep in mind:

Here are some examples of how different aligners handle reporting of multi-hit reads and their mapping qualities:

Filtering high-quality reads

Using our yeast_pe.sort.bam file, let's do some some quality filtering.

mkdir -p $SCRATCH/core_ngs/samtools
cd $SCRATCH/core_ngs/samtools
cp $CORENGS/catchup/for_samtools/* .
module load biocontainers
module load samtools

Exercise:  Use samtools view with -F, -f and -q options to create a BAM containing only mapped, properly paired, high-quality (mapQ 20+) reads. Call the output file yeast_pe.sort.filt.bam.

Note that we use the -b flag to tell samtools view to output BAM format (not the SAM format text we've been looking at).

samtools view -F 0x04 -f 0x2 -q 20 -b yeast_pe.sort.bam > yeast_pe.sort.filt.bam 
# or
samtools view -F 0x04 -f 0x2 -q 20 -b -o yeast_pe.sort.filt.bam yeast_pe.sort.bam 

Exercise: How many records are in the filtered BAM compared to the original? How many read pairs does this represent?

samtools view -c

samtools view -c yeast_pe.sort.bam
samtools view -c yeast_pe.sort.filt.bam

There were 1,184,360 alignment records in the original BAM, and only 456,890 in the quality-filtered BAM, around 38% of our starting reads.

Since we have only properly paired reads, the filtered BAM will contain equal numbers of both R1s and R2s. So the number of read pairs is 456890/2 or 228451.

Exercise: How many primary aligned reads (0x100 = 0) are in the bwa_local.sort.dup.bam file?

samtools view only pays attention to one -F or -f option, so to specify more than one flag value you need to combine them into one number. For example, combining 0x100 and 0x4 yields 0x104.

You want both the secondary (0x100) and unmapped (0x4) flags to be 0.

samtools view -F 0x104 -c bwa_local.sort.dup.bam

There are 874791 primary alignments.

samtools view -F 0x4 -f 0x100 -c bwa_local.sort.dup.bam

And 133209 secondary alignments.