Filtering with SAMTools

Overview

As we have seen, the SAMTools suite allows you to manipulate the SAM/BAM files produced by most aligners. There are many sub-commands in this suite, but the most common and useful use cases are:

Convert text-format SAM files into binary BAM files (samtools view) and vice versa
Sort BAM files by reference coordinates (samtools sort)
Index BAM files that have been sorted (samtools index)
Filter alignment records based on BAM flags, mapping quality or location

Since BAM files are binary, they can't be viewed directly using standard Unix file viewers such as more, less and head. We have seen how samtools view can be used to binary-format BAM files into text format for viewing. But samtools view also has options that let you do powerul filtering of the output. We focus on this filtering capability in this set of exercises.

Setup

Login to login5.ls5.tacc.utexas.edu and set up a directory for these exercises and copy an indexed BAM file there. This is a renamed version of the yeast_pairedend.sort.bam file from our Alignment workflow exercises.

Setup for samtools exercises

mkdir -p $SCRATCH/core_ngs/samtools
cd $SCRATCH/core_ngs/samtools
cp $CORENGS/catchup/for_samtools/* .

References

Handy links

The SAM format specification
- especially section 1.4 - alignment section fields
Manual for SAMTools 1.3.1
- especially the 1st section on samtools view.

SAM header fields

The 11 SAM alignment record required fields (Tab-delimited).

SAM flags field

Meaning of each bit (flag) in the SAM alignment records flags field (column 2). The most commonly querried flags are denoted in red.

SAM CIGAR string

Format of the CIGAR string in column 6 of SAM alignment records.

Exercises

Analyzing the CIGAR string for indels

Suppose we want to know how many alignments included insertions or deletions (indels) versus the reference. Looking at the CIGAR field definition table above, we see that insertions are denoted with the character I and deletions with the character D. We'll use this information, along with samtools view, cut and grep, to count the number of indels.

We'll do this first exercise one step at a time to get comfortable with piping commands. Start by just looking at the first few alignment records of the BAM file:

cd $SCRATCH/core_ngs/alignment/samtools
module load samtools
samtools view yeast_pe.sort.bam | head

With all the line wrapping, it looks pretty ugly. Still, you can pick out the CIGAR string in colum 6. Let's select just that column with cut:

samtools view yeast_pe.sort.bam | cut -f 6 | head

Next, make sure we're only looking at alignment records that represent mapped reads. The -F 0x4 option says to filter records where the 0x4 flag (read unmapped) is 0, resulting it only mapped reads being output.

samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | head

Now we use grep with the pattern '[ID]' to select lines (which are now just CIGAR strings) that have Insert or Deletion operators. The brackets ( [ ] ) denote a character class pattern, which matches any of the characters inside the brackets. Be sure to ask for Perl regular expressions (-P) so that this character class syntax is recognized.

samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | head

Ok, this looks good. We're ready to run this against all the alignment records, and count the results:

samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | wc -l

There are 6697 such records.

What is that in terms of the rate of indels? For that we need to count the total number of mapped reads. Here we can just use the -c (count only) option to samtools view.

samtools view -F 0x4 -c yeast_pe.sort.bam

There should be 547664 mapped alignments.

Knowing these two numbers we can just divide them, using awk (remember, bash only does integer arithmetic). Because we're not piping anything in to awk, any body we specify won't be executed. So we do the math in the BEGIN section:

awk 'BEGIN{ print 100*6697/547664,"%" }'

The result should be 1.22283 %.

In addition to the rate, converted to a percentage, we also output the literal percent sign ( % ). The double quotes ( " ) denote a literal string in awk, and the comma between the number and the string says to separate the two fields with whatever the default Output Field Separator (OFS) is. By default, both awk's Input Field Separator (FS) and Output Field Separator are whitespace (any number of space characters).

So what if we want to get fancy and do all this in a one-liner command? We can use echo and backtick evaluation to put both numbers in a string, then pipe that string to awk. This time we use awk body code, and refer to the two whitespace-separated fields being passed in: total count ($1) and indel count ($2).

echo "`samtools view -F 0x4 -c yeast_pe.sort.bam` `samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | wc -l`" | awk '{ print 100 * $2/$1,"%" }'

Filtering by location range

Sometimes you just want to examine a subset of reads in detail. Once you have a sorted and indexed BAM, you can use the coordinate filtering options of samtools view to do this. Here are some examples:

# if needed...
cd $SCRATCH/core_ngs/alignment/samtools
module load samtools

# count the number of reads mapped to chromosome 2 (chrII)
samtools view -c -F 0x4 yeast_pe.sort.bam chrII

# count the number of reads mapped to chromosomes 1 or M (chrI, chrM)
samtools view -c -F 0x4 yeast_pe.sort.bam chrI chrM

# count the number of reads mapped to chromosomes 1 that overlap coordinates 1000-2000
samtools view -c -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# since there are only 20 reads in the chrI:1000-2000 region, examine them individually
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# look at a subset of field for chrI:1000-2000 reads
# 2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9

The code block below details some basic samtools functionality:

Filtering high quality reads

The code block below details some basic samtools functionality:

basic samtools functionality

samtools view -o outfile_view.bam infile.bam  # use the -c option to just count alignment records
samtools sort infile.bam outfile.sorted.bam
samtools index aln.sorted.bam

First, logon to stampede and copy the file yeast_pairedend.bam to your $SCRATCH directory:

copy the yeast_pairedend.bam file from yesterday to a new directory "samtools"

ssh user@login8.stampede.tacc.utexas.edu
cd $SCRATCH/core_ngs
mkdir samtools
cd samtools
cp  $SCRATCH/core_ngs/alignment/yeast_pairedend.bam .

Samtools flags and mapping rate: calculating the proportion of mapped reads in an aligned bam file

We have a sorted, indexed BAM file. Now we can use other samtools functionality to filter this file and count mapped vs unmapped reads in a given region. samtools allows you to sort based on certain flags that are specified on page 4 on the sam format specification. We'll focus on a couple, below.

Here are three of the most useful flags to sort on. We'll be using the unmapped flag.

a few samtools flags

# SAM specifications common flag usage
0x04 = unmapped
0x02 = part of a properly aligned pair
0x400 = optical duplicate  # look at samtools rmdup if you need to remove these sequences

Exercise 2: count unmapped reads vs total reads on chromosome III for the yeast_pairedend_sort.bam file you created above. What proportion of the reads are mapped?

Exercise 2 solution

I've put my output for each line in the comment area.

solution

module load samtools                                     # if needed
samtools view -c yeast_pairedend_sort.bam chrIII         # total number of reads on this chromosome: 15503
samtools view -c -F 0X04 yeast_pairedend_sort.bam chrIII # reads on this chromosome which are unmapped: 13973

So the total proportion of reads that were unmapped on chromosome III is 13973/15503 or 90.1%, which is really high! Only ~10% of reads on this chromosome were able to be mapped to the genome.

Filtering bam files based on mapped status and mapping quality using samtools view

Mapping qualities are a measure of how likely a given sequence alignment to a location is correct. The lowest score is a mapping quality of zero, or mq0 for short. The reads map to multiple places on the genome, and we can't be sure of where the reads originated. To improve the quality of our data, we can remove these low quality reads from our sorted and indexed file.

Exercise 3: Remove unmapped and low quality reads from your bam file.

Exercise 3 solution