Overview
As we have seen, the SAMTools suite allows you to manipulate the SAM/BAM files produced by most aligners. There are many sub-commands in this suite, but the most common and useful use cases are:
- Convert text-format SAM files into binary BAM files (samtools view) and vice versa
- Sort BAM files by reference coordinates (samtools sort)
- Index BAM files that have been sorted (samtools index)
- Filter alignment records based on BAM flags, mapping quality or location
Since BAM files are binary, they can't be viewed directly using standard Unix file viewers such as more, less and head. We have seen how samtools view can be used to binary-format BAM files into text format for viewing. But samtools view also has options that let you do powerul filtering of the output. We focus on this filtering capability in this set of exercises.
Setup
Login to login5.ls5.tacc.utexas.edu and set up a directory for these exercises and copy an indexed BAM file there. This is a renamed version of the yeast_pairedend.sort.bam file from our Alignment workflow exercises.
mkdir -p $SCRATCH/core_ngs/samtools cd $SCRATCH/core_ngs/samtools cp $CORENGS/catchup/for_samtools/* .
References
Handy links
- The SAM format specification
- especially section 1.4 - alignment section fields
- Manual for SAMTools 1.3.1
- especially the 1st section on samtools view.
SAM header fields
The 11 SAM alignment record required fields (Tab-delimited).
SAM flags field
Meaning of each bit (flag) in the SAM alignment records flags field (column 2). The most commonly querried flags are denoted in red.
SAM CIGAR string
Format of the CIGAR string in column 6 of SAM alignment records.
Exercises
Analyzing the CIGAR string for indels
Suppose we want to know how many alignments included insertions or deletions (indels) versus the reference. Looking at the CIGAR field definition table above, we see that insertions are denoted with the character I and deletions with the character D. We'll use this information, along with samtools view, cut and grep, to count the number of indels.
We'll do this first exercise one step at a time to get comfortable with piping commands. Start by just looking at the first few alignment records of the BAM file:
cd $SCRATCH/core_ngs/alignment/samtools module load samtools samtools view yeast_pe.sort.bam | head
With all the line wrapping, it looks pretty ugly. Still, you can pick out the CIGAR string in colum 6. Let's select just that column with cut:
samtools view yeast_pe.sort.bam | cut -f 6 | head
Next, make sure we're only looking at alignment records that represent mapped reads. The -F 0x4 option says to filter records where the 0x4 flag (read unmapped) is 0, resulting it only mapped reads being output.
samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | head
Now we use grep with the pattern '[ID]' to select lines (which are now just CIGAR strings) that have Insert or Deletion operators. The brackets ( [ ] ) denote a character class pattern, which matches any of the characters inside the brackets. Be sure to ask for Perl regular expressions (-P) so that this character class syntax is recognized.
samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | head
Ok, this looks good. We're ready to run this against all the alignment records, and count the results:
samtools view -F 0x4 yeast_pe.sort.bam | cut -f 6 | grep -P '[ID]' | wc -l
There are 6697 such records.
What is that in terms of the rate of indels? For that we need to count the total number of mapped reads. Here we can just use the -c (count only) option to samtools view.
samtools view -F 0x4 -c yeast_pe.sort.bam
x
x
x
Filtering for only mapped reads (samtools view -F 0x4)
The code block below details some basic samtools functionality:
samtools view -o outfile_view.bam infile.bam # use the -c option to just count alignment records samtools sort infile.bam outfile.sorted.bam samtools index aln.sorted.bam
First, logon to stampede and copy the file yeast_pairedend.bam to your $SCRATCH directory:
ssh user@login8.stampede.tacc.utexas.edu cd $SCRATCH/core_ngs mkdir samtools cd samtools cp $SCRATCH/core_ngs/alignment/yeast_pairedend.bam .
Samtools flags and mapping rate: calculating the proportion of mapped reads in an aligned bam file
We have a sorted, indexed BAM file. Now we can use other samtools functionality to filter this file and count mapped vs unmapped reads in a given region. samtools allows you to sort based on certain flags that are specified on page 4 on the sam format specification. We'll focus on a couple, below.
Here are three of the most useful flags to sort on. We'll be using the unmapped flag.
# SAM specifications common flag usage 0x04 = unmapped 0x02 = part of a properly aligned pair 0x400 = optical duplicate # look at samtools rmdup if you need to remove these sequences
Exercise 2: count unmapped reads vs total reads on chromosome III for the yeast_pairedend_sort.bam file you created above. What proportion of the reads are mapped?
Filtering bam files based on mapped status and mapping quality using samtools view
Mapping qualities are a measure of how likely a given sequence alignment to a location is correct. The lowest score is a mapping quality of zero, or mq0 for short. The reads map to multiple places on the genome, and we can't be sure of where the reads originated. To improve the quality of our data, we can remove these low quality reads from our sorted and indexed file.
Exercise 3: Remove unmapped and low quality reads from your bam file.