Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Introduction to Samtools - manipulating and filtering bam files

As Nathan we showed you yesterday, the main type of output from aligning reads to a databases is a binary alignment file, or BAM file.   These files are compressed, so they can't be viewed using standard unix file viewers such as more, less and head. Samtools allows you to manipulate the .bam files - they can be converted into a non-binary format (SAM format specification here) and can also be ordered and sorted based on the quality of the alignment.   This is a good way to remove low quality reads, or make a bam BAM file restricted to a single chromosome.

We'll be focusing on just a few of samtools functions in this series of exercises:. Since most aligners produce a bam BAM file, we'll work on some basic manipulations of the bam BAM files we produced from our alignments yesterday. 

Most functionality while using bam BAM files can be described as such:

  1.     SAM files are converted into BAM files (samstools samtools view)
  2.     BAM files are sorted by reference coordinates (samtools sort)
  3.     Sorted BAM files are indexed (samtools index)
  4.     Sorted, indexed bam BAM files are filtered (based on location, flags, mapping quality (samtools view with filtering options)

Take a look here for a detailed manual page for each function in samtools.   These steps presume that you are using a mapper/aligners such as bwa, which records both mapped vs and unmapped reads - make sure you check how the aligner writes it's output to samSAM/bam BAM format, or you may get a strange surprise in your output aligned files!

The code block below details some basic samtools functionality:

Code Block
languagebash
titlebasic samtools functionality
samtools view -bH -o outfile_view.bam infile.bam  # #useuse the -c option to just count alignment alignmentsrecords
samtools sort infile.bam outfile.sorted.bam
samtools index aln.sorted.bam

First, logon to stampede and copy the file yeast_pairedend.bam to your scratch $SCRATCH directory:

Code Block
languagebash
titleget copy the yeast_pairedend.bam file from Nathan's scratchyesterday to a new directory "samtools"
ssh user@login8.stampede.tacc.utexas.edu
cdscd $SCRATCH/core_ngs
mkdir samtools
cd samtools
cp  /scratch/02423/nsabell/$SCRATCH/core_ngs/alignment/bam/yeast_pairedend.bam .

Sorting and Indexing a bam file: samtools index, sort

Now that we have a bam BAM file, we need to index it. All bam BAM files need an index, as the they tend to be large and the index allows us to perform computationally complex operations on these files without it taking days to complete.

...

Expand
titleExercise 1 solution
Code Block
languagebash
titlesolution
module load samtools 
samtools sort yeast_pairedend.bam yeast_pairedend_sort # will take 1-2 minutes
samtools index yeast_pairedend_sort.bam

Use ls -lah to see what files you made and how large they are.  This is what mine look like:

Code Block
languagebash
-rwxrwxr-x 1 awh394 G-801021 110M May 26 14:12 yeast_pairedend.bam
-rwxrwxrwx 1 awh394 G-801021  91M May 26 14:13 yeast_pairedend_sort.bam
-rwxrwxrwx 1 awh394 G-801021  20K May 26 14:14 yeast_pairedend_sort.bam.bai

Note how small the index file is!

Samtools flags and mapping rate: calculating the proportion of mapped reads in an aligned bam file

We have a sorted, indexed bam BAM file.  Now we can use other samtools functionality to filter this file and count mapped vs unmapped reads in a given region.  Samtools samtools allows you to sort based on certain flags that are specified on page 4 on the sam format specification.  We'll focus on a couple, below.

Image Added

Here Below are three of the most useful flags to sort on, we. We'll be using the unmapped flag.

Code Block
languagebash
titlea few samtools flags
#sam# SAM specifications and common flag usage (from p4)
0x04 = unmapped
0x02 = part of a properly aligned pair
0x400x400 = optical duplicate  #look# look at samtools rmdup if you need to remove these sequences.

Exercise 2: count unmapped reads vs total reads on chromosome III for the yeast_pairedend_sort.bam file you created above.  What proportion of the reads are mapped?

Expand
titleExercise 2 solution

I've put my output for each line in the comment area.

Code Block
languagebash
titlesolution
module load samtools                                     # if needed
samtools view -bhc yeast_pairedend_sort.bam chrIII   |   wc -l  # total number of reads on this chromosome: 15503
samtools view -bhc -F 0X04 yeast_pairedend_sort.bam chrIII | wc -l  # # reads on this chromosome which are unmapped: 13973

So the total proportion of reads that were unmapped on chromosome III is 13973/15503 or 90.1%, which is really high!  Only ~10% of reads on this chromosome were able to be mapped to the genome.

Filtering bam files based on mapped status and mapping quality using samtools view

Mapping qualities are a measure of how likely a given sequence alignment to a location is correct. The idxstats function will produce a summary of how many reads correspond with each mapping quality score. The lowest score is a mapping quality of zero, or mq0 for short. The reads map to multiple places on the genome, and we can't be sure of where the reads originated.   To improve the quality of our data, we can remove these low quality reads from our sorted and indexed file.

Exercise 3:  Remove unmapped and mq0 low quality reads from your bam file.

Expand
titleExercise 3 solution
Code Block
languagebash
titlesolution
module load samtools # if needed
samtools view -bHb -F 0X040x04 -q 1 -o yeast_pairedend_sort.mapped.q1.bam yeast_pairedend_sort.bam

Here is what my output looks like when I use ls -lah after running the above commands:

Code Block
languagebash
titlesamtools filtering output
-rwxrwxrwx 1 awh394 G-801021  42M May 26 14:26 yeast_pairedend_sort.mapped.q1.bam

Now that we have a bam BAM file with only high-quality mapped reads, we are ready to manipulate this file using bedtools.