Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Tip
titleSAMtools version differences

Be sure to check what version of samtools you are using!

The most recent edition of SAMtools is 1.2, which has some important differences from the last version, 0.1.19.  Everything   Most commands for this section is are the same between the two versions, but if you see code from other sources using samtools, the version difference differences may be important.

Samtools view

The utility samtools view provides a way of converting SAM (text format) files to BAM (binary, compressed) files directly. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view without any other arguments. You should see:

...

Code Block
languagebash
samtools view -b yeast_pairedend.sam -o yeast_pairedend.bam

...

  • the -b option tells the tool to output BAM

...

  • format
  • the -o option specifies the name of the output BAM file that will be created

...

How do you look at the BAM file contents now? That's simple. Just use samtools view. without the -b option. Remember to pipe output to a pager!

Code Block
languagebash
samtools view yeast_pairedend.bam | more
Tip
titleConverting BAM back to SAM
To convert back from BAM to SAM to read some alignments, you simply remove the -b option and adjust the file names accordingly.

Samtools sort

Look at the SAM file briefly using less. You will notice, if you scroll down, that the alignments are in no particular order, with chromosomes and start positions all mixed up. This makes searching through the file very inefficient. Thus samtools sort is a piece of samtools that provides the ability to re-order entries in the SAM file either by coordinate position or by read name.

If you execute samtools sort without any options, you see its help page:

...

To sort the paired-end yeast BAM file by coordinate, and get a BAM file named cholerayeast_rnaseqpairedend.sort.bam as output, execute the following command:

Code Block
languagebash
samtools sort -O bam -T yeast_pairedend.sort yeast_pairedend.sambam > yeast_pairedend.sort.bam

How do you look at the BAM file contents now? That's simple. Just use samtools view. (Remember to pipe output to a pager!)

...

languagebash

...

  • The -O options says the output format should be BAM
  • The -T options gives a prefix for temporary files produced during sorting
  • By default sort writes its output to standard output, so we use > to redirect to a file named yeast_pairedend.sort.bam

...

Samtools index

Many tools (like the UCSC Genome Browser) only need to use sub-sections of the BAM file at a given point in time. For example, if you are viewing all alignments that are within a particular gene the alignments on other chromosomes generally do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow other programs to navigate directly to the alignments of interest. This is especially important when you have many alignments.

...

Here, the syntax is way, way easier.  We We want a BAI-format index which is the default. (CSI-format is used with extremely long contigs, which we aren't considering here - the most common use case are highly polyploid plant genomes), which is the default, so .

So all we have to type is:

Code Block
languagebash
samtools index yeast_pairedend.sort.bam

This will produce a file named yeast_pairedend.bam.bai.

Most of the time when an index is required, it will be automatically located as long as it is in the same directory as its BAM file and shares the same name up until the .bai extension.

Samtools idxstats

Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:

Code Block
languagebash
samtools idxstats yeast_pairedend.sort.bam

The output is a text file with four tab-delimited columns with the following meanings:

...

Tip

If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results.

Samtools flagstat

Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome. The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:

Code Block
languagebash
samtools flagstat yeast_pairedend.bamsort.bam

You should see something like this:

 

Code Block
1184360 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
547664 + 0 mapped (46.24%:-nan%)
1184360 + 0 paired in sequencing
592180 + 0 read1
592180 + 0 read2
473114 + 0 properly paired (39.95%:-nan%)
482360 + 0 with itself and mate mapped
65304 + 0 singletons (5.51%:-nan%)
534 + 0 with mate mapped to a different chr
227 + 0 with mate mapped to a different chr (mapQ>=5)

Ignore the "+ 0" addition to each Ignore the "+ 0" addition to each line - that is a carry-over convention for counting QA-failed reads that is no longer necessary.

The most important statistic is arguably alignment the mapping rate, but this readout allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.