Be sure to check what version of samtools you are using!
The most recent edition of SAMtools is 1.2, which has some important differences from the last version, 0.1.19. Everything Most commands for this section is are the same between the two versions, but if you see code from other sources using samtools, the version difference differences may be important.
The utility samtools view provides a way of converting SAM (text format) files to BAM (binary, compressed) files directly. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view without any other arguments. You should see:
samtools view -b yeast_pairedend.sam -o yeast_pairedend.bam
- the -b option tells the tool to output BAM
- the -o option specifies the name of the output BAM file that will be created
|title||Converting BAM back to SAM|
How do you look at the BAM file contents now? That's simple. Just use samtools view. without the -b option. Remember to pipe output to a pager!
samtools view yeast_pairedend.bam | more
Look at the SAM file briefly using less. You will notice, if you scroll down, that the alignments are in no particular order, with chromosomes and start positions all mixed up. This makes searching through the file very inefficient. Thus samtools sort is a piece of samtools that provides the ability to re-order entries in the SAM file either by coordinate position or by read name.
If you execute samtools sort without any options, you see its help page:
To sort the paired-end yeast BAM file by coordinate, and get a BAM file named cholerayeast_rnaseqpairedend.sort.bam as output, execute the following command:
samtools sort -O bam -T yeast_pairedend.sort yeast_pairedend.sambam > yeast_pairedend.sort.bam
How do you look at the BAM file contents now? That's simple. Just use samtools view. (Remember to pipe output to a pager!)
- The -O options says the output format should be BAM
- The -T options gives a prefix for temporary files produced during sorting
- By default sort writes its output to standard output, so we use > to redirect to a file named yeast_pairedend.sort.bam
Many tools (like the UCSC Genome Browser) only need to use sub-sections of the BAM file at a given point in time. For example, if you are viewing all alignments that are within a particular gene the alignments on other chromosomes generally do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow other programs to navigate directly to the alignments of interest. This is especially important when you have many alignments.
Here, the syntax is way, way easier. We We want a BAI-format index which is the default. (CSI-format is used with extremely long contigs, which we aren't considering here - the most common use case are highly polyploid plant genomes), which is the default, so .
So all we have to type is:
samtools index yeast_pairedend.sort.bam
This will produce a file named yeast_pairedend.bam.bai.
Most of the time when an index is required, it will be automatically located as long as it is in the same directory as its BAM file and shares the same name up until the .bai extension.
Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:
samtools idxstats yeast_pairedend.sort.bam
The output is a text file with four tab-delimited columns with the following meanings:
If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results.
Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome. The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:
samtools flagstat yeast_pairedend.bam
You should see something like this:
1184360 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 547664 + 0 mapped (46.24%:-nan%) 1184360 + 0 paired in sequencing 592180 + 0 read1 592180 + 0 read2 473114 + 0 properly paired (39.95%:-nan%) 482360 + 0 with itself and mate mapped 65304 + 0 singletons (5.51%:-nan%) 534 + 0 with mate mapped to a different chr 227 + 0 with mate mapped to a different chr (mapQ>=5)
Ignore the "+ 0" addition to each line - that is a carry-over convention for counting QA-failed reads that is no longer necessary.
The most important statistic is arguably alignment the mapping rate, but this readout allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.