Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

We have used several alignment methods that all generate results in the form of the near-universal SAM/BAM file format.  The SAMtools program is an ubiquitously used set of tools that allow a user to manipulate SAM/BAM files in many different ways, ranging from simple tasks (like SAM/BAM interconversion) to more complex functions (like removal of PCR duplicates).  It is available in the TACC module system in the typical fashion.  In

In this exercise, we will use five very simple utilities provided by samtools, each of which takes only one line to fully run: view, sort, index, flagstat, and idxstats.  Each Each of these is executed in one line for a given SAM/BAM file.  In In the SAMtools/BEDtools section , tomorrow we will explore samtools much in more in depth.

For the sake of time and simplicity, here we are only going to run these commands on the yeast paired-end alignment file.  However, the exact The same commands can be run on the other files by changing the names, so feel free to try executing the same commands them on other SAM files.  IndeedIndeed, it is very common in practice to use bash loops to generate many commands for a large set of alignments and deposit those commands into a batch job cmds file for submission.

To start, we will move to the directory containing our SAM files, among other things, and load up samtools using the module system.  After After loading it, just run the samtools command to see what the available tools are (and to see what the syntax of an actual command is).

Code Block
languagebash
cd $SCRATCH/core_ngs/alignment
ls -la
module load samtools
samtools

...

NOTE: The most recent edition of SAMtools is 1.2, which has some important differences from the last version, 0.1.19.  Everything for this section is the same between the two versions, but if you see code from other sources using samtools, the version difference may be important.

Samtools

...

view

The utility samtools view provides a way of converting SAM (text format) files to BAM (binary, compressed) files directly. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view without any other arguments. You should see

...

Look at the SAM file briefly using less.  You will notice, if you scroll down, that the alignments are in no particular order, with chromosomes and start positions all mixed up.  This makes searching through the file very inefficient.  Thus, samtools sort is a piece of samtools that provides the ability to re-order entries in the SAM file either by coordinate or by read name.  If you execute samtools sort without any options, you will see its help page as follows:

Code Block
Usage:   samtools sortview [options] <in.bam>|<in.sam>|<in.]cram> [inregion ...bam]
Options:
 -b -l INT     Setoutput compressionBAM
 level, from 0 (uncompressed) to 9 (best)
  -mC    INT   output CRAM Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n(requires -T)
         -1       use fast SortBAM bycompression read name(implies -b)
  -o    FILE   -u Write final output to FILE rather thanuncompressed standardBAM output
 (implies -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')b)
        (either -Oh or
  -T PREFIX  Write temporaryinclude filesheader to PREFIX.nnnn.bam in SAM output
      -T is required)
  -@H INT     Set numberprint ofSAM sortingheader andonly compression threads [1]
Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
  -f(no alignments)
         -c       print only Usethe <out.prefix>count as full final filename rather than prefix
of matching records
         -o FILE  output file name [stdout]
         -U WriteFILE final output reads not selected by filters to stdout rather than <out.prefix>.bam
  -l,m,n,@   Similar to corresponding options above

In most cases, you would want to sort the file by position rather than by name.  The required options are -O and -T, with -o being useful to specify the output file (either -o or '>' can be used to direct the output to the proper location).  So, to sort the paired-end yeast SAM file by coordinate, and get a SAM file named cholera_rnaseq.sort.sam out as output, we execute the following command, and use less to evaluate the output:

Code Block
samtools sort -O sam -T yeast_pairedend -o yeast_pairedend.sort.sam yeast_pairedend.sam
less yeast_pairedend.sort.sam

Samtools view

You may have noticed in the last help page that samtools sort can specify a BAM file as input or output, which is the smaller, binary form of a SAM file.  This is a viable option if the file needs sorting - however, in many cases you may just want to compress a SAM file by conversion to BAM without any modifications.  The utility samtools view provides a way of converting SAM files to BAM files directly.  It also provides many, many other functions which we will discuss in the next section.  To get a preview, execute samtools view without any other arguments.  You will see:

Code Block
Usage:   samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Options: -b       output BAM FILE [null]
         -t FILE  FILE listing reference names and lengths (see long help) [null]
         -T FILE  reference sequence FASTA FILE [null]
         -L FILE  only include reads overlapping this BED FILE [null]
         -r STR   only include reads in read group STR [null]
         -R FILE  only include reads with read group listed in FILE [null]
         -q INT   only include reads with mapping quality >= INT [0]
         -Cl STR   only include reads outputin CRAMlibrary (requires -T)STR [null]
         -1m INT   only include reads with usenumber of fastCIGAR BAMoperations
 compression (implies -b)
         -u      consuming uncompressedquery BAMsequence output>= (implies -b)INT [0]
         -h       include headerf INT   only include reads with all bits set in INT set in SAMFLAG output[0]
         -HF INT   only include reads print SAM header only (no alignments)with none of the bits set in INT
         -c       print only theset countin of matching recordsFLAG [0]
         -ox FILESTR  output fileread name [stdout]
   tag to strip (repeatable) [null]
         -UB FILE  output reads not selected bycollapse filtersthe tobackward FILECIGAR [null]operation
         -ts FILEFLOAT integer FILEpart listingsets referenceseed namesof andrandom lengthsnumber (see long help) [null]
generator [0];
                  -Trest FILEsets fraction referenceof sequencetemplates FASTAto FILEsubsample [nullno subsampling]
         -L@ FILEINT  only includenumber readsof overlappingBAM thiscompression BED FILEthreads [null0]
         -r?    STR   onlyprint includelong readshelp, inincluding readnote groupabout STRregion [null]specification
         -RS FILE  only include  reads withignored read(input groupformat listed in FILE [null]
         -q INT   only include reads with mapping quality >= INT [0]
         -l STR   only include reads in library STR [null]
         -m INT   only include reads with number of CIGAR operations
                  consuming query sequence >= INT [0]
         -f INT   only include reads with all bits set in INT set in FLAG [0]
         -F INT   only include reads with none of the bits set in INT
                  set in FLAG [0]
         -x STR   read tag to strip (repeatable) [null]
         -B       collapse the backward CIGAR operation
         -s FLOAT integer part sets seed of random number generator [0];
                  rest sets fraction of templates to subsample [no subsampling]
         -@ INT   number of BAM compression threads [0]
         -?       print long help, including note about region specification
         -S       ignored (input format is auto-detected)

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file.  The input format is auto-detected, so we don't need to say that we're inputing a SAM instead of a BAM.  We just need to tell the tool to output the file in BAM format, and provide the name of the destination BAM file.  This command is as follows:

Code Block
samtools view -b yeast_pairedend.sort.sam -o yeast_pairedend.bam

...

is auto-detected)

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to say that we're inputing a SAM instead of a BAM. We just need to tell the tool to output the file in BAM format, and provide the name of the destination BAM file. This command is as follows:

Code Block
languagebash
samtools view -b yeast_pairedend.sort.sam -o yeast_pairedend.bam

Above, the -b option tells the tool to output BAM, and the -o option specifies the name of the BAM file that will be created.

Tip
titleConverting BAM back to SAM

To convert back from BAM to SAM to read some alignments, you simply remove the -b option and adjust the file names accordingly.

Samtools sort

Look at the SAM file briefly using less. You will notice, if you scroll down, that the alignments are in no particular order, with chromosomes and start positions all mixed up. This makes searching through the file very inefficient. Thus samtools sort is a piece of samtools that provides the ability to re-order entries in the SAM file either by coordinate position or by read name. If you execute samtools sort without any options, you see its help page:

Code Block
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')   (either -O or
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam       -T is required)
  -@ INT     Set number of sorting and compression threads [1]
Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
  -f         Use <out.prefix> as full final filename rather than prefix
  -o         Write final output to stdout rather than <out.prefix>.bam
  -l,m,n,@   Similar to corresponding options above

In most cases you will be sorting a BAM file by position rather than by name. You can use either -o or reidrection with > to control the output.

To sort the paired-end yeast BAM file by coordinate, and get a BAM file named cholera_rnaseq.sort.bam as output, execute the following command:

Code Block
languagebash
samtools sort yeast_pairedend.sam > yeast_pairedend.sort.bam

How do you look at the BAM file contents now? That's simple. Just use samtools view. (Remember to pipe output to a pager!)

Code Block
languagebash
samtools view yeast_pairedend.sort.bam | more

Samtools index

Many tools (like the UCSC Genome Browser) only need to use sub-sections of the BAM file at a given point in time.  For For example, if you are viewing all alignments that are within a particular gene , than the alignments on other chromosomes generally do not need to be loaded.  In In order to speed up access, BAI BAM files are BAM index files that indexed, producing BAI files which allow other programs to navigate directly to the alignments of interest.  This This is especially important when you have many alignments.  The

The utility samtools index directly creates an index that has the exact name as the input BAM file, with 'suffix .bai' appended.  The The help page, if you execute samtools index with no arguments, is as follows:

...

Here, the syntax is way, way easier.  We want a BAI-format index (CSI-format is used with extremely long contigs, which we aren't considering here - the most common use case are highly polyploid plant genomes), which is the default, so all we have to type is:

Code Block
languagebash
samtools index yeast_pairedend.bam

This will produce a file named yeast_pairedend.bam.bai.

Most  Most of the time when an index is required, it will be automatically located provided as long as it is in the same directory as the its BAM file that it was produced from, and shares the same name up until the 'the .bai'' extension.

Samtools idxstats

...

Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome.  The The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:

...