...

Expand
titleMake sure you're in an idev session


Code Block
languagebash
titleStart an idev session
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Thu
# or
idev -m 90 -N 1 -A OTH21164 -p development



Code Block
languagebash
# If not already loaded
module load biocontainers  # takes a while

module load samtools
samtools

...

In this exercise, we will explore five utilities provided by samtools: view, sort, index, flagstat, and idxstats. Each of these is executed in one line for a given SAM/BAM file. In the SAMtools/BEDtools sections tomorrow we will explore samtools capabilities in more depth.

Warning
titleKnow your samtools version!

There are two main "eras" of SAMtools development:

  • "original" samtools
    • v 0.1.19 is the last stable version
  • "modern" samtools
    • v 1.0, 1.1, 1.2 – avoid these (very buggy!)
    • v 1.3+ – finally stable!

Unfortunately, some functions with the same name in both version eras have different options and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing).

TACC BioContainers also offers the original samtools version: samtools/ctr-0.1.19--3.
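
To make the era check explicit, a small shell helper can classify whatever version is loaded. This is only a sketch: the function name samtools_era is hypothetical (it is not part of samtools), and it assumes the usage listing contains a line beginning with "Version:" followed by a string such as 0.1.19 or 1.9.

```bash
# Hypothetical helper (not part of samtools): classify a version string
# into the "eras" described above.
samtools_era() {
    local ver="$1"
    if [ -z "$ver" ]; then
        echo "unknown"
        return 1
    fi
    local major="${ver%%.*}"          # text before the first dot
    local rest="${ver#*.}"            # text after the first dot
    local minor="${rest%%[!0-9]*}"    # leading digits of the remainder
    if [ "$major" = "0" ]; then
        echo "original"
    elif [ "$major" = "1" ] && [ "${minor:-0}" -lt 3 ]; then
        echo "modern (1.0-1.2, avoid)"
    else
        echo "modern (stable)"
    fi
}

# Only query samtools if it is actually on the PATH.
if command -v samtools >/dev/null 2>&1; then
    # The usage listing goes to stderr, so merge it into stdout first.
    ver=$(samtools 2>&1 | awk '/^Version:/ {print $2; exit}')
    echo "samtools $ver: $(samtools_era "$ver")"
fi
```

Calling samtools_era 0.1.19 prints "original", while samtools_era 1.2 prints "modern (1.0-1.2, avoid)".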

...

  • the -b option tells the tool to output BAM format

The BAM file is a binary file, not a text file, so how do you look at its contents now? That's simple. Just use samtools view without the -b option. Remember to pipe output to a pager!

...

Exercise: What samtools view option will include the header records in its output? Which option would show only the header records?

Expand
titleHint

samtools view | less

then search for "header" ( /header )


Expand
titleAnswer

samtools view -h shows header records along with alignment records.

samtools view -H shows header records only.
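
The distinction is easy to see in SAM text, where every header record starts with "@". The sketch below builds a tiny SAM file from made-up records (the read names, positions, and sequences are purely illustrative) and counts the two kinds of lines with grep, which mirrors what -H and the default output would each contain.

```bash
# Toy SAM text (made-up records, for illustration only).
sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n'                           >  "$sam"
printf '@SQ\tSN:chrI\tLN:230218\n'                            >> "$sam"
printf 'read1\t0\tchrI\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'   >> "$sam"
printf 'read2\t16\tchrI\t250\t60\t4M\t*\t0\t0\tTTGA\tIIII\n'  >> "$sam"

header_lines=$(grep -c '^@' "$sam")    # lines "samtools view -H" would print
record_lines=$(grep -vc '^@' "$sam")   # lines plain "samtools view" would print
echo "header: $header_lines, records: $record_lines"
rm -f "$sam"
```

With -h, the output would contain both groups: the header lines first, then the alignment records.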

...

Looking at some of the alignment record information (e.g. samtools view yeast_pe.bam | cut -f 1-4 | more), you will notice that read names appear in adjacent pairs (for the R1 and R2 reads), in the same order they appeared in the original FASTQ file. Since that means the corresponding mappings are in no particular order, searching through the file is very inefficient. samtools sort re-orders entries in the SAM file either by locus (contig name + coordinate position) or by read name.
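
To get a feel for what ordering by locus means, the following sketch sorts a few made-up records (read name, contig, position) with the standard Unix sort command. It is only an analogy, not how samtools sort is implemented: in particular, samtools orders contigs as they are listed in the BAM header rather than alphabetically.

```bash
# Toy records: read name, contig, position (made up for illustration).
# As in a fresh aligner output, the two reads of each pair are adjacent.
records=$(printf '%s\n' \
    'readA chrII 500' \
    'readA chrII 650' \
    'readB chrI 120' \
    'readB chrI 300' \
    'readC chrI 50' \
    'readC chrI 210')

# Order by contig name, then numerically by position.
by_locus=$(printf '%s\n' "$records" | LC_ALL=C sort -t ' ' -k2,2 -k3,3n)
echo "$by_locus"
```

After sorting, reads from different pairs are interleaved (readC, readB, readC, readB on chrI), but all records for a given region now sit next to each other. On a real file with samtools 1.3+, the corresponding command would be along the lines of samtools sort -o yeast_pe.sort.bam yeast_pe.bam, with -n added to sort by read name instead.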

...

Expand
titleHint


Code Block
languagebash
ls -lh yeast_pe.*



Expand
titleAnswer

The yeast_pe.sam text file is the largest at ~348 MB because it is an uncompressed text file.

The name-ordered binary yeast_pe.bam file is only about 1/3 that size, ~111 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file.

The coordinate-ordered binary yeast_pe.sort.bam file is even slightly smaller, ~92 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in an even more compact form. You can read more about this in section 4 of the SAM format specification.
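
Because BAM is gzip-compatible, ordinary gzip can decompress it, and the decompressed stream begins with the characters "BAM". The sketch below uses that to tell BAM files from plain text; the helper name looks_like_bam is hypothetical, and the loop assumes the yeast_pe.* files from this exercise are in the current directory (missing files are skipped).

```bash
# Hypothetical helper: succeed if a file gunzips to data starting with "BAM".
looks_like_bam() {
    [ "$(gzip -dc "$1" 2>/dev/null | head -c 3)" = "BAM" ]
}

for f in yeast_pe.bam yeast_pe.sort.bam yeast_pe.sam; do
    [ -e "$f" ] || continue            # skip files that are not present
    if looks_like_bam "$f"; then
        echo "$f: gzip-compatible, BAM magic found"
    else
        echo "$f: not a BAM file"
    fi
done
```

The uncompressed yeast_pe.sam text file fails the check because gzip cannot decompress it at all.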

...