Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

We will execute these commands directly (not in a batch job), but since they are fairly large files we will first set up an interactive development (idev) session, which will give us a compute node for a couple of 3 hours:

Code Block
languagebash
titleStart an idev session
idev -p normal -m 120180 -N 1 -n 24 -A UT-2015-05-18 --reservation=CCBB

...

Expand
titleHint

This looks for the pattern  '^HWI' which is the start of every read name (which starts every alignment record).
Remember -c says just count the records, don't display them.

Code Block
languagebash
grep -P -c '^HWI' yeast_pairedend.sam 

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all the header lines that start with @.

Code Block
languagebash
grep -P -v -c '^@' yeast_pairedend.sam


...

Warning
titleKnow your samtools version!

There are two main "eras" of SAMtools development:

  • "original" samtools
    • v 0.1.19 is the last stable version
  • "new" samtools
    • v 1.0, 1.1, 1.2 – avoid these (very buggy!)
    • v 1.3+ – finally stable!

Unfortunately, some functions with the same name in both version eras have different argumentoptions and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing).

The default version in the ls5 module system is 1.3.1, but the BioITeam has a copy of the version 0.1.19 samtools for programs that might need it: /work/projects/BioITeam/ls5/bin/samtools-0.1.19.

...

  • The -O options says the output format should be BAM
  • The -T options gives a prefix for temporary files produced during sorting
    • sorting large BAMs will produce many temporary files during processing
  • By default sort writes its output to standard output, so we use > to redirect to a file named yeast_pairedend.sort.bam

Exercise: Compare the file sizes of the yeast_pariedend .sam, .bam, and .sort.bam

...

files and explain why they are different.

Expand
titleHint

ls -lh yeast_pairedend*

Expand
titleAnswer

The yeast_pairedend.sam text file is the largest at ~348 MB.

The name-ordered binary yeast_pairedend.bam text file only about 1/3 that size, ~110 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file.

The coordinate-ordered binary yeast_pairedend.sort.bam file is even slightly smaller, ~91 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in a very compact. You can read more about this in section 4 of https://samtools.github.io/hts-specs/SAMv1.pdf.

x

samtools

...

index

Many tools (like the UCSC Genome Browser) only need to use sub-sections of the BAM file at a given point in time. For example, if you are viewing alignments that are within a particular gene alignments on other chromosomes generally do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow other programs to navigate directly to the alignments of interest. This is especially important when you have many alignments.

...