...
We will execute these commands directly (not in a batch job), but since they are fairly large files we will first set up an interactive development (idev) session, which will give us a compute node for a couple of 3 hours:
Code Block | ||||
---|---|---|---|---|
| ||||
idev -p normal -m 120180 -N 1 -n 24 -A UT-2015-05-18 --reservation=CCBB |
...
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record).
Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all the header lines that start with @.
|
...
Warning | ||
---|---|---|
| ||
There are two main "eras" of SAMtools development:
Unfortunately, some functions with the same name in both version eras have different argumentoptions and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing). The default version in the ls5 module system is 1.3.1, but the BioITeam has a copy of the version 0.1.19 samtools for programs that might need it: /work/projects/BioITeam/ls5/bin/samtools-0.1.19. |
...
- The -O options says the output format should be BAM
- The -T options gives a prefix for temporary files produced during sorting
- sorting large BAMs will produce many temporary files during processing
- By default sort writes its output to standard output, so we use > to redirect to a file named yeast_pairedend.sort.bam
Exercise: Compare the file sizes of the yeast_pariedend .sam, .bam, and .sort.bam
...
files and explain why they are different.
Expand | ||
---|---|---|
| ||
ls -lh yeast_pairedend* |
Expand | ||
---|---|---|
| ||
The yeast_pairedend.sam text file is the largest at ~348 MB. The name-ordered binary yeast_pairedend.bam text file only about 1/3 that size, ~110 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file. The coordinate-ordered binary yeast_pairedend.sort.bam file is even slightly smaller, ~91 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in a very compact. You can read more about this in section 4 of https://samtools.github.io/hts-specs/SAMv1.pdf. |
x
samtools
...
index
Many tools (like the UCSC Genome Browser) only need to use sub-sections of the BAM file at a given point in time. For example, if you are viewing alignments that are within a particular gene alignments on other chromosomes generally do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow other programs to navigate directly to the alignments of interest. This is especially important when you have many alignments.
...