Page History

...

Prepare the vibCho reference index for bowtie2 from GenBank files using BioPerl
Align reads using bowtie2, producing a SAM file
Convert the SAM file to a BAM file (samtools view)
Sort the BAM file by genomic location (samtools sort)
Index the BAM file (samtools index)
Gather simple alignment statistics (samtools flagstat and samtools idxstat)

Converting GenBank

...

records into

...

sequence (FASTA

...

) and

...

annotation (GFF) files

...

As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations. The "Genbank" file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing. Go ahead and look at some of the contents of a Genbank file with the following commands:
...
For the sake of time and simplicity, here we are only going to run these commands on the yeast paired-end alignment file. However, the exact same commands can be run on the other files by changing the names, so feel free to try executing the same commands on other SAM files. Indeed, it is very common in practice to use bash loops to generate many commands for a large set of alignments and deposit those commands into a batch job cmds file for submission.

Samtools Sort

Samtools View

Samtools Index

Samtools Idxstats

Samtools Flagstat

To start, we will move to the directory containing our SAM files, among other things, and load up samtools using the module system. After loading it, just run the samtools command to see what the available tools are (and to see what the syntax of an actual command is).

Code Block
cd $SCRATCH/core_ngs/alignment ls -la module load samtools samtools

You will see the following screen after running samtools with no other options:

Code Block

Program: samtools (Tools for alignments in the SAM format)
Version: 1.2 (using htslib 1.2.1)
Usage:   samtools <command> [options]
Commands:
  -- indexing
         faidx       index/extract FASTA
         index       index alignment
  -- editing
         calmd       recalculate MD/NM tags and '=' bases
         fixmate     fix mate information
         reheader    replace BAM header
         rmdup       remove PCR duplicates
         targetcut   cut fosmid regions (for fosmid pool only)
  -- file operations
         bamshuf     shuffle and group alignments by name
         cat         concatenate BAMs
         merge       merge sorted alignments
         mpileup     multi-way pileup
         sort        sort alignment file
         split       splits a file by read group
         bam2fq      converts a BAM to a FASTQ
  -- stats
         bedcov      read depth per BED region
         depth       compute the depth
         flagstat    simple stats
         idxstats    BAM index stats
         phase       phase heterozygotes
         stats       generate stats (former bamcheck)
  -- viewing
         flags       explain BAM flags
         tview       text alignment viewer
         view        SAM<->BAM<->CRAM conversion

NOTE: The most recent edition of SAMtools is 1.2, which has some important differences from the last version, 0.1.19. Everything for this section is the same between the two versions, but if you see code from other sources using samtools, the version difference may be important.

Samtools Sort

Look at the SAM file briefly using less. You will notice, if you scroll down, that the alignments are in no particular order, with chromosomes and start positions all mixed up. This makes searching through the file very inefficient. Thus, samtools sort is a piece of samtools that provides the ability to re-order entries in the SAM file either by coordinate or by read name. If you execute samtools sort without any options, you will see its help page as follows:

Code Block

Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -O FORMAT  Write output as FORMAT ('sam'/'bam'/'cram')   (either -O or
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam       -T is required)
  -@ INT     Set number of sorting and compression threads [1]
Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
  -f         Use <out.prefix> as full final filename rather than prefix
  -o         Write final output to stdout rather than <out.prefix>.bam
  -l,m,n,@   Similar to corresponding options above

In most cases, you would want to sort the file by position rather than by name. The required options are -O and -T, with -o being useful to specify the output file (either -o or '>' can be used to direct the output to the proper location). So, to sort the paired-end yeast SAM file by coordinate, and get a SAM file named cholera_rnaseq.sort.sam out as output, we execute the following command, and use less to evaluate the output:

Code Block
samtools sort -O sam -T yeast_pairedend -o yeast_pairedend.sort.sam yeast_pairedend.sam less yeast_pairedend.sort.sam

Samtools View

You may have noticed in the last help page that samtools sort can specify a BAM file as input or output, which is the smaller, binary form of a SAM file. This is a viable option if the file needs sorting - however, in many cases you may just want to compress a SAM file by conversion to BAM without any modifications. The utility samtools view provides a way of converting SAM files to BAM files directly. It also provides many, many other functions which we will discuss in the next section. To get a preview, execute samtools view without any other arguments. You will see:

Code Block

Usage:   samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Options: -b       output BAM
         -C       output CRAM (requires -T)
         -1       use fast BAM compression (implies -b)
         -u       uncompressed BAM output (implies -b)
         -h       include header in SAM output
         -H       print SAM header only (no alignments)
         -c       print only the count of matching records
         -o FILE  output file name [stdout]
         -U FILE  output reads not selected by filters to FILE [null]
         -t FILE  FILE listing reference names and lengths (see long help) [null]
         -T FILE  reference sequence FASTA FILE [null]
         -L FILE  only include reads overlapping this BED FILE [null]
         -r STR   only include reads in read group STR [null]
         -R FILE  only include reads with read group listed in FILE [null]
         -q INT   only include reads with mapping quality >= INT [0]
         -l STR   only include reads in library STR [null]
         -m INT   only include reads with number of CIGAR operations
                  consuming query sequence >= INT [0]
         -f INT   only include reads with all bits set in INT set in FLAG [0]
         -F INT   only include reads with none of the bits set in INT
                  set in FLAG [0]
         -x STR   read tag to strip (repeatable) [null]
         -B       collapse the backward CIGAR operation
         -s FLOAT integer part sets seed of random number generator [0];
                  rest sets fraction of templates to subsample [no subsampling]
         -@ INT   number of BAM compression threads [0]
         -?       print long help, including note about region specification
         -S       ignored (input format is auto-detected)

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to say that we're inputing a SAM instead of a BAM. We just need to tell the tool to output the file in BAM format, and provide the name of the destination BAM file. This command is as follows:

Code Block
samtools view -b yeast_pairedend.sort.sam -o yeast_pairedend.bam

Above, the -b option tells the tool to output BAM, and the -o option specifies the name of the BAM file that will be created. All the other options have their uses, but we will not discuss them right now. It is worth noting, however, that if you wanted to convert back from BAM to SAM to read some alignments, you would simply remove the -b option and adjust the file names accordingly.

Samtools Index

Many tools (like the UCSC Genome Browser) only need to use sub-sections of the BAM file at a given point in time. For example, if you are viewing all alignments that are within a particular gene, than the alignments on other chromosomes generally do not need to be loaded. In order to speed up access, BAI files are BAM index files that allow other programs to navigate directly to the alignments of interest. This is especially important when you have many alignments. The utility samtools index directly creates an index that has the exact name as the input file, with '.bai' appended. The help page, if you execute samtools index with no arguments, is as follows:

Code Block

Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]

Here, the syntax is way, way easier. We want a BAI-format index (CSI-format is used with extremely long contigs, which we aren't considering here - the most common use case are highly polyploid plant genomes), which is the default, so all we have to type is:

Code Block
samtools index yeast_pairedend.bam

This will produce a file named yeast_pairedend.bam.bai. Most of the time when an index is required, it will be automatically located provided it is in the same directory as the BAM file that it was produced from, and shares the same name up until the '.bai'' extension.

Samtools Idxstats

Now that we have a sorted, compressed, and indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it literally could not be simpler - just type the following command:

Code Block
samtools idxstats yeast_pairedend.bam

The output is a text file with four tab-delimited columns with the following meanings: (1) chromosome name, (2) chromosome length, (3) number of mapped reads, and (4) number of unmapped reads. The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its pair mapped to, but still flagged as unmapped.

Samtools Flagstat

Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome. The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:

Code Block
samtools flagstat yeast_pairedend.bam

Ignore the "+ 0" addition to each line - that is a carried-over convention that is no longer necessary. The most important statistic is arguably alignment rate, but this readout allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.

Page tree

Versions Compared

Old Version 81

New Version 82

Key

Converting GenBank

records into

sequence (FASTA

) and

annotation (GFF) files

Samtools Sort

Samtools View

Samtools Index

Samtools Idxstats

Samtools Flagstat

Samtools Sort

Samtools Index

Samtools Flagstat