...
- Prepare the vibCho reference index for bowtie2 from GenBank files using BioPerl
- Align reads using bowtie2, producing a SAM file
- Convert the SAM file to a BAM file (samtools view)
- Sort the BAM file by genomic location (samtools sort)
- Index the BAM file (samtools index)
- Gather simple alignment statistics (samtools flagstat and samtools idxstat)
Converting GenBank
...
records into
...
sequence (FASTA
...
) and
...
annotation (GFF) files
...
As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations. The "Genbank" file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing. Go ahead and look at some of the contents of a Genbank file with the following commands:
...
For the sake of time and simplicity, here we are only going to run these commands on the yeast paired-end alignment file. However, the exact same commands can be run on the other files by changing the names, so feel free to try executing the same commands on other SAM files. Indeed, it is very common in practice to use bash loops to generate many commands for a large set of alignments and deposit those commands into a batch job cmds file for submission.
Samtools Sort
Samtools View
Samtools Index
Samtools Idxstats
Samtools Flagstat
To start, we will move to the directory containing our SAM files, among other things, and load up samtools using the module system. After loading it, just run the samtools command to see what the available tools are (and to see what the syntax of an actual command is).
Code Block |
---|
cd $SCRATCH/core_ngs/alignment
ls -la
module load samtools
samtools |
You will see the following screen after running samtools with no other options:
Code Block |
---|
Program: samtools (Tools for alignments in the SAM format)
Version: 1.2 (using htslib 1.2.1)
Usage: samtools <command> [options]
Commands:
-- indexing
faidx index/extract FASTA
index index alignment
-- editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
-- file operations
bamshuf shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
bam2fq converts a BAM to a FASTQ
-- stats
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion |
NOTE: The most recent edition of SAMtools is 1.2, which has some important differences from the last version, 0.1.19. Everything for this section is the same between the two versions, but if you see code from other sources using samtools, the version difference may be important.
Samtools Sort
Look at the SAM file briefly using less. You will notice, if you scroll down, that the alignments are in no particular order, with chromosomes and start positions all mixed up. This makes searching through the file very inefficient. Thus, samtools sort is a piece of samtools that provides the ability to re-order entries in the SAM file either by coordinate or by read name. If you execute samtools sort without any options, you will see its help page as follows:
Code Block |
---|
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-o FILE Write final output to FILE rather than standard output
-O FORMAT Write output as FORMAT ('sam'/'bam'/'cram') (either -O or
-T PREFIX Write temporary files to PREFIX.nnnn.bam -T is required)
-@ INT Set number of sorting and compression threads [1]
Legacy usage: samtools sort [options...] <in.bam> <out.prefix>
Options:
-f Use <out.prefix> as full final filename rather than prefix
-o Write final output to stdout rather than <out.prefix>.bam
-l,m,n,@ Similar to corresponding options above |
In most cases, you would want to sort the file by position rather than by name. The required options are -O and -T, with -o being useful to specify the output file (either -o or '>' can be used to direct the output to the proper location). So, to sort the paired-end yeast SAM file by coordinate, and get a SAM file named cholera_rnaseq.sort.sam out as output, we execute the following command, and use less to evaluate the output:
Code Block |
---|
samtools sort -O sam -T yeast_pairedend -o yeast_pairedend.sort.sam yeast_pairedend.sam
less yeast_pairedend.sort.sam |
Samtools View
You may have noticed in the last help page that samtools sort can specify a BAM file as input or output, which is the smaller, binary form of a SAM file. This is a viable option if the file needs sorting - however, in many cases you may just want to compress a SAM file by conversion to BAM without any modifications. The utility samtools view provides a way of converting SAM files to BAM files directly. It also provides many, many other functions which we will discuss in the next section. To get a preview, execute samtools view without any other arguments. You will see:
Code Block |
---|
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Options: -b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-T FILE reference sequence FASTA FILE [null]
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-q INT only include reads with mapping quality >= INT [0]
-l STR only include reads in library STR [null]
-m INT only include reads with number of CIGAR operations
consuming query sequence >= INT [0]
-f INT only include reads with all bits set in INT set in FLAG [0]
-F INT only include reads with none of the bits set in INT
set in FLAG [0]
-x STR read tag to strip (repeatable) [null]
-B collapse the backward CIGAR operation
-s FLOAT integer part sets seed of random number generator [0];
rest sets fraction of templates to subsample [no subsampling]
-@ INT number of BAM compression threads [0]
-? print long help, including note about region specification
-S ignored (input format is auto-detected) |
That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to say that we're inputing a SAM instead of a BAM. We just need to tell the tool to output the file in BAM format, and provide the name of the destination BAM file. This command is as follows:
Code Block |
---|
samtools view -b yeast_pairedend.sort.sam -o yeast_pairedend.bam |
Above, the -b option tells the tool to output BAM, and the -o option specifies the name of the BAM file that will be created. All the other options have their uses, but we will not discuss them right now. It is worth noting, however, that if you wanted to convert back from BAM to SAM to read some alignments, you would simply remove the -b option and adjust the file names accordingly.
Samtools Index
Many tools (like the UCSC Genome Browser) only need to use sub-sections of the BAM file at a given point in time. For example, if you are viewing all alignments that are within a particular gene, than the alignments on other chromosomes generally do not need to be loaded. In order to speed up access, BAI files are BAM index files that allow other programs to navigate directly to the alignments of interest. This is especially important when you have many alignments. The utility samtools index directly creates an index that has the exact name as the input file, with '.bai' appended. The help page, if you execute samtools index with no arguments, is as follows:
Code Block |
---|
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14] |
Here, the syntax is way, way easier. We want a BAI-format index (CSI-format is used with extremely long contigs, which we aren't considering here - the most common use case are highly polyploid plant genomes), which is the default, so all we have to type is:
Code Block |
---|
samtools index yeast_pairedend.bam |
This will produce a file named yeast_pairedend.bam.bai. Most of the time when an index is required, it will be automatically located provided it is in the same directory as the BAM file that it was produced from, and shares the same name up until the '.bai'' extension.
Samtools Idxstats
Now that we have a sorted, compressed, and indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it literally could not be simpler - just type the following command:
Code Block |
---|
samtools idxstats yeast_pairedend.bam |
The output is a text file with four tab-delimited columns with the following meanings: (1) chromosome name, (2) chromosome length, (3) number of mapped reads, and (4) number of unmapped reads. The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its pair mapped to, but still flagged as unmapped.
Samtools Flagstat
Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome. The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:
Code Block |
---|
samtools flagstat yeast_pairedend.bam |
Ignore the "+ 0" addition to each line - that is a carried-over convention that is no longer necessary. The most important statistic is arguably alignment rate, but this readout allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.