Page History

...

First stage the sample datasets and references we will use.

Code Block

language	bash
title	Get the alignment exercises files

mkdir -p $SCRATCH/core_ngs/alignment/fastq
mkdir -p $SCRATCH/core_ngs/references/fasta
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
cp $CORENGS/references/*.* $SCRATCH/core_ngs/references/fasta/

...

Code Block

language	bash

module load bwa
bwa

code

Expandcode

title	Top-level BWA help

BWA suite usage

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>
Usage:   bwa <command> [options]
Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

As you can see, bwa include many sub-commands that perform the tasks we are interested in.

...

We will index the genome with the bwa index command. Type bwa index with no arguments to see usage for this sub-command.

Code Block

title	bwa index usage

Usage:   bwa index [-a bwtsw|is] [-c] <in.fasta>
Options: -a STR    BWT construction algorithm: bwtsw or is [auto]
         -p STR    prefix of the index [same as fasta name]
		 -b INT	   block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*
Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes. Please choose `-a'
         according to the length of the genome.

...

Tip
You can tell you're in a idev session because the hostname command will return a compute node name (e.g. nid00438) instead of a login node name (e.g. login5).

...

Expand

title	Answer

The last few lines of bwa's execution output should look something like this:

Code Block

language	bash

[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.12-r1039
[main] CMD: bwa aln -t 20 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 21.157 sec; CPU: 268.813 sec

So the R2 alignment took only 21 seconds, or about 10 times as fast as with only one processing thread.

...

Code Block

language	bash
title	Pairing of BWA global alignment of R1 and R2 aligned reads

bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai fastq/Sample_Yeast_L005_R1.cat.fastq.gz fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pairedend.sam

...

Code Block

language	bash

module load samtools
samtools

Code Block

title	samtools SAMtools suite usage

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

...

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format.

Code Block

language	bash
title	Convert SAM to binary BAM

cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools view -b -o yeast_pairedend.bam yeast_pairedend.sam

...

How do you look at the BAM file contents now? That's simple. Just use samtools view without the -b option. Remember to pipe output to a pager!

Code Block

language	bash
title	View BAM records

samtools view yeast_pairedend.bam | more

...

To sort the paired-end yeast BAM file by coordinate, and get a BAM file named yeast_pairedend.sort.bam as output, execute the following command:

Code Block

language	bash
title	Sort a BAM file

cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools sort -O bam -T yeast_pairedend.tmp yeast_pairedend.bam > yeast_pairedend.sort.bam

...

Exercise: Compare the file sizes of the yeast_pariedend .sam, .bam, and .sort.bam files and explain why they are different.

Expand

title	Hint

Code Block

language	bash

ls -lh yeast_pairedend*

Expand

title	Answer

The yeast_pairedend.sam text file is the largest at ~348 MB.

The name-ordered binary yeast_pairedend.bam text file only about 1/3 that size, ~110 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file.

The coordinate-ordered binary yeast_pairedend.sort.bam file is even slightly smaller, ~91 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in a very an even more compact form. You can read more about this in section 4 of https://samtools.github.io/hts-specs/SAMv1.pdf.

...

the SAM format specification.

samtools index

Many tools (like the UCSC Genome Browser) only need to use sub-sections portions of the a BAM file at a given point in time. For example, if you are viewing alignments that are within a particular gene alignments , alignment records on other chromosomes generally do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow other programs to navigate directly to the alignments of interestfast random access. This is especially important when you have many alignmentsalignment records.

The utility samtools index creates an index that has the exact same name as the input BAM file, with suffix .bai appended. The help page, if you execute samtools index with no arguments, is as followsHere's the samtools index usage:

Code Block

title	samtools index usage

Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]

Here, the The syntax here is way, way easier. We want a BAI-format index which is the default. (CSI-format is used with extremely long contigs, which we arendon't considering apply here - the most common use case are highly is for polyploid plant genomes).

So all we have to type is:

Code Block

language	bash
title	Index a sorted bam

samtools index yeast_pairedend.sort.bam

This will produce a file named yeast_pairedend.bam.bai.

Most of the time when an index is required, it will be automatically located as long as it is in the same directory as its BAM file and shares the same name up until the .bai extension.

Samtools idxstats

Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:

Exercise: Compare the sizes of the sorted BAM file and its BAI index.

Expand

title	Hint

Code Block

language	bash

ls -lh

Code Block

language	bash

samtools idxstats

 yeast_pairedend.sort.bam

The output is a text file with four tab-delimited columns with the following meanings:

chromosome name
chromosome length
number of mapped reads
number of unmapped reads

The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its mate mapped to, but is flagged as unmapped.

Tip
If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results.

Samtools flagstat

Expand

title	Answer

While the yeast_pairedend.sort.bam text file is ~91 MB, its index (yeast_pairedend.sort.bai) is only 20 KB.

samtools flagstat

Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome. The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:

...

The most important statistic is the mapping rate, but this readout allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.

samtools idxstats

Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:

Code Block

language	bash

samtools idxstats yeast_pairedend.sort.bam

The output is a text file with four tab-delimited columns with the following meanings:

chromosome name
chromosome length
number of mapped reads
number of unmapped reads

The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its mate mapped to, but is flagged as unmapped.

Tip
If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results.

Exercise #3: Yeast BWA PE alignment with BioITeam alignment script

...

Page tree

Versions Compared

Old Version 137

New Version 138

Key

samtools index

Samtools idxstats

Samtools flagstat

samtools flagstat

samtools idxstats

Exercise #3: Yeast BWA PE alignment with BioITeam alignment script