Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

First stage the sample datasets and references we will use.

Code Block
languagebash
titleGet the alignment exercises files
mkdir -p $SCRATCH/core_ngs/alignment/fastq
mkdir -p $SCRATCH/core_ngs/references/fasta
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
cp $CORENGS/references/*.* $SCRATCH/core_ngs/references/fasta/

...

Code Block
languagebash
module load bwa
bwa
code
Expandcode
titleTop-level BWA help
BWA suite usage
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>
Usage:   bwa <command> [options]
Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

As you can see, bwa include many sub-commands that perform the tasks we are interested in.

...

We will index the genome with the bwa index command. Type bwa index with no arguments to see usage for this sub-command.

Code Block
titlebwa index usage
Usage:   bwa index [-a bwtsw|is] [-c] <in.fasta>
Options: -a STR    BWT construction algorithm: bwtsw or is [auto]
         -p STR    prefix of the index [same as fasta name]
		 -b INT	   block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*
Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes. Please choose `-a'
         according to the length of the genome.

...

Tip

You can tell you're in a idev session because the hostname command will return a compute node name (e.g. nid00438) instead of a login node name (e.g. login5).

...

Expand
titleAnswer

The last few lines of bwa's execution output should look something like this:

Code Block
languagebash
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.12-r1039
[main] CMD: bwa aln -t 20 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 21.157 sec; CPU: 268.813 sec

So the R2 alignment took only 21 seconds, or about 10 times as fast as with only one processing thread.

...

Code Block
languagebash
titlePairing of BWA global alignment of R1 and R2 aligned reads
bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai fastq/Sample_Yeast_L005_R1.cat.fastq.gz fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pairedend.sam

...

Code Block
languagebash
module load samtools
samtools
Code Block
titlesamtools SAMtools suite usage
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

...

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format.

Code Block
languagebash
titleConvert SAM to binary BAM
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools view -b -o yeast_pairedend.bam yeast_pairedend.sam 

...

How do you look at the BAM file contents now? That's simple. Just use samtools view without the -b option. Remember to pipe output to a pager!

Code Block
languagebash
titleView BAM records
samtools view yeast_pairedend.bam | more

...

To sort the paired-end yeast BAM file by coordinate, and get a BAM file named yeast_pairedend.sort.bam as output, execute the following command:

Code Block
languagebash
titleSort a BAM file
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools sort -O bam -T yeast_pairedend.tmp yeast_pairedend.bam > yeast_pairedend.sort.bam

...

Exercise: Compare the file sizes of the yeast_pariedend .sam, .bam, and .sort.bam files and explain why they are different.

Expand
titleHint
Code Block
languagebash
ls -lh yeast_pairedend*
Expand
titleAnswer

The yeast_pairedend.sam text file is the largest at ~348 MB.

The name-ordered binary yeast_pairedend.bam text file only about 1/3 that size, ~110 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file.

The coordinate-ordered binary yeast_pairedend.sort.bam file is even slightly smaller, ~91 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in a very an even more compact form. You can read more about this in section 4 of https://samtools.github.io/hts-specs/SAMv1.pdf.

...

the SAM format specification.

samtools index

Many tools (like the UCSC Genome Browser) only need to use sub-sections portions of the a BAM file at a given point in time. For example, if you are viewing alignments that are within a particular gene alignments , alignment records on other chromosomes generally do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow other programs to navigate directly to the alignments of interestfast random access. This is especially important when you have many alignmentsalignment records.

The utility samtools index creates an index that has the exact same name as the input BAM file, with suffix .bai appended. The help page, if you execute samtools index with no arguments, is as followsHere's the samtools index usage:

Code Block
titlesamtools index usage
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]

Here, the The syntax here is way, way easier. We want a BAI-format index which is the default. (CSI-format is used with extremely long contigs, which we arendon't considering apply here - the most common use case are highly is for polyploid plant genomes).

So all we have to type is:

Code Block
languagebash
titleIndex a sorted bam
samtools index yeast_pairedend.sort.bam

This will produce a file named yeast_pairedend.bam.bai.

Most of the time when an index is required, it will be automatically located as long as it is in the same directory as its BAM file and shares the same name up until the .bai extension.

Samtools idxstats

Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:

Exercise: Compare the sizes of the sorted BAM file and its BAI index.

Expand
titleHint
Code Block
languagebash
ls -lh
Code Block
languagebash
samtools idxstats
 yeast_pairedend.sort.bam

The output is a text file with four tab-delimited columns with the following meanings:

  1. chromosome name
  2. chromosome length
  3. number of mapped reads
  4. number of unmapped reads

The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its mate mapped to, but is flagged as unmapped.

Tip

If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results.

Samtools flagstat

*
Expand
titleAnswer

While the yeast_pairedend.sort.bam text file is ~91 MB, its index (yeast_pairedend.sort.bai) is only 20 KB.

samtools flagstat

Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome. The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:

...

The most important statistic is the mapping rate, but this readout allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.

samtools idxstats

Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:

Code Block
languagebash
samtools idxstats yeast_pairedend.sort.bam

The output is a text file with four tab-delimited columns with the following meanings:

  1. chromosome name
  2. chromosome length
  3. number of mapped reads
  4. number of unmapped reads

The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its mate mapped to, but is flagged as unmapped.

Tip

If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results.

Exercise #3: Yeast BWA PE alignment with BioITeam alignment script

...