Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
module load biocontainers # optional, may take a while
module load bwa
bwa
Code Block
titleBWA suite usage
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.1216a-r1039r1181
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

...

Code Block
titlebwa index usage
Usage:   bwa index [-a bwtsw|is] [-coptions] <in.fasta>

Options: -a STR    BWT construction algorithm: bwtsw, is or isrb2 [auto]
         -p STR    prefix of the index [same as fasta name]
		         -b INT	    block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes. Please choose `-a'
         according to the length of the genome.

Based on the usage description, we only need to specify two things:

...

Expand
titleAnswer

The last few lines of bwa's execution output should look something like this:

Code Block
languagebash
[bwa_aln_core] write to the disk 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 012.0186 sec
[bwa_aln_core] 592180write sequencesto havethe been processed.disk... 0.00 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.1216a-r1039r1181
[main] CMD: bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 218109.832161 sec; CPU: 218108.274848 sec

So the R2 alignment took just under 4 minutes~109 seconds (1.8 minutes).

Since you have your own private compute node, you can use all its resources. It has 24 cores, so re-run the R2 alignment asking for 20 execution threads.

...

Expand
titleAnswer

The last few lines of bwa's execution output should look something like this:

Code Block
languagebash
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 19.56 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.1216a-r1039r1181
[main] CMD: bwa aln -t 20 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 219.157655 sec; CPU: 268142.813968 sec

So the R2 alignment took only 21 ~10 seconds (real time), or about 10+ times as fast as with only one processing thread.

Note, though, that the CPU time with 20 threads was greater (143 sec) than with only 1 thread (109 sec). That's because of the thread management overhead when using multiple threads.

Next we use the bwa sampe command to pair the reads and output SAM format data. Just type that command in with no arguments to see its usage.

...

Exercise: What kind of information is in the first lines of the SAM file?

Expand
titleAnswer

The SAM file has a number of header lines, which all start with an at sign ( @ ).

The @SQ lines describe each contig (chromosome) and its length.

There is also a @PG  line that describes the way the bwa sampe was performed.

Exercise: How many alignment records (not header records) are in the SAM file?

...

Expand
titleAnswer
There were a total of 1,184,360 original sequences (R1s + R2s)

Exercises:

  • Do both R1 and R2 reads have separate alignment records?
  • Does the SAM file contain both mapped and un-mapped reads?
  • What is the order of the alignment records in this SAM file?

...

Expand
titleAnswer

The expression above returns 612,968. There were 1,184,360 records total, so the percentage is:

Code Block
languagebash
titleCalculate alignment rate
echo $((612968 * 100/ 1184360))
awk 'BEGIN{print 612968/1184360}

or about 51%. Not great.

Note we perform this calculation in awk's BEGIN block, which is always executed, instead of the body block, which is only executed for lines of input. And here we call awk without piping it any inputor about 51%. Not great.

Exercise: What might we try in order to improve the alignment rate?

...

The SAMtools program is a commonly used set of tools that allow a user to manipulate SAM/BAM files in many different ways, ranging from simple tasks (like SAM/BAM format conversion) to more complex functions (like sorting, indexing and statistics gathering).  It is available in the TACC module system in the typical fashion(as well as in BioContainers). Load that module and see what samtools has to offer:

...

Code Block
titleSAMtools suite usage
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.16 (using htslib 1.3.16)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

...

Code Block
titlesamtools view usage
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
           query sequence >= INT [0]
  -f INT   only include reads with all  of bitsthe setFLAGs in INT set in FLAGpresent [0]
  -F INT   only include reads with none of the bits setFLAGS in INT set in FLAGpresent [0]
  -xG STRINT   readonly tagEXCLUDE toreads stripwith (repeatable) [null]
  -B       collapse the backward CIGAR operationall  of the FLAGs in INT present [0]
  -s FLOAT integer part sets seed of random number generator [0];
  subsample reads (given INT.FRAC option value, 0.FRAC is the
         rest sets fraction of templates/read pairs to keep; INT subsamplepart [no subsampling]sets seed)
  -@, --threads INT
     x STR   read tag to strip (repeatable) [null]
  -B      number ofcollapse BAM/CRAMthe compressionbackward threadsCIGAR [0]operation
  -?       print long help, including note about region specification
  -S       ignored (input format is auto-detected)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format.

...