Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • The -P option tells grep to Perl-style regular expression patterns. 
    • This makes including special characters like Tab ( \t ), Carriage Return ( \r ) or Linefeed ( \n ) much easier that the default POSIX paterns.
    • While it is not required here, it generally doesn't hurt to include this option.
  • '^>^>' is the regular expression describing the pattern we're looking for (described below)

  • sacCer3.fa is the file to search. 
    • lines with text that match our pattern will be written to standard output
    • non matching lines will be omitted
  • We pipe to more just in case there are a lot of contig names.

Now down to the nuts and bolts of the pattern: '^>^>'

First, the single quotes around the pattern – this tells the bash shell to pass the exact string contents to grep.

As part of its friendly command line parsing and evaluation, the shell will often look for special characters on the command line that mean something to it (for example, the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those single quotes tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". The shell will obey, will strip the single quotes off the string, and will pass the actual pattern, ^>^>, to the grep program. (Note that the shell does look inside double quotes ( " ) for certain special signals, such as looking for environment variable names to evaluate. Read more about Quoting in the shell.)

So what does ^>^> mean to grep? We know that contig name lines always start with a > character, so > is a literal for grep to use in its pattern match.

We might be able to get away with just using this literal alone as our regex, specifying '>' as the command line argument. But for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special carat ( ^ )  metacharacter represents "beginning of line". So ^>^> means "beginning of a line followed by a > character".

...

alignment typealigner optionspro'scon's
global with bwa 

SEsingle end reads:

  • bwa aln <R1>
  • bwa samse

PEpaired end reads:

  • bwa aln <R1>
  • bwa aln <R2>
  • bwa sampe
  • simple to use (take default options)
  • good for basic global alignment
  • multiple steps needed
global with bowtie2bowtie2 --global
  • extremely configurable
  • can be used for RNAseq alignment (after adapter trimming) because of its many options
  • complex (many options)
local with bwa bwa mem
  • simple to use (take default options)
  • very fast
  • no adapter trimming needed
  • good for simple RNAseq analysis
    • the secondary alignments it reports provide splice junction information
  • always produces alignments with secondary reads
    • must be filtered if not desired
local with bowtie2bowtie2 --local
  • extremely configurable
  • no adapter trimming needed
  • good for small RNA alignment because of its many options
  • complex – many options

...

Like other tools you've worked with so far, you first need to load bwa. Do that now, and then enter bwa with no arguments to view the top-level help page (many NGS tools will provide some help when called with no arguments). Note that bwa is available both from the standard TACC module system and as as a BioContainers. module.

Expand
titleMake sure you're in a idev session


Code Block
languagebash
titleStart an idev session
idev -p normal -m 120 -A UT-2015-05-18OTH21164 -N 1 -nr 68 CoreNGSday4



Code Block
languagebash
module load biocontainers  # takes a while
module load bwa
bwa

...

Expand
titleAnswer

The last few lines of bwa's execution output should look something like this:

Code Block
languagebash
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 50.76 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 50.35 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 13.64 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /usr/local/bin/bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R1.cat.fastq.gz
[main] Real time: 12278.936185 sec; CPU: 12377.597598 sec

So the R2 alignment took ~123 ~78 seconds (~2 ~1.3 minutes).

Since you have your own private compute node, you can use all its resources. It has 68 128 cores, so re-run the R2 alignment asking for 60 execution threads.

...

Expand
titleAnswer

The last few lines of bwa's execution output should look something like this:

Code Block
languagebash
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 266.70 sec
[bwa_aln_core] write to the disk... 0.04 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 268.94 sec
[bwa_aln_core] write to the disk... 0.03 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 72.26 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /usr/local/bin/bwa aln -t 60 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 195.872013 sec; CPU: 617142.095813 sec

So the R2 alignment took only ~20 ~5 seconds (real time), or 615+ times as fast as with only one processing thread.

Note, though, that the CPU time with 60 threads was greater (617 142.8 sec) than with only 1 thread (124 77.6 sec). That's because of the thread management overhead when using multiple threads.

...

Expand
titleMake sure you're in a idev session


Code Block
languagebash
titleStart an idev session
idev -p normal -m 120 -A UT-2015-05-18OTH21164 -N 1 -n 68 p normal -r CoreNGSday4



Code Block
languagebash
# If not already loaded
module load biocontainers  # takes a while

module load samtools
samtools

...

Code Block
titleSAMtools suite usage
Program: samtools (Tools for alignments in the SAM format)
Version: 1.109 (using htslib 1.109)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     coverage       alignment depth and percent coverage
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

...

The samtools view utility provides a way of converting between SAM (text) and BAM (binary, compressed) format. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view without any other arguments. You should see:

...

Looking at some of the alignment record information (e.g. samtools view yeast_pairedend.bam | cut -f 1-4 | more), you will notice that read names appear in adjacent pairs (for the R1 and R2), in the same order they appeared in the original FASTQ file. Since that means the corresponding mappings are in no particular order, searching through the file very inefficient. samtools sort re-orders entries in the SAM file either by locus (contig name + coordinate position) or by read name.

...

Code Block
titlesamtools sort usage
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  --no-PG     do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --verbosity INT
               Set level of verbosity

In most cases you will be sorting a BAM file from name order to locus order. You can use either -o or redirection with > to control the output.

...

Expand
titleAnswer

The yeast_pairedend.sam text file is the largest at ~348 MB.

The name-ordered binary yeast_pairedend.bam text file only about 1/3 that size, ~110 ~111 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file.

The coordinate-ordered binary yeast_pairedend.sort.bam file is even slightly smaller, ~91 ~92 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in an even more compact form. You can read more about this in section 4 of the SAM format specification.

...

Expand
titleAnswer

While the yeast_pairedend.sort.bam text file is ~91 ~92 MB, its index (yeast_pairedend.sort.bai) is only 20 KB.

...

Ignore the "+ 0" addition to each line - that is a carry-over convention for counting "QA-failed reads" that is no longer relevant.

...

Expand
titleHint

Divide the number of properly paired reads by the number of mapped reads:

Code Block
languagebash
awk 'BEGIN{ print 473114 / 547664 }'
# or
echo $(( 473114 * 100 / 547664 ))
# or
echo "473114 547664" | awk '{printf("%0.1f%%\n", 100*$1/$2)}'



Expand
titleAnswer

About 86% of mapped read were properly paired. This is actually a bit on the low side for ChIP-seq alignments which typically over 90%.

...

Code Block
languagebash
titleUse samtools idxstats to summarize mapped reads by contig
samtools idxstats yeast_pairedend.sort.bam | tee yeast_pairedend.idxstats.txt

Here we use the tee command which reports its standard input to standard output before also writing it to the specified file.

Code Block
languagebash
titlesamtools idxstats output
chrI    230218  8820    1640
chrII   813184  36616   4026
chrIII  316620  13973   1530
chrIV   1531933 72675   8039
chrV    576874  27466   2806
chrVI   270161  10866   1222
chrVII  1090940 50893   5786
chrVIII 562643  24672   3273
chrIX   439888  16246   1739
chrX    745751  31748   3611
chrXI   666816  28017   2776
chrXII  1078177 54783   10124
chrXIII 924431  40921   4556
chrXIV  784333  33070   3703
chrXV   1091291 48714   5150
chrXVI  948066  44916   5032
chrM    85779   3268    291
*       0       0       571392

...