Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • The 1st command performs a paired-end BWA global alignment similar to what we did before, but asks that the 100 bp reads be trimmed to 50 first.
    • we refer to the pre-built index for yeast by name: sacCer3
      • this index is located in the /work/projects/BioITeam/ref_genome/bwa/bwtsw/sacCer3/ directory
    • we provide the name of the R1 FASTQ file
      • because we request a PE alignment (the 1 argument) the script will look for a similarly-named R2 file.
    • all output files associated with this command will be named with the prefix bwa_global.
  • The 2nd command performs a paired-end BWA local alignment.
    • all output files associated with this command will be named with the prefix bwa_local.
    • no trimming is requested because the local alignment should ignore 5' and 3' bases that don't match the reference genome
  • The 3rd command performs a paired-end Bowtie2 global alignment.
    • the Bowtie2 alignment script has the same first arguments as the BWA alignment script.
    • all output files associated with this command will be named with the prefix bt2_global.
    • again, we specify that reads should first be trimmed to 50 bp.
  • The 4th command performs a paired-end Bowtie2 local alignment.
    • all output files associated with this command will be named with the prefix bt2_local.
    • again, no trimming is requested for the local alignment.

Output files

This alignment pipeline script performs the following steps:

  • Hard trims FASTQ, if optionally specified (fastx_trimmer)
  • Performs the global or local alignment (here, a PE alignment)
    • BWA globalbwa aln the R1 and R2 separately, then bwa sampe to produce a SAM file
    • BWA local: call bwa mem with both R1 and R2 to produce a SAM file
    • Bowtie2 globalcall bowtie2 --global with both R1 and R2 to produce a SAM file
    • Bowtie2 localcall bowtie2 --local with both R1 and R2 to produce a SAM file
  • Converts SAM to BAM (samtools view)
  • Sorts the BAM (samtools sort)
  • Marks duplicates (Picard MarkDuplicates)
  • Indexes the sorted, duplicate-marked BAM (samtools index)
  • Gathers statistics (samtools idxstats, samtools flagstat, plus a custom statistics script of Anna's)
  • Removes intermediate files

...

  1. <prefix>.align.log – Log file of the entire alignment process.
    • check the tail of this file to make sure the alignment was successful
  2. <prefix>.sort.dup.bam – Sorted, duplicate-marked alignment file.
  3. <prefix>.sort.dup.bam.bai – Index for the sorted, duplicate-marked alignment file
  4. <prefix>.flagstat.txt samtools flagstat output
  5. <prefix>.idxstats.txt samtools idxstats output
  6. <prefix>.samstats.txt – Summary alignment statistics from Anna's stats script
  7. <prefix>.iszinfo.txt – Insert size statistics (for paired-end alignments) from Anna's stats script

Verifying alignment success

...

The -L option tells grep to only print the filenames that don't contain the pattern. Perfect! To see what might happen in the case of failure, create a dummy file that doesn't contain the success message:

...

Checking alignment statistics

The <prefix>.samstats.txt statistics files produced by the alignment pipeline has a lot of good information in one place. If you look at bwa_global.samstats.txt you'll see something like this:

...

Since this was a paired end alignment there is paired-end specific information reported, including insert size statistics: mean/.

You can also view statistics on insert sizes for properly paired reads in the bwa_global.iszinfo.txt file. This tells you the average (mean) insert size, standard deviation, mode (most common insert size value), and fivenum values (minminimum, q11st quartile, median, q3 max insert sizes, 3rd quartile, maximum).

A quick way to check alignment stats if you have run multiple alignments is again to use grep. For example:


bash
Code Block
language
titleReview multiple alignment rates
grep 'Total mapped' *samstats.txt

will produce output like this:

<prefix>.iszinfo.txt output
  Insert size stats for: bwa_global
 
Code Block
bt2_global.samstats.txt:       Number Totalof mappedpairs:    60289316807 (50.9 %proper)
bt2_local.samstats.txt:        Total mapped:  Number of insert sizes: 406
   788069 (66.5 %)
bwa_global.samstats.txt:   Mean [-/+    Total mapped1 SD]: 296 [176 416] 539079 (45.5sd %120)
bwa_local.samstats.txt:         TotalMode mapped[Fivenum]: 228  1008000 (76.5 %[51 224 232 241 500]

A quick way to check alignment stats if you have run multiple alignments is again to use grep. For example:

Code Block
languagebash
titleReview multiple alignment rates
grep 'Total mapped' *samstats.txt

will produce output like this:

Code Block
bt2_global.samstats.txt:        Total mapped:    602893 (50.9 %)
bt2_local.samstats.txt:        Total mapped:    788069 (66.5 %)
bwa_global.samstats.txt:        Total mapped:    539079 (45.5 %)
bwa_local.samstats.txt:        Total mapped:   1008000 (76.5 %

Exercise: How would you list the median insert size for all the alignments?


Expand
titleHint

That information is in the *.iszinfo.txt files, on the line labeled Mode.

The median value is th 3rd value in the 5 fivnum values; it is the 6th whitespace-separated field on the Mode line.



Expand
titleAnswer

Use grep to isolate the Mode line, and awk to isolate the median value field:

Code Block
languagebash
grep 'Mode' *.iszinfo.txt | awk '{print $1,"Median insert size:",$7}'

TACC batch system considerations

...