...
- The 1st command performs a paired-end BWA global alignment similar to what we did before, but asks that the 100 bp reads be trimmed to 50 first.
- we refer to the pre-built index for yeast by name: sacCer3
- this index is located in the /work/projects/BioITeam/ref_genome/bwa/bwtsw/sacCer3/ directory
- we provide the name of the R1 FASTQ file
- because we request a PE alignment (the 1 argument) the script will look for a similarly-named R2 file.
- all output files associated with this command will be named with the prefix bwa_global.
- we refer to the pre-built index for yeast by name: sacCer3
- The 2nd command performs a paired-end BWA local alignment.
- all output files associated with this command will be named with the prefix bwa_local.
- no trimming is requested because the local alignment should ignore 5' and 3' bases that don't match the reference genome
- The 3rd command performs a paired-end Bowtie2 global alignment.
- the Bowtie2 alignment script has the same first arguments as the BWA alignment script.
- all output files associated with this command will be named with the prefix bt2_global.
- again, we specify that reads should first be trimmed to 50 bp.
- The 4th command performs a paired-end Bowtie2 local alignment.
- all output files associated with this command will be named with the prefix bt2_local.
- again, no trimming is requested for the local alignment.
Output files
This alignment pipeline script performs the following steps:
- Hard trims FASTQ, if optionally specified (fastx_trimmer)
- Performs the global or local alignment (here, a PE alignment)
- BWA global: bwa aln the R1 and R2 separately, then bwa sampe to produce a SAM file
- BWA local: call bwa mem with both R1 and R2 to produce a SAM file
- Bowtie2 global: call bowtie2 --global with both R1 and R2 to produce a SAM file
- Bowtie2 local: call bowtie2 --local with both R1 and R2 to produce a SAM file
- BWA global: bwa aln the R1 and R2 separately, then bwa sampe to produce a SAM file
- Converts SAM to BAM (samtools view)
- Sorts the BAM (samtools sort)
- Marks duplicates (Picard MarkDuplicates)
- Indexes the sorted, duplicate-marked BAM (samtools index)
- Gathers statistics (samtools idxstats, samtools flagstat, plus a custom statistics script of Anna's)
- Removes intermediate files
...
- <prefix>.align.log – Log file of the entire alignment process.
- check the tail of this file to make sure the alignment was successful
- <prefix>.sort.dup.bam – Sorted, duplicate-marked alignment file.
- <prefix>.sort.dup.bam.bai – Index for the sorted, duplicate-marked alignment file
- <prefix>.flagstat.txt – samtools flagstat output
- <prefix>.idxstats.txt – samtools idxstats output
- <prefix>.samstats.txt – Summary alignment statistics from Anna's stats script
- <prefix>.iszinfo.txt – Insert size statistics (for paired-end alignments) from Anna's stats script
Verifying alignment success
...
The -L option tells grep to only print the filenames that don't contain the pattern. Perfect! To see what might happen in the case of failure, create a dummy file that doesn't contain the success message:
...
Checking alignment statistics
The <prefix>.samstats.txt statistics files produced by the alignment pipeline has a lot of good information in one place. If you look at bwa_global.samstats.txt you'll see something like this:
...
Since this was a paired end alignment there is paired-end specific information reported, including insert size statistics: mean/.
You can also view statistics on insert sizes for properly paired reads in the bwa_global.iszinfo.txt file. This tells you the average (mean) insert size, standard deviation, mode (most common insert size value), and fivenum values (minminimum, q11st quartile, median, q3 max insert sizes, 3rd quartile, maximum).
A quick way to check alignment stats if you have run multiple alignments is again to use grep. For example:
Code Block | language | bash|
---|---|---|
Code Block | ||
| ||
grep 'Total mapped' *samstats.txt |
will produce output like this:
| |
Insert size stats for: bwa_global Number Totalof mappedpairs: 60289316807 (50.9 %proper) bt2_local.samstats.txt: Total mapped: Number of insert sizes: 406 788069 (66.5 %) bwa_global.samstats.txt: Mean [-/+ Total mapped1 SD]: 296 [176 416] 539079 (45.5sd %120) bwa_local.samstats.txt: TotalMode mapped[Fivenum]: 228 1008000 (76.5 %[51 224 232 241 500] |
A quick way to check alignment stats if you have run multiple alignments is again to use grep. For example:
Code Block | ||||
---|---|---|---|---|
| ||||
grep 'Total mapped' *samstats.txt |
will produce output like this:
Code Block |
---|
bt2_global.samstats.txt: Total mapped: 602893 (50.9 %)
bt2_local.samstats.txt: Total mapped: 788069 (66.5 %)
bwa_global.samstats.txt: Total mapped: 539079 (45.5 %)
bwa_local.samstats.txt: Total mapped: 1008000 (76.5 % |
Exercise: How would you list the median insert size for all the alignments?
Expand | ||
---|---|---|
| ||
That information is in the *.iszinfo.txt files, on the line labeled Mode. The median value is th 3rd value in the 5 fivnum values; it is the 6th whitespace-separated field on the Mode line. |
Expand | |||||
---|---|---|---|---|---|
| |||||
Use grep to isolate the Mode line, and awk to isolate the median value field:
|
TACC batch system considerations
...