...
Since the BAM file contains records for both mapped and unmapped reads, just counting records doesn't provide information about the mapping rate of our alignment. The samtools flagstat tool provides a simple analysis of mapping rate based on the the SAM flag fields.
Here's how to run samtools flagstat and both see the output in the terminal and save it in a file – the samtools flagstat standard output is piped to tee, which both writes it to the specified file and sends it to its standard output:
...
You should see something like this:
Code Block | ||
---|---|---|
| ||
1184360 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 547664 + 0 mapped (46.24%:-nan%) 1184360 + 0 paired in sequencing 592180 + 0 read1 592180 + 0 read2 473114 + 0 properly paired (39.95%:-nan%) 482360 + 0 with itself and mate mapped 65304 + 0 singletons (5.51%:-nan%) 534 + 0 with mate mapped to a different chr 227 + 0 with mate mapped to a different chr (mapQ>=5) |
...
The most important statistic is the mapping rate (here 46%) but this readout also allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.
Exercise: What proportion of mapped reads were properly paired?
Expand | |||||
---|---|---|---|---|---|
| |||||
Divide the number of properly paired reads by the number of mapped reads:
|
Expand | ||
---|---|---|
| ||
About 86% of mapped read were properly paired. This is actually a bit on the low side for ChIP-seq alignments which typically over 90%. |
samtools idxstats
Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know More information about the alignment is provided by the samtools idxstats report, which shows how many reads aligned to each chromosome/ contig in your reference. The Note that samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command: must be run on a sorted, indexed BAM file.
Code Block | ||||
---|---|---|---|---|
| ||||
samtools idxstats yeast_pairedend.sort.bam |
The output is a text file with four tab-delimited columns with the following meanings:
- chromosome name
- chromosome length
- number of mapped reads
- number of unmapped reads
The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its mate mapped to, but is flagged as unmapped.
Tip |
---|
If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results. |
Exercise #3: Yeast BWA PE alignment with BioITeam alignment script
Now that you've done everything the hard way, let's see how to do run an alignment pipeline using a BWA alignment script.
First the setup:
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/core_ngs/align2/fastq
cd $SCRATCH/core_ngs/align2/fastq
cp /corral-repl/utexas/BioITeam/core_ngs_tools/alignment/*fastq.gz . |
Before executing the script you need to have one environment variable set. We'll do it at the command line here, but you could put it in your .bashrc file.
Code Block | ||
---|---|---|
| ||
export path_code=/work/01063/abattenh/code |
| tee yeast_pairedend.idxstats.txt |
Code Block | ||||
---|---|---|---|---|
| ||||
chrI 230218 8820 1640
chrII 813184 36616 4026
chrIII 316620 13973 1530
chrIV 1531933 72675 8039
chrV 576874 27466 2806
chrVI 270161 10866 1222
chrVII 1090940 50893 5786
chrVIII 562643 24672 3273
chrIX 439888 16246 1739
chrX 745751 31748 3611
chrXI 666816 28017 2776
chrXII 1078177 54783 10124
chrXIII 924431 40921 4556
chrXIV 784333 33070 3703
chrXV 1091291 48714 5150
chrXVI 948066 44916 5032
chrM 85779 3268 291
* 0 0 571392 |
The output has four tab-delimited columns:
- contig name
- contig length
- number of mapped reads
- number of unmapped reads
The reason that the "unmapped reads" field for named chromosomes is not zero is that the aligner may initially assign a potential mapping (contig name and start coordinate) to a read, but then mark it later as unampped if it does meet various quality thresholds.
Tip |
---|
If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results. |
Exercise #3: BWA PE alignment with BioITeam script
Now that you've done everything the hard way, let's see how to do run an alignment pipeline using a BWA alignment script maintained by the BioITeam
First the setup, to work in a new directory:
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/core_ngs/alignment/bwa_script
cd $SCRATCH/core_ngs/alignment/bwa_script
ln -s -f ../../fastq |
The BioITeam BWA alignment script is /work/projects/BioITeam/common/script/align_bwa_illumina.sh. Type in the script name to see its usage.
Code Block | ||
---|---|---|
| ||
align_bwa_illumina.sh 2017_09_07
Align Illumina SE or PE data with bwa. Produces a sorted, indexed,
duplicate-marked BAM file and various statistics files. Usage:
align_bwa_illumina.sh <aln_mode> <in_file> <out_pfx> <assembly> [ paired trim_sz trim_sz2 seq_fmt qual_fmt ]
Required arguments:
aln_mode Alignment mode, either global (bwa aln) or local (bwa mem).
in_file For single-end alignments, path to input sequence file.
For paired-end alignments using fastq, path to the the R1
fastq file which must contain the string 'R1' in its name.
The corresponding 'R2' must have the same path except for 'R1'.
out_pfx Desired prefix of output files in the current directory.
assembly One of hg38, hg19, hg38, mm10, mm9, sacCer3, sacCer1, ce11, ce10,
danRer7, hs_mirbase, mm_mirbase, or reference index prefix.
Optional arguments:
paired 0 = single end alignment (default); 1 = paired end.
trim_sz Size to trim reads to. Default 0 (no trimming)
trim_sz2 Size to trim R2 reads to for paired end alignments.
Defaults to trim_sz
seq_fmt Format of sequence file (fastq, bam or scarf). Default is
fastq if the input file has a '.fastq' extension; scarf
if it has a '.sequence.txt' extension.
qual_type Type of read quality scores (sanger, illumina or solexa).
Default is sanger for fastq, illumina for scarf.
Environment variables:
show_only 1 = only show what would be done (default not set)
aln_args other bowtie2 options (e.g. '-T 20' for mem, '-l 20' for aln)
no_markdup 1 = don't mark duplicates (default 0, mark duplicates)
run_fastqc 1 = run fastqc (default 0, don't run). Note that output
will be in the directory containing the fastq files.
keep 1 = keep unsorted BAM (default 0, don't keep)
bwa_bin BWA binary to use. Default bwa 0.7.x. Note that bwa 0.6.2
or earlier should be used for scarf and other short reads.
also: NUM_THREADS, BAM_SORT_MEM, SORT_THREADS, JAVA_MEM_ARG
Examples:
align_bwa_illumina.sh local ABC_L001_R1.fastq.gz my_abc hg38 1
align_bwa_illumina.sh global ABC_L001_R1.fastq.gz my_abc hg38 1 50
align_bwa_illumina.sh global sequence.txt old sacCer3 0 '' '' scarf solexa |
There are lots of bells and whistles in the arguments, but the most important are the first few:
- aln_mode – whether to perform a global or local alignment (the 1st argument must be one of those words)
- global mode uses the bwa aln workflow as we did above
- local mode uses the bwa mem command
- FASTQ file – full or relative path to the FASTQ file (just the R1 fastq if paired end). Can be compressed (.gz)
- output prefix – prefix for all the output files produced by the script. Should relate back to what the data is.
- assembly – genome assembly to use.
- there are pre-built indexes for some common eukaryotes (hg19, hg18, mm10, mm9, danRer7, sacCer3) that you can use
- or provide a full path for a bwa reference index you have built somewhere
- paired flag – 0 means single end (the default); 1 means paired end
- hard trim length – if you want the FASTQ hard trimmed down to a specific length before alignment, supply that number here
We're going to run this script and a similar Bowtie2 alignment script, on the yeast data using the TACC batch system. Copy over the commands file and take a look at it:Now change into the directory and call the script with no arguments to see usage
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/alignment/align2 $pathbwa_code/script cp $CORENGS/align/tacc/aln_script.cmds . cat aln_script.cmds |
Code Block | ||
---|---|---|
| ||
align_bwa_illumina.sh |
There are lots of bells and whistles in the arguments, but the most important are the first few:
...
- there are pre-built indexes for some common eukaryotes (hg19, hg18, mm10, mm9, danRer7, sacCer3) that you can use
- or provide a full path for a bwa reference index you have built somewhere
...
./fastq/Sample_Yeast_L005_R1.cat.fastq.gz yeast_chip sacCer3 1 |
...
in both global and local alignment modes, and also ru
Now run the pipeline. By piping the output to tee <filename> we can see the script's progress at the terminal, and it also is written to <filename>.
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/alignment/bwa_script
$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz yeast_chip sacCer3 1 2>&1 | tee aln.yeast_chip.log |
...