...
First stage the sample datasets and references we will use.
Code Block |
---|
language | bash |
---|
title | Get the alignment exercises files |
---|
|
mkdir -p $SCRATCH/core_ngs/alignment/fastq
mkdir -p $SCRATCH/core_ngs/references/fasta
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
cp $CORENGS/references/*.* $SCRATCH/core_ngs/references/fasta/ |
...
Code Block |
---|
|
module load bwa
bwa |
Expandcode |
---|
| code |
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual. |
As you can see, bwa include many sub-commands that perform the tasks we are interested in.
...
We will index the genome with the bwa index command. Type bwa index with no arguments to see usage for this sub-command.
Code Block |
---|
|
Usage: bwa index [-a bwtsw|is] [-c] <in.fasta>
Options: -a STR BWT construction algorithm: bwtsw or is [auto]
-p STR prefix of the index [same as fasta name]
-b INT block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
-6 index files named as <in.fasta>.64.* instead of <in.fasta>.*
Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
`-a div' do not work not for long genomes. Please choose `-a'
according to the length of the genome. |
...
Tip |
---|
You can tell you're in a idev session because the hostname command will return a compute node name (e.g. nid00438) instead of a login node name (e.g. login5). |
...
Expand |
---|
|
The last few lines of bwa's execution output should look something like this: Code Block |
---|
| [bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.12-r1039
[main] CMD: bwa aln -t 20 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 21.157 sec; CPU: 268.813 sec |
So the R2 alignment took only 21 seconds, or about 10 times as fast as with only one processing thread. |
...
Code Block |
---|
language | bash |
---|
title | Pairing of BWA global alignment of R1 and R2 aligned reads |
---|
|
bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai fastq/Sample_Yeast_L005_R1.cat.fastq.gz fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pairedend.sam |
...
Code Block |
---|
|
module load samtools
samtools |
Code Block |
---|
title | samtools SAMtools suite usage |
---|
|
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM |
...
That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format.
Code Block |
---|
language | bash |
---|
title | Convert SAM to binary BAM |
---|
|
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools view -b -o yeast_pairedend.bam yeast_pairedend.sam |
...
How do you look at the BAM file contents now? That's simple. Just use samtools view without the -b option. Remember to pipe output to a pager!
Code Block |
---|
language | bash |
---|
title | View BAM records |
---|
|
samtools view yeast_pairedend.bam | more
|
...
To sort the paired-end yeast BAM file by coordinate, and get a BAM file named yeast_pairedend.sort.bam as output, execute the following command:
Code Block |
---|
language | bash |
---|
title | Sort a BAM file |
---|
|
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools sort -O bam -T yeast_pairedend.tmp yeast_pairedend.bam > yeast_pairedend.sort.bam |
...
Exercise: Compare the file sizes of the yeast_pariedend .sam, .bam, and .sort.bam files and explain why they are different.
Expand |
---|
|
Code Block |
---|
| ls -lh yeast_pairedend* |
|
Expand |
---|
|
The yeast_pairedend.sam text file is the largest at ~348 MB. The name-ordered binary yeast_pairedend.bam text file only about 1/3 that size, ~110 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file. The coordinate-ordered binary yeast_pairedend.sort.bam file is even slightly smaller, ~91 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in a very an even more compact form. You can read more about this in section 4 of https://samtools.github.io/hts-specs/SAMv1.pdf. |
...
samtools index
Many tools (like the UCSC Genome Browser) only need to use sub-sections portions of the a BAM file at a given point in time. For example, if you are viewing alignments that are within a particular gene alignments , alignment records on other chromosomes generally do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow other programs to navigate directly to the alignments of interestfast random access. This is especially important when you have many alignmentsalignment records.
The utility samtools index creates an index that has the exact same name as the input BAM file, with suffix .bai appended. The help page, if you execute samtools index with no arguments, is as followsHere's the samtools index usage:
Code Block |
---|
title | samtools index usage |
---|
|
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14] |
Here, the The syntax here is way, way easier. We want a BAI-format index which is the default. (CSI-format is used with extremely long contigs, which we arendon't considering apply here - the most common use case are highly is for polyploid plant genomes).
So all we have to type is:
Code Block |
---|
language | bash |
---|
title | Index a sorted bam |
---|
|
samtools index yeast_pairedend.sort.bam |
This will produce a file named yeast_pairedend.bam.bai.
Most of the time when an index is required, it will be automatically located as long as it is in the same directory as its BAM file and shares the same name up until the .bai extension.
Samtools idxstats
Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:
Exercise: Compare the sizes of the sorted BAM file and its BAI index.
Expand |
---|
|
|
Code Block |
---|
|
samtools idxstats |
The output is a text file with four tab-delimited columns with the following meanings:
- chromosome name
- chromosome length
- number of mapped reads
- number of unmapped reads
The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its mate mapped to, but is flagged as unmapped.
Tip |
---|
If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results. |
Samtools flagstat
Expand |
---|
|
While the yeast_pairedend.sort.bam text file is ~91 MB, its index (yeast_pairedend.sort.bai) is only 20 KB. |
samtools flagstat
Finally, we might like to obtain some other statistics, such as the percent of all reads that aligned to the genome. The samtools flagstat tool provides very simple analysis of the SAM flag fields, which includes information like whether reads are properly paired, aligned or not, and a few other things. Its syntax is identical to that of samtools idxstats:
...
The most important statistic is the mapping rate, but this readout allows you to verify that some common expectations (e.g. that about the same number of R1 and R2 reads aligned, and that most mapped reads are proper pairs) are met.
samtools idxstats
Now that we have a sorted, indexed BAM file, we might like to get some simple statistics about the alignment run. For example, we might like to know how many reads aligned to each chromosome/contig. The samtools idxstats is a very simple tool that provides this information. If you type the command without any arguments, you will see that it could not be simpler - just type the following command:
Code Block |
---|
|
samtools idxstats yeast_pairedend.sort.bam |
The output is a text file with four tab-delimited columns with the following meanings:
- chromosome name
- chromosome length
- number of mapped reads
- number of unmapped reads
The reason that the "unmapped reads" field for the named chromosomes is not zero is that, if one half of a pair of reads aligns while the other half does not, the unmapped read is still assigned to the chromosome its mate mapped to, but is flagged as unmapped.
Tip |
---|
If you're mapping to a non-genomic reference such as miRBase miRNAs or another set of genes (a transcriptome), samtools idxstats gives you a quick look at quantitative alignment results. |
Exercise #3: Yeast BWA PE alignment with BioITeam alignment script
...