...
Code Block | ||
---|---|---|
| ||
module load biocontainers # optional, may take a while
module load bwa
bwa |
Code Block | ||
---|---|---|
| ||
Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.1216a-r1039r1181 Contact: Heng Li <lh3@sanger.ac.uk> Usage: bwa <command> [options] Command: index index sequences in the FASTA format mem BWA-MEM algorithm fastmap identify super-maximal exact matches pemerge merge overlapping paired ends (EXPERIMENTAL) aln gapped/ungapped alignment samse generate alignment (single ended) sampe generate alignment (paired ended) bwasw BWA-SW for long queries shm manage indices in shared memory fa2pac convert FASTA to PAC format pac2bwt generate BWT from PAC pac2bwtgen alternative algorithm for generating BWT bwtupdate update .bwt to the new format bwt2sa generate SA from BWT and Occ Note: To use BWA, you need to first index the genome with `bwa index'. There are three alignment algorithms in BWA: `mem', `bwasw', and `aln/samse/sampe'. If you are not sure which to use, try `bwa mem' first. Please `man ./bwa.1' for the manual. |
...
Code Block | ||
---|---|---|
| ||
Usage: bwa index [-a bwtsw|is] [-coptions] <in.fasta> Options: -a STR BWT construction algorithm: bwtsw, is or isrb2 [auto] -p STR prefix of the index [same as fasta name] -b INT block size for the bwtsw algorithm (effective with -a bwtsw) [10000000] -6 index files named as <in.fasta>.64.* instead of <in.fasta>.* Warning: `-a bwtsw' does not work for short genomes, while `-a is' and `-a div' do not work not for long genomes. Please choose `-a' according to the length of the genome. |
Based on the usage description, we only need to specify two things:
...
Expand | |||||
---|---|---|---|---|---|
| |||||
The last few lines of bwa's execution output should look something like this:
So the R2 alignment took just under 4 minutes~109 seconds (1.8 minutes). |
Since you have your own private compute node, you can use all its resources. It has 24 cores, so re-run the R2 alignment asking for 20 execution threads.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
The last few lines of bwa's execution output should look something like this:
So the R2 alignment took only 21 ~10 seconds (real time), or about 10+ times as fast as with only one processing thread. Note, though, that the CPU time with 20 threads was greater (143 sec) than with only 1 thread (109 sec). That's because of the thread management overhead when using multiple threads. |
Next we use the bwa sampe command to pair the reads and output SAM format data. Just type that command in with no arguments to see its usage.
...
Exercise: What kind of information is in the first lines of the SAM file?
Expand | ||
---|---|---|
| ||
The SAM file has a number of header lines, which all start with an at sign ( @ ). The @SQ lines describe each contig (chromosome) and its length. There is also a @PG line that describes the way the bwa sampe was performed. |
Exercise: How many alignment records (not header records) are in the SAM file?
...
Expand | ||
---|---|---|
| ||
There were a total of 1,184,360 original sequences (R1s + R2s) |
Exercises:
- Do both R1 and R2 reads have separate alignment records?
- Does the SAM file contain both mapped and un-mapped reads?
- What is the order of the alignment records in this SAM file?
...
Expand | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
The expression above returns 612,968. There were 1,184,360 records total, so the percentage is:
or about 51%. Not great. Note we perform this calculation in awk's BEGIN block, which is always executed, instead of the body block, which is only executed for lines of input. And here we call awk without piping it any inputor about 51%. Not great. |
Exercise: What might we try in order to improve the alignment rate?
...
The SAMtools program is a commonly used set of tools that allow a user to manipulate SAM/BAM files in many different ways, ranging from simple tasks (like SAM/BAM format conversion) to more complex functions (like sorting, indexing and statistics gathering). It is available in the TACC module system in the typical fashion(as well as in BioContainers). Load that module and see what samtools has to offer:
...
Code Block | ||
---|---|---|
| ||
Program: samtools (Tools for alignments in the SAM format) Version: 1.3.16 (using htslib 1.3.16) Usage: samtools <command> [options] Commands: -- Indexing dict create a sequence dictionary file faidx index/extract FASTA index index alignment -- Editing calmd recalculate MD/NM tags and '=' bases fixmate fix mate information reheader replace BAM header rmdup remove PCR duplicates targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags markdup mark duplicates -- File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA -- Statistics bedcov read depth per BED region depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck) -- Viewing flags explain BAM flags tview text alignment viewer view SAM<->BAM<->CRAM conversion depad convert padded BAM to unpadded BAM |
...
Code Block | ||
---|---|---|
| ||
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...] Options: -b output BAM -C output CRAM (requires -T) -1 use fast BAM compression (implies -b) -u uncompressed BAM output (implies -b) -h include header in SAM output -H print SAM header only (no alignments) -c print only the count of matching records -o FILE output file name [stdout] -U FILE output reads not selected by filters to FILE [null] -t FILE FILE listing reference names and lengths (see long help) [null] -L FILE only include reads overlapping this BED FILE [null] -r STR only include reads in read group STR [null] -R FILE only include reads with read group listed in FILE [null] -q INT only include reads with mapping quality >= INT [0] -l STR only include reads in library STR [null] -m INT only include reads with number of CIGAR operations consuming query sequence >= INT [0] -f INT only include reads with all of bitsthe setFLAGs in INT set in FLAGpresent [0] -F INT only include reads with none of the bits setFLAGS in INT set in FLAGpresent [0] -xG STRINT readonly tagEXCLUDE toreads stripwith (repeatable) [null] -B collapse the backward CIGAR operationall of the FLAGs in INT present [0] -s FLOAT integer part sets seed of random number generator [0]; subsample reads (given INT.FRAC option value, 0.FRAC is the rest sets fraction of templates/read pairs to keep; INT subsamplepart [no subsampling]sets seed) -@, --threads INT x STR read tag to strip (repeatable) [null] -B number ofcollapse BAM/CRAMthe compressionbackward threadsCIGAR [0]operation -? print long help, including note about region specification -S ignored (input format is auto-detected) --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE -T, --reference FILE Reference sequence FASTA FILE [null] -@, --threads INT Number of additional threads to use [0] |
That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format.
...