...
Stage the alignment data
First connect to login5stampede2.ls5.tacc.utexas.edu and start an idev session. This should be second nature by now
Code Block | ||||
---|---|---|---|---|
| ||||
idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 68 --reservation=BIO_DATA_week_1 |
Then stage the sample datasets and references we will use.
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/alignmentreferences/fastqfasta mkdir -p $SCRATCH/core_ngs/referencesalignment/fastafastq cp $CORENGS/alignmentreferences/*fastq.gzfa $SCRATCH/core_ngs/alignmentreferences/fastqfasta/ cp $CORENGS/referencesalignment/*fastq.fagz $SCRATCH/core_ngs/references/fasta//alignment/fastq/ cd $SCRATCH/core_ngs/alignment/fastq |
These are descriptions of the FASTQ files we copied:
...
Code Block | ||||
---|---|---|---|---|
| ||||
/workwork2/projects/BioITeam/ref_genome/bwa/bwtsw/hg19 |
Tip | |||||
---|---|---|---|---|---|
The BioITeam maintains a set of reference indexes for many common organisms and aligners. They can be found in aligner-specific sub-directories of the /workwork2/projects/BioITeam/ref_genome area. E.g.:
|
...
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | more |
...
We might be able to get away with just using this literal alone as our regex, specifying '>' as the command line argument. But for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special carat ( ^ ) character metacharacter represents "beginning of line". So ^> means "beginning of a line followed by a > character".
...
alignment type | aligner options | pro's | con's |
---|---|---|---|
global with bwa | SE:
PE:
|
|
|
global with bowtie2 | bowtie2 --global |
|
|
local with bwa | bwa mem |
|
|
local with bowtie2 | bowtie2 --local |
|
|
...
Like other tools you've worked with so far, you first need to load bwa. Do that now, and then enter bwa with no arguments to view the top-level help page (many NGS tools will provide some help when called with no arguments). Note that bwa is available both from the standard TACC module system and as BioContainers. module.
Code Block | ||
---|---|---|
| ||
module load bwa
bwa |
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
Make sure you're in an idev session. If you're in an idev session, the hostname command will display a name like c455-021.stampede2.tacc.utexas.edu. But if you're on a login node the hostname will be something like login3.stampede2.tacc.utexas.edu. If you're on a login node, start an idev session like this:
|
Code Block | ||
---|---|---|
| ||
module load biocontainers # takes a while
module load bwa
bwa |
Code Block | ||
---|---|---|
| ||
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188 | ||
Code Block | ||
| ||
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.16a-r1181
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual. |
...
Note that bwa writes its (binary) output to standard output by default, so we need to redirect that to a .sai file.
We For simplicity, we will just execute these commands directly (not in a batch job), but since they are fairly large files we will first set up an interactive development (idev) session, which will give us a compute node for 3 hours:, one at a time. Each command should only take few minutes and you will see bwa's progress messages in your terminal.
Code Block | ||||
---|---|---|---|---|
| ||||
idev -p normal -m 180 -N 1 -n 24 -A UT-2015-05-18 --reservation=intro_NGS |
Tip |
---|
You can tell you're in a idev session because the hostname command will return a compute node name (e.g. nid00438) instead of a login node name (e.g. login5). |
For simplicity, we will just execute these commands directly, one at a time. Each command should only take few minutes and you will see bwa's progress messages in your terminal.
| ||||
# If not already loaded:
module load biocontainers
module load bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R1.cat.fastq | ||||
Code Block | ||||
---|---|---|---|---|
| ||||
module load bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R1.cat.fastq.gz > yeast_R1.sai
bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_R2.sai |
When all is done you should have two .sai files: yeast_R1.sai and yeast_R2.sai.
Tip | ||
---|---|---|
| ||
Double check that output was written by doing ls -lh and making sure the file sizes listed are not 0. |
Exercise: How long did it take to align the R2 file?
Expand | |||||
---|---|---|---|---|---|
| |||||
The last few lines of bwa's execution output should look something like this:
So the R2 alignment took ~109 seconds (1.8 minutes). |
Since you have your own private compute node, you can use all its resources. It has 24 cores, so re-run the R2 alignment asking for 20 execution threads.
Code Block |
---|
bwa aln -t 20 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_R2.sai |
Exercise: How much of a speedup did you seen when aligning the R2 file with 20 threads?
Expand | |||||
---|---|---|---|---|---|
| |||||
The last few lines of bwa's execution output should look something like this:
So the R2 alignment took only ~10 ~123 seconds (real time), or 10+ times as fast as with only one processing thread. Note, though, that the CPU time with 20 threads was greater (143 sec) than with only 1 thread (109 sec). That's because of the thread management overhead when using multiple threads. |
Next we use the bwa sampe command to pair the reads and output SAM format data. Just type that command in with no arguments to see its usage.
For this command you provide the same reference index prefix as for bwa aln, along with the two .sai files and the two original FASTQ files. Also, bwa writes its output to standard output, so redirect that to a .sam file.
Here is the command line statement you need. Just execute it on the command line.
~2 minutes). |
Since you have your own private compute node, you can use all its resources. It has 68 cores, so re-run the R2 alignment asking for 60 execution threads.
Code Block | ||||
---|---|---|---|---|
bwa aln -t 60 sacCer3/sacCer3.fa | ||||
Code Block | ||||
| ||||
bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai \ fastq/Sample_Yeast_L005_R1R2.cat.fastq.gz \ fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pairedend.sam |
You should now have a SAM file (yeast_pairedend.sam) that contains the alignments. It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.
> yeast_R2.sai |
Exercise: How much of a speedup did you seen when aligning the R2 file with 20 threadsExercise: What kind of information is in the first lines of the SAM file?
Expand | ||
---|---|---|
| ||
The SAM file has a number of header lines, which all start with an at sign ( @ ). The @SQ lines describe each contig (chromosome) and its length. There is also a @PG line that describes the way the bwa sampe was performed. |
Exercise: How many alignment records (not header records) are in the SAM file?
last few lines of bwa's execution output should look something like this: | ||||||||||
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record).
Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all header lines, which start with @.
|
Expand | ||
---|---|---|
| ||
There are 1,184,360 alignment records. |
Exercise: How many sequences were in the R1 and R2 FASTQ files combined?
So the R2 alignment took only ~20 seconds (real time), or 6+ times as fast as with only one processing thread. Note, though, that the CPU time with 60 threads was greater (617 sec) than with only 1 thread (124 sec). That's because of the thread management overhead when using multiple threads. |
Next we use the bwa sampe command to pair the reads and output SAM format data. Just type that command in with no arguments to see its usage.
For this command you provide the same reference index prefix as for bwa aln, along with the two .sai files and the two original FASTQ files. Also, bwa writes its output to standard output, so redirect that to a .sam file.
Here is the command line statement you need. Just execute it on the command line.
Code Block | ||||
---|---|---|---|---|
| ||||
bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai \
fastq/Sample_Yeast_L005_R1.cat.fastq.gz \
fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pairedend.sam |
You should now have a SAM file (yeast_pairedend.sam) that contains the alignments. It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.
Exercise: What kind of information is in the first lines of the SAM file?
Expand | ||
---|---|---|
| ||
The SAM file has a number of header lines, which all start with an at sign ( @ ). The @SQ lines describe each contig (chromosome) and its length. There is also a @PG line that describes the way the bwa sampe was performed. |
Exercise: How many alignment records (not header records) are in the SAM file?
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record).
Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all header lines, which start with @.
|
Expand | ||
---|---|---|
| ||
There are 1,184,360 alignment records. |
Exercise: How many sequences were in the R1 and R2 FASTQ files combined?
Expand | ||
---|---|---|
| ||
| ||
Expand | ||
| ||
|
Expand | ||
---|---|---|
| ||
There were a total of 1,184,360 original sequences (R1s + R2s) |
...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
The expression above returns 612,968. There were 1,184,360 records total, so the percentage is:
or about 51%. Not great. Note we perform this calculation in awk's BEGIN block, which is always executed, instead of the body block, which is only executed for lines of input. And here we call awk without piping it any input. |
...
The SAMtools program is a commonly used set of tools that allow a user to manipulate SAM/BAM files in many different ways, ranging from simple tasks (like SAM/BAM format conversion) to more complex functions (like sorting, indexing and statistics gathering). It is available in the TACC module system (as well as in BioContainers). Load that module and see what samtools has to offer:functions (like sorting, indexing and statistics gathering). It is available in the TACC module system (as well as in BioContainers). Load that module and see what samtools has to offer:
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
Make sure you're in an idev session. If you're in an idev session, the hostname command will display a name like c455-021.stampede2.tacc.utexas.edu. But if you're on a login node the hostname will be something like login3.stampede2.tacc.utexas.edu. If you're on a login node, start an idev session like this:
|
Code Block | ||
---|---|---|
| ||
# If not already loaded module load biocontainers # takes a while module load samtools samtools |
Code Block | ||
---|---|---|
| ||
Program: samtools (Tools for alignments in the SAM format) Version: 1.610 (using htslib 1.610) Usage: samtools <command> [options] Commands: -- Indexing dict create a sequence dictionary file faidx index/extract FASTA fqidx index/extract FASTQ index index alignment -- Editing calmd recalculate MD/NM tags and '=' bases fixmate fix mate information reheader replace BAM header rmdup remove PCR duplicates targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags markdup mark duplicates -- File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA -- Statistics bedcov read depth per BED region coverage alignment depth and percent coverage depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck) -- Viewing flags explain BAM flags tview text alignment viewer view SAM<->BAM<->CRAM conversion depad convert padded BAM to unpadded BAM |
...
Warning | ||
---|---|---|
| ||
There are two main "eras" of SAMtools development:
Unfortunately, some functions with the same name in both version eras have different options and arguments! So be sure you know which version you're using. (The samtools version is usually reported at the top of its usage listing).the top of its usage listing). TACC BioContainers also offers the original samtools version: samtools/ctrThe default version in the ls5 module system is a "modern" version, but the BioITeam has a copy of the version 0.1.19 samtools for programs that might need it: /work/projects/BioITeam/ls5/bin/samtools-0.1.19. That version is also available as a TACC BioContainers module.--3. |
samtools view
The samtools view utility provides a way of converting between SAM (text) and BAM (binary, compressed) format. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view without any other arguments. You should see:
Code Block | ||
---|---|---|
| ||
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...] Options: -b output BAM -C output CRAM (requires -T) -1 use fast BAM compression (implies -b) -u uncompressed BAM output (implies -b) -h include header in SAM output -H print SAM header only (no alignments) -c print only the count of matching records -o FILE output file name [stdout] -U FILE output reads not selected by filters to FILE [null] -t FILE FILE listing reference names and lengths (see long help) [null] -X include customized index file -L FILE only include reads overlapping this BED this BED FILE [null] -r STR only include reads in read group STR [null] -R FILE only include reads with read group listed in FILE [null] -rd STR:STR only include reads in read group with tag STR and associated value STR [null] -RD STR:FILE only include reads with read grouptag STR and associated values listed in FILE [null] -q INT only include reads with mapping quality >= INT [0] -l STR only include reads in library STR [null] -m INT only include reads with number of CIGAR operations consuming query sequence >= INT [0] -f INT only include reads with all of the FLAGs in INT present [0] -F INT only include reads with none of the FLAGS in INT present [0] -G INT only EXCLUDE reads with all of the FLAGs in INT present [0] -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the the fraction of templates/read pairs to keep; INT part sets seed) -M use the multi-region iterator (increases the speed, removes duplicates and fractionoutputs ofthe templates/readreads pairsas tothey keep;are INTordered partin setsthe seedfile) -x STR read tag to strip (repeatable) [null] -B collapse the backward CIGAR operation -? print long help, including note about region specification -S ignored (input format is auto-detected) --no-PG do not add a PG line --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE -T, --reference FILE Reference sequence FASTA FILE [null] -@, --threads INT FILE [null] -@, --threads INT Number of additional threads to use [0] --write-index Automatically index the output files [off] --verbosity INT Set Numberlevel of additional threads to use [0]verbosity |
That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format, and to include the header records.
...
Code Block | ||
---|---|---|
| ||
Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -m INT Set maximum memory per thread; suffix K/M/G recognized [768M] -n Sort by read name -t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set) -o FILE Write final output to FILE rather than standard output -T PREFIX Write temporary files to PREFIX.nnnn.bam --no-PG do not add a PG line --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null] -@, --threads INT Number of additional threads to use [0] --verbosity INT Set level of verbosity |
In most cases you will be sorting a BAM file from name order to locus order. You can use either -o or redirection with > to control the output.
...
Now that you've done everything the hard way, let's see how to do run an alignment pipeline using a BWA alignment script maintained by the BioITeam, /workwork2/projects/BioITeam/common/script/align_bwa_illumina.sh. Type in the script name to see its usage.
...
We're going to run this script and a similar Bowtie2 alignment script, on the yeast data using the TACC batch system. In a new directory, copy over the commands and submit the batch job. We ask for 1 hour 2 hours (-t 0102:00:00) with 4 tasks/node (-w 4); since we have 4 commands, this will run on 1 compute node.
...
Code Block | ||||
---|---|---|---|---|
| ||||
# Make sure you're not in an idev session by looking at the hostname hostname # If the hostname startslooks withlike "nidc455-004.stampede2.tacc.utexas.edu", exit the idev session # Make a new alignment directory for running these scripts mkdir -p $SCRATCH/core_ngs/alignment/bwa_script cd $SCRATCH/core_ngs/alignment/bwa_script ln -s -f ../fastq # Copy the alignment commands file and submit the batch job cp $CORENGS/tacc/aln_script.cmds . launcher_creator.py -j aln_script.cmds -n aln_script -t 0102:00:00 -w 4 -a UT-2015-05-18 -q normal sbatch --reservation=intro_NGSBIO_DATA_week_1 aln_script.slurm showq -u |
...
Code Block | ||||
---|---|---|---|---|
| ||||
/workwork2/projects/BioITeam/common/script/align_bwa_illumina.sh global ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bwa_global sacCer3 1 50 /workwork2/projects/BioITeam/common/script/align_bwa_illumina.sh local ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bwa_local sacCer3 1 /workwork2/projects/BioITeam/common/script/align_bowtie2_illumina.sh global ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bt2_global sacCer3 1 50 /workwork2/projects/BioITeam/common/script/align_bowtie2_illumina.sh local ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bt2_local sacCer3 1 |
...
The great thing about pipeline scripts like this is that you can perform alignments on many datasets in parallel at TACC, and they are written to take advantage of having multiple cores on TACC nodes where possible.
On the Lonestar5 stampede2, with its 24 68 physical cores per node, they are designed to run best with no more than 4 tasks per node.
Tip | ||
---|---|---|
| ||
These alignment scripts should always be run with a wayness of 4 (-w 4) in the Lonestar5 stampede2 batch system, meaning at most 4 commands per node. |
...