The first step in many next-gen sequence analyses is to map sequencing reads to a reference genome. In this tutorial we'll run some common mapping tools on TACC.
The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed and accuracy among read mappers that have remained popular.
There are over 50 read mapping programs listed here. We're going to (mainly) stick to just two in this course.
Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.).
Mapping tools summary
Tool |
TACC |
Version |
Download |
Manual |
Example |
---|---|---|---|---|---|
BWA |
module load bwa/0.6.2 |
0.6.1; 0.6.2 |
|||
Bowtie |
module load bowtie/0.12.8 |
0.12.8 |
|||
Bowtie2 |
module load bowtie/2.0.0b6 |
2.0.0b6 |
Example: E. coli genome re-sequencing data
Data
The data files for this example are in the path:
/corral-repl/utexas/BioITeam/ngs_course/intro_to_mapping/data
File Name |
Description |
Sample |
---|---|---|
|
Paired-end Illumina, First of pair, FASTQ format |
Re-sequenced E. coli genome |
|
Paired-end Illumina, Second of pair, FASTQ format |
Re-sequenced E. coli genome |
|
Reference Genome in Genbank format |
E. coli B strain REL606 |
The easiest way to run the tutorial is to copy this entire directory to your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls
command.
login1$ ls NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq
Exercises
- What basic linux commands could help us quickly peek at the files that were just copied to get an idea of their contents?
- How many sequences are in the file
SRR030257_1.fastq
?
- How many bases long are the reads in
SRR030257_1.fastq
?
Converting sequence file formats
Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky?
The bp_seqconvert.pl
script that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can copy it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded.
cp /corral-repl/utexas/BioITeam/ngs_course/scripts/bp_seqconvert.pl .
Run the script without any arguments to get the help message:
module load bioperl ./bp_seqconvert.pl
Exercises
The file NC_012967.1.gbk
is in Genbank
format. The files SRR030257_*.fastq
are in FASTQ
format.
- Convert
NC_012967.1.gbk
toEMBL
format. Call the outputNC_012967.1.embl
.- Does EMBL format have sequence features (like genes) annotated?
- Convert the first 10,000 lines of
SRR030257_1.fastq
toFASTA
format.- What information was lost by this conversion?
Advanced Exercise
- Write your own Genbank to FASTA sequence converter using Perl, Python, or shell scripting. Can you make it faster than the BioPerl one?
Extra reading
Mapping with bowtie
Load the bowtie module
module load bowtie module list
Create a fresh output directory.
mkdir bowtie
Convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta
AND put it in the new directory bowtie
.
Next, index the reference file:
bowtie-build bowtie/NC_012967.1.fasta bowtie/NC_012967.1
Take a look at your output directory using ls bowtie
to see what new files have appeared.
Why do we create an index?
Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to align the read to every place in the genome, it saves a ton of time.
Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map.
Finally, map the reads:
Submit via qsub
Use launcher_creator.py followed by qsub to submit this command.
bowtie -t --sam bowtie/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq bowtie/SRR030257.sam
What does the -t
option do?
Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:
- SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like
head
orgrep
or using a viewer like IGV, which we will cover later. - SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field (??) and CIGAR strings. We'll take a look at some of these later, if we have time.
Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:
head bowtie/SRR030257.sam
What do you think the 4th and 8th columns mean?
More reading about SAM files
Mapping with BWA
BWA (the Burrows-Wheeler Aligner) is another fast mapping program. It's speedier successor to another aligner you might have used or heard of called MAQ (Mapping and Assembly with Quality).
The steps of running BWA are very similar to running bowtie.
Load the module:
module load bwa
There are multiple versions of bwa on TACC, so you might want to check which one you have loaded for when you write up your paper.
module spider bwa module list bwa
Create a fresh output directory.
mkdir bwa
Convert the reference file to FASTA (again) or copy from the bowtie example. Be sure to put this file in the bwa
directory.
Index the reference file.
bwa index bwa/NC_012967.1.fasta
Take a look at your output directory using ls bwa
to see what new files have appeared.
Find the locations where reads map:
Submit via qsub
Use launcher_creator.py followed by qsub to submit this command.
bwa aln -f bwa/SRR030257_1.sai bwa/NC_012967.1.fasta SRR030257_1.fastq bwa aln -f bwa/SRR030257_2.sai bwa/NC_012967.1.fasta SRR030257_2.fastq
Again, take a look at your output directory using ls bwa
to see what new files have appeared. What is a *.sai file? It's a file containing "alignment seeds" in a file format specific to bwa. Many programs produce this kind of "intermediate" file in their own format and then at the end have tools for converting things to a "community" format shared by many programs.
Alignment in bwa is a two-step process. We still need to extend these seed matches by choosing the best ones, aligning regions of the reads around them, and converting the output to SAM format.
Create the final SAM alignment file:
Submit via qsub
Use launcher_creator.py followed by qsub to submit this command.
bwa sampe -f bwa/SRR030257.sam bwa/NC_012967.1.fasta bwa/SRR030257_1.sai bwa/SRR030257_2.sai SRR030257_1.fastq SRR030257_2.fastq
Mapping with Bowtie2
Bowtie2 is a complete rewrite of Bowtie. It is currently the latest and greatest in terms of configurability, sensitivity, and speed. It is not yet installed on TACC, because it is technically still a "beta" version that is being improved.
The steps of running bowtie2 are very similar to running bowtie1.
Create a fresh output directory.
mkdir bowtie2
Convert the reference file to FASTA (again) or copy from the bowtie example. Be sure to put this file in the bowtie2
directory.
Index the reference file and map the reads
bowtie2-build NC_012967.1.fasta NC_012967.1
Submit via qsub
Use launcher_creator.py followed by qsub to submit this command.
bowtie2 -x NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S SRR030257.sam
Exercises
- Did bowtie or BWA map more reads?
- In our example, we mapped in paired-end mode using
bwa sampe
. Try to figure out how to map the reads in single-end mode. - What are the limitations of each read mapper?
- Which would we want to use if we were mapping 454 data?
- Which aligner took less time to run?
- Are there any options you can change so that more reads are mapped or that speed up performance without causing many fewer reads to be mapped?
Advanced Exercise
- MAQ is also available as a module on TACC. You can also try figuring out how to run it on this example data. It was created before SAM format, so you have to use a super-secret ninja command to convert its output to SAM:
Other than that, you should be able to use the manual to get through the steps.
module load samtools $TACC_SAMTOOLS_DIR/misc/maq2sam-short
Next up
Continue to use your mapped reads in the Introduction to variant calling (SAMtools).