You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 9 Next »

The first step in nearly every next-gen sequence analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll run some common mapping tools on TACC.

The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed and accuracy among read mappers that have remained popular.

There are over 50 read mapping programs listed here. We're going to (mainly) stick to just two in this course.

Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.).

Mapping tools summary

Tool

TACC

Version

Download

Manual

Example

BWA

module load bwa/0.6.2

0.6.1; 0.6.2

link

link

#BWA

Bowtie

module load bowtie/0.12.8 

0.12.8

link

link

#Bowtie

Bowtie2

module load bowtie/2.0.0b6

2.0.0b6

link

link

#Bowtie2

Example: E. coli genome re-sequencing data

Data

The data files for this example are in the path:

/corral-repl/utexas/BioITeam/ngs_course/intro_to_mapping/data

File Name

Description

Sample

SRR030257_1.fastq

Paired-end Illumina, First of pair, FASTQ format

Re-sequenced E. coli genome

SRR030257_2.fastq

Paired-end Illumina, Second of pair, FASTQ format

Re-sequenced E. coli genome

NC_012967.1.gbk

Reference Genome in Genbank format

E. coli B strain REL606

The easiest way to run the tutorial is to copy this entire directory to your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.

login1$ ls
NC_012967.1.gbk  SRR030257_1.fastq  SRR030257_2.fastq
cds
cp -r /corral-repl/utexas/BioITeam/ngs_course/intro_to_mapping/data intro_to_mapping
cd intro_to_mapping

Exercises

  • What basic linux commands could help us quickly peek at the files that were just copied to get an idea of their contents?
    head
    tail
    wc -l
    
  • How many sequences are in the file SRR030257_1.fastq?
    grep ^@ SRR030257_1.fastq | wc -l
    grep --count ^@ SRR030257_1.fastq
    
  • How many bases long are the reads in SRR030257_1.fastq?
    sed -n 2p SRR030257_1.fastq | awk -F"[ATCGatcg]" '{print NF-1}'
    

    ... and there must be more clever ways than this.

Converting sequence file formats

Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky?

The bp_seqconvert.pl script that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can copy it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded.

cp /corral-repl/utexas/BioITeam/ngs_course/scripts/bp_seqconvert.pl .

Run the script without any arguments to get the help message:

module load bioperl
./bp_seqconvert.pl

Exercises

The file NC_012967.1.gbk is in Genbank format. The files SRR030257_*.fastq are in FASTQ format.

  • Convert NC_012967.1.gbk to EMBL format. Call the output NC_012967.1.embl.
    • Does EMBL format have sequence features (like genes) annotated?
      ./bp_seqconvert.pl --from genbank --to embl < NC_012967.1.gbk > NC_012967.1.embl
      head -n 100 NC_012967.1.embl
      
  • Convert the first 10,000 lines of SRR030257_1.fastq to FASTA format.
    • What information was lost by this conversion?
      head -n 10000 SRR030257_1.fastq | ./bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fasta
      head SRR030257_1.fastq
      head SRR030257_1.fasta
      

      The line of funny ASCII characters was lost. Those are your "base quality scores".

Advanced Exercise

  • Write your own Genbank to FASTA sequence converter using Perl, Python, or shell scripting. Can you make it faster than the BioPerl one?

Extra reading

The first portion of a Genbank file contains information about "features" of the genome, like genes. The second part contains the actual bases of the reference sequence. Therefore, a Genbank essentially has an embedded FASTA file inside it. There are also a lot of nice statistics and metadata, like the size of the sequence in the GenBank header.

Mapping with bowtie

Load the bowtie module

module load bowtie
module list

Create a fresh output directory.

mkdir bowtie

Convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta AND put it in the new directory bowtie.

./bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > bowtie/NC_012967.1.fasta

Next, index the reference file:

bowtie-build bowtie/NC_012967.1.fasta bowtie/NC_012967.1

Take a look at your output directory using ls bowtie to see what new files have appeared.

Why do we create an index?

Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to align the read to every place in the genome, it saves a ton of time.

Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map.

Finally, map the reads:

Submit via qsub

Use launcher_creator.py followed by qsub to submit this command.

bowtie -t --sam bowtie/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq bowtie/SRR030257.sam

What does the -t option do?

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

  1. SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or using a viewer like IGV, which we will cover later.
  2. SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field (??) and CIGAR strings. We'll take a look at some of these later, if we have time.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

head bowtie/SRR030257.sam

What do you think the 4th and 8th columns mean?

More reading about SAM files

Mapping with BWA

BWA (the Burrows-Wheeler Aligner) is another fast mapping program. It's speedier successor to another aligner you might have used or heard of called MAQ (Mapping and Assembly with Quality).

The steps of running BWA are very similar to running bowtie.

Load the module:

module load bwa

There are multiple versions of BWA on TACC, so you might want to check which one you have loaded for when you write up your paper.

module spider bwa
module list
bwa

Create a fresh output directory.

mkdir bwa

Convert the reference file to FASTA (again) or copy from the bowtie example. Be sure to put this file in the bwa directory.

Index the reference file.

bwa index bwa/NC_012967.1.fasta

Take a look at your output directory using ls bwa to see what new files have appeared.

Find the locations where reads map:

Submit via qsub

Use launcher_creator.py followed by qsub to submit these commands.

bwa aln -f bwa/SRR030257_1.sai bwa/NC_012967.1.fasta SRR030257_1.fastq
bwa aln -f bwa/SRR030257_2.sai bwa/NC_012967.1.fasta SRR030257_2.fastq

Again, take a look at your output directory using ls bwa to see what new files have appeared. What is a *.sai file? It's a file containing "alignment seeds" in a file format specific to bwa. Many programs produce this kind of "intermediate" file in their own format and then at the end have tools for converting things to a "community" format shared by many programs.

Alignment in bwa is a two-step process. We still need to extend these seed matches by choosing the best ones, aligning regions of the reads around them, and converting the output to SAM format.

Create the final SAM alignment file:

Submit via qsub

Use launcher_creator.py followed by qsub to submit this command.

bwa sampe -f bwa/SRR030257.sam bwa/NC_012967.1.fasta bwa/SRR030257_1.sai bwa/SRR030257_2.sai SRR030257_1.fastq SRR030257_2.fastq

Mapping with Bowtie2

Bowtie2 is a complete rewrite of Bowtie. It is currently the latest and greatest in terms of configurability, sensitivity, and speed. It is not yet installed on TACC, because it is technically still a "beta" version that is being improved.

The steps of running bowtie2 are very similar to running bowtie1.

Create a fresh output directory.

mkdir bowtie2

Convert the reference file to FASTA (again) or copy from the bowtie example. Be sure to put this file in the bowtie2 directory.

Index the reference file and map the reads

bowtie2-build NC_012967.1.fasta NC_012967.1

Submit via qsub

Use launcher_creator.py followed by qsub to submit this command.

bowtie2 -x NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S SRR030257.sam

Exercises

  • Did bowtie or BWA map more reads?
  • In our example, we mapped in paired-end mode using bwa sampe. Try to figure out how to map the reads in single-end mode.
  • What are the limitations of each read mapper?
    • Which would we want to use if we were mapping 454 data?
  • Which aligner took less time to run?
    • Are there any options you can change so that more reads are mapped or that speed up performance without causing many fewer reads to be mapped?

Advanced Exercise

  • MAQ is also available as a module on TACC. You can also try figuring out how to run it on this example data. It was created before SAM format, so you have to use a super-secret ninja command to convert its output to SAM:
    module load samtools
    $TACC_SAMTOOLS_DIR/misc/maq2sam-short
    
    Other than that, you should be able to use the manual to get through the steps.

Next up

Continue to use your mapped reads in the Introduction to variant calling (SAMtools).

  • No labels