...

Code Block (bash): Copy alignment results
mkdir -p $SCRATCH/core_ngs/results
cd $SCRATCH/core_ngs/results
cp /corral-repl/utexas/BioITeam/core_ngs_tools/results/*.* .

 

Overview and Objectives

After raw sequence files are generated (in FASTQ format), quality-checked, and pre-processed in some way, the next step in most NGS pipelines is mapping to a reference genome. For individual sequences of interest, it is common to use a tool like BLAST to identify genes or species of origin. However, a normal NGS dataset will have tens to hundreds of millions of sequences, which BLAST and similar tools are not designed to handle.

Thus, a large set of computational tools has been developed to align each read to its best location, if any, in a reference, quickly and with sufficient (but not absolute) accuracy. This tradeoff between speed and accuracy is an important consideration when constructing alignment pipelines. Even though many mapping tools exist, a few individual programs have a dominant "market share" of the NGS world. These programs vary widely in their design, inputs, outputs, and applications. In this section, we will primarily focus on two of the most versatile mappers: BWA and Bowtie2, the latter being part of the Tuxedo suite (e.g. transcriptome-aware Tophat2) which also includes tools for manipulating NGS data after alignment.

Connect to login8.stampede.tacc.utexas.edu

...

  • the human genome (hg19)
  • the yeast genome (sacCer3)
  • and mirbase (v20), human subset

NOTE: These are not necessarily the most recent versions of these references - for example, hg19 is the second most recent human genome, with the most recent called hg38.  Similarly, the most recent mirbase annotation is v21.

Mirbase is a collection of all known microRNAs in all species (and many speculative miRNAs). We will use the human subset of that database as our alignment reference. This has the advantage of being significantly smaller than the human genome, while containing almost all the sequences likely to be detected in a miRNA sequencing run.

If it's simpler and faster, would one ever want to align a miRNA dataset to hg19 rather than mirbase? If so, why?
  1. Due to natural variation, sequencing errors, and processing issues, variation between reference sequence and sample sequence is always possible. Alignment to the human genome allows a putative "microRNA" read the opportunity to find a better alignment in a region of the genome that is not an annotated microRNA. If this occurs, we might think that a read represents a microRNA (since it aligned in the mirbase alignment), when it is actually more likely to have come from a non-miRNA area of the genome. This is a major complication involved when determining, for example, whether a potential miRNA is produced from a repetitive region.
  2. If we suspect our library contained other RNA species, we may want to quantify the level of "contamination". Aligning to the human genome will allow rRNA, tRNA, snoRNA, etc. to align. We can then use programs such as bedtools, coupled with appropriate genome annotation files, to quantify these "off-target" hits. This is particularly plausible if, after a miRNA sequencing run, the alignment rate to mirbase is very low.

These are the three reference genomes we will be using today, with some information about them (and here is information about many more genomes):

Reference | Species | Base Length | Contig Number | Source | Download
hg19 | Human | 3.1 Gbp | 25 (really 93) | UCSC | UCSC GoldenPath
sacCer3 | Yeast | 12.2 Mbp | 17 | UCSC | UCSC GoldenPath
mirbase V20 | Human | 160 Kbp | 1908 | Mirbase | Mirbase Downloads

Searching genomes is computationally hard work and takes a long time if done on an un-indexed, linear genomic sequence. So aligners require that references first be indexed to accelerate later retrieval. The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. Indexing takes a FASTA file as input, with each chromosome (or contig) as a separate entry, and produces an aligner-specific set of files as output. Those index files are then used by the aligner when performing the sequence alignment, and the resulting alignments are reported as positions in the input FASTA file.
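To give a feel for the Burrows-Wheeler Transform mentioned above, here is a toy sketch in bash. It builds every rotation of a short string, sorts the rotations, and keeps the last character of each. The example sequence GATTACA and the '$' end-of-text marker are illustrative assumptions, and this naive rotation sort is only a teaching device: real indexers such as bwa index use far more memory-efficient construction algorithms.

```bash
# Toy Burrows-Wheeler Transform of a short sequence (illustration only).
text='GATTACA$'          # '$' marks the end of the text and sorts before A/C/G/T
n=${#text}

bwt=$(
    # Emit every rotation of the text, one per line.
    for ((i = 0; i < n; i++)); do
        printf '%s\n' "${text:i}${text:0:i}"
    done |
    # Sort rotations by raw byte value so '$' comes first.
    LC_ALL=C sort |
    # The transform is the last character of each sorted rotation.
    while IFS= read -r rotation; do
        printf '%s' "${rotation:n-1:1}"
    done
)

echo "$bwt"    # prints ACTGA$TA
```

The transformed string groups characters that share a following context, and that property is what makes the index searchable without scanning the whole genome.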

hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time). Instead, all hg19 index files are located at:

Code Block (bash): BWA hg19 index location
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19

However, we can index the references for the yeast genome and human miRNAs, because they are much smaller. We will grab the FASTA files for these two references and build each index right before we use it. These two reference FASTA files are located at:

Code Block (bash): Yeast and mirbase FASTA locations
/corral-repl/utexas/BioITeam/core_ngs_tools/references/sacCer3.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/hairpin_cDNA_hsa.fa

First stage the yeast and mirbase reference FASTA files in your work archive area in a directory called references.

...

  1. Trim the FASTQ sequences down to 50 bases with fastx_trimmer
    • this removes most of any 3' adapter contamination without the fuss of specific adapter trimming w/cutadapt
  2. Prepare the sacCer3 reference index for bwa using bwa index (this is done once, and re-used for later alignments)
  3. Perform a global bwa alignment on the R1 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
  4. Perform a global bwa alignment on the R2 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
  5. Perform pairing of the separately aligned reads and report the alignments in SAM format using bwa sampe
  6. Convert the SAM file to a BAM file (samtools view)
  7. Sort the BAM file by genomic location (samtools sort)
  8. Index the BAM file (samtools index)
  9. Gather simple alignment statistics (samtools flagstat and samtools idxstats)
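Steps 2 through 9 above can be sketched as a single script. This is a rough outline under stated assumptions, not the exact commands used in this course: all input and output file names are hypothetical placeholders, the -f output option is used for bwa aln and bwa sampe in place of shell redirection, and samtools 1.x option syntax is assumed (older samtools releases use a different form for sort). By default the script only prints each command, so it can be read and checked before anything is run.

```bash
# Dry-run sketch of workflow steps 2-9.
# All file names are hypothetical placeholders; samtools 1.x syntax is assumed.
# By default each command is only printed; set DRY_RUN=0 to execute for real.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    if [ "$DRY_RUN" -eq 0 ]; then
        "$@"
    fi
}

REF=sacCer3.fa            # reference FASTA to index
R1=yeast_R1.trim.fq       # hypothetical trimmed R1 reads from step 1
R2=yeast_R2.trim.fq       # hypothetical trimmed R2 reads from step 1
OUT=yeast_pairedend       # hypothetical prefix for output files

run bwa index -a bwtsw "$REF"                      # step 2: build the index once
run bwa aln -f "$OUT.R1.sai" "$REF" "$R1"          # step 3: align R1 reads
run bwa aln -f "$OUT.R2.sai" "$REF" "$R2"          # step 4: align R2 reads
run bwa sampe -f "$OUT.sam" "$REF" \
    "$OUT.R1.sai" "$OUT.R2.sai" "$R1" "$R2"        # step 5: pair and write SAM
run samtools view -b -o "$OUT.bam" "$OUT.sam"      # step 6: SAM to BAM
run samtools sort -o "$OUT.sort.bam" "$OUT.bam"    # step 7: sort by location
run samtools index "$OUT.sort.bam"                 # step 8: index the sorted BAM
run samtools flagstat "$OUT.sort.bam"              # step 9: overall statistics
run samtools idxstats "$OUT.sort.bam"              # step 9: per-contig counts
```

Printing the commands first is a cheap way to catch a wrong file name before committing to a long alignment. The sketch does not stop on errors, so a real run would want to check each step's exit status.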

...

Top-level BWA help

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.7-r441
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual. 

As you can see, bwa includes many subcommands that perform most of the tasks we are interested in.

Building the BWA sacCer3 index

...

Code Block
Usage:   bwa index [-a bwtsw|is] [-c] <in.fasta>
Options: -a STR    BWT construction algorithm: bwtsw or is [auto]
         -p STR    prefix of the index [same as fasta name]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*
Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes. Please choose `-a'
         according to the length of the genome.

Based on the "Usage" description, we only need to specify two things:

...

Since sacCer3 is relatively large (~12 Mbp), we will specify bwtsw as the indexing option (as indicated by the "Warning" message), and the name of the FASTA file is sacCer3.fa.

...

Since the yeast genome is not large when compared to human, this should not take long to execute (otherwise we would do it as a batch job). When it is complete, you should see a set of index files like this:

...

Exploring the FASTA with grep

It is frequently useful to have a list of all contigs/chromosomes/genes/features in a file. You'll usually want to know this before you start the alignment so that you're familiar with the contig naming convention, and to verify that it's the one you expect. For example, chromosome 1 is specified as "chr1", "1", "I", and more in different references, and it can get weird for non-model organisms.
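As a minimal sketch of that kind of check, the snippet below lists contig names, counts them, and totals the bases per contig. It runs on a tiny made-up FASTA so that it is self-contained; the contig names and sequences are invented for illustration, and you would point the same commands at a real reference file instead.

```bash
# Build a tiny FASTA with made-up contigs so the example is self-contained.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>chrI
ACGTACGTAC
GGTTAA
>chrII
TTGACC
>chrM
ACGT
EOF

# Name lines start with '>'; sed strips that marker to leave the bare names.
contig_names=$(grep '^>' "$fasta" | sed 's/^>//')
echo "$contig_names"

# Number of contigs = number of name lines.
contig_count=$(grep -c '^>' "$fasta")
echo "contigs: $contig_count"

# Per-contig length: sum the lengths of the sequence lines under each name.
contig_lengths=$(awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($0, 2); len = 0; next }
         { len += length($0) }
    END  { if (name != "") print name "\t" len }
' "$fasta")
echo "$contig_lengths"

rm -f "$fasta"
```

Comparing the names printed here against the names used in your annotation files is a quick way to catch a "chr1" versus "1" mismatch before aligning.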

We saw that a FASTA consists of a number of contig entries, each one starting with a name line of the form below, followed by many lines of bases.

...

Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and a few different flavors of them, the most common being the Perl style (http://perldoc.perl.org/perlretut.html). We're only going to use the simplest of regular expressions here, but learning more about them will pay handsome dividends for you in the future (there's a reason Perl was used a lot when assembling the human genome).

Here's how to execute grep to list contig names in a FASTA file.

...

We might be able to get away with just using this literal alone as our regex, specifying '>' as the command line argument. But for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special caret ( ^ ) character represents "beginning of line". So ^> means "beginning of a line followed by a > character", followed by anything. (Aside: the dollar sign ( $ ) character represents "end of line" in a regex. There are many other special characters, including period ( . ), question mark ( ? ), pipe ( | ), parentheses ( ( ) ), and brackets ( [ ] ), to name the most common.)
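The effect of the ^ and $ anchors is easy to see on a small test file. The sketch below uses invented contig names (chr1, chr10, chr2) to show how an unanchored pattern matches more lines than intended, while an anchored one matches exactly one name line.

```bash
# Small made-up FASTA whose contig names share a common prefix.
toy=$(mktemp)
printf '%s\n' '>chr1' 'ACGTACGT' '>chr10' 'GGCC' '>chr2' 'TTAA' > "$toy"

# Unanchored: 'chr1' matches both the chr1 and chr10 name lines.
unanchored=$(grep -c 'chr1' "$toy")

# Anchored at both ends: only the line that is exactly '>chr1' matches.
anchored=$(grep -c '^>chr1$' "$toy")

# Anchored at the start only: every name line matches.
all_names=$(grep -c '^>' "$toy")

echo "unanchored=$unanchored anchored=$anchored all_names=$all_names"
# prints unanchored=2 anchored=1 all_names=3

rm -f "$toy"
```

The -c option makes grep report the number of matching lines instead of printing them, which is convenient when all you need is a count.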

...

What's going on?

Parameters are:

  • --local – local alignment mode
  • -L 16 – seed length 16
  • -N 1 – allow 1 mismatch in the seed
  • -x  mb20/hairpin_cDNA_hsa.fa – prefix path of index files
  • -U fq/human_mirnaseq.fastq.gz – FASTQ file for single-end (Unpaired) alignment
  • -S human_mirnaseq.sam – tells bowtie2 to report alignments in SAM format to the specified file

...