Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Table of Contents
excludeTable of Contents

Overview and Objectives

Once raw sequence files are generated (in FASTQ format) and quality-checked, the next step in most NGS pipelines is mapping to a reference genome. For individual sequences of interest, it is common to use a tool like BLAST to identify genes or species of origin. However, a typical example will have NGS dataset may have tens to hundreds of millions of reads, and a reference space that is frequently billions of bases, which BLAST and similar tools are not really designed to handle.

Thus, a large set of computational tools have been developed to quickly, and with sufficient (but NOT not absolute) accuracy align each read to its best location, if any, in a reference. Even though many mapping tools exist, a few individual programs have a dominant "market share" of the NGS world. These programs vary widely in their design, inputs, outputs, and applications. In this section, we will primarily focus on two of the most versatile mappers: BWA and Bowtie2, the latter being part of the Tuxedo suite (e.g. transcriptome-aware Tophat2).

Sample Datasets

You have already worked with a paired-end yeast ChIP-seq dataset, which we will continue to use here.  The paired end data should already be located at:

$SCRATCH/YEAST_FASTQ_AFTER_DAY_2
Code Block
languagebash
title"Permanent" location for original data
$WORK/archive/original/2014_05.core_ngs

We will also use two additional RNA-seq datasets, which are located at:

Code Block
/corral-repl/utexas/BioITeam/core_ngs_tools/$CLASSDIR/human_stuff

Set up a new directory in your scratch area called 'fastq_align', and populate it with copies the following files, derived from the locations given above:

File NameFile NameDescriptionSample
Sample_Yeast_L005_R1.cat.fastq.gzPaired-end Illumina, First of pair, FASTQYeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gzPaired-end Illumina, Second of pair, FASTQYeast ChIP-seq
human_rnaseq.fastq.gzPaired-end Illumina, First of pair only, FASTQHuman RNA-seq
human_mirnaseq.fastq.gzSingle-end Illumina, FASTQHuman microRNA-seq

...

First copy the two human datasets to your $SCRATCH/core_ngs/fastq_prep directory.

cds mkdir fastq_align cd fastq_align cp $SCRATCH/YEAST_FASTQ_AFTER_DAY_2 /corral-repl/utexas/BioITeam/core_ngs_tools/
Code Block
languagebash
titleHint:
Code Block
Stage human FASTQ data
cd $SCRATCH/core_ngs/fastq_prep
cp $CLASSDIR/human_stuff/*rnaseq.fastq.gz .

Do a fast quality check on the two new data files like you did earlier on the yeast files, and move all files and directories that are produced from the fastQC commands into a new subdirectory called 'fastqc_out'.

...

titleHint:

...

Create a $SCRATCH/core_ngs/align directory and make a link to the fastq_prep directory.

Code Block
languagebash
titlePrepare align directory
mkdir -p $SCRATCH/core_ngs/align
cd $SCRATCH/core_ngs/align
ln -s -f ../fastq_prep fq
ls -l 
ls fq

Reference Genomes

Before we get to alignment, we need a genome to align to.  We will use three different references here:  the  

  • the human genome (hg19)

...

  • the yeast genome (sacCer3)

...

  • and mirbase (v20)

...

  • , human subset

Mirbase is a collection of all known microRNAs in all species, and we . We will use the human subset of that database as our alignment reference.  This has the advantage of being significantly smaller than the human genome, while containing all the sequences we are actually interested in.

Expand
titleIf it's simpler and faster, would one ever want to align a miRNA dataset to hg19 rather than mirbase? If so, why?
  1. Due to natural variation, sequencing errors, and processing issues, variation between reference sequence and sample sequence is always possible. Alignment to the human genome allows a putative "microRNA" read the opportunity to find a better alignment in a region of the genome that is not an annotated
microRNA relative to the
  1. microRNA
reference sequence
  1. . If this occurs, we might think that a read represents a microRNA (since it aligned in the mirbase alignment), when it is actually more likely to have come from a non-miRNA area of the genome.
  2. If we suspect our library contained other RNA species, we may want to quantify the level of "contamination". Aligning to the human genome will allow rRNA, tRNA, snoRNA, etc to align. We can then use programs such as bedtools, coupled with appropriate genome annotation files, to quantify these "off-target" hits.

These are the three reference genomes we will be using today, with some information about them (and here is information about many more genomes):

...