...
Overview and Objectives
Once raw sequence files are generated (in FASTQ format) and quality-checked, the next step in most NGS pipelines is mapping to a reference genome. For individual sequences of interest, it is common to use a tool like BLAST to identify genes or species of origin. However, a typical NGS dataset may have tens to hundreds of millions of reads, and the reference is frequently billions of bases long, a scale that BLAST and similar tools are not designed to handle.
Thus, a large set of computational tools has been developed to align each read to its best location, if any, in a reference, quickly and with sufficient (but not absolute) accuracy. Even though many mapping tools exist, a few individual programs have a dominant "market share" of the NGS world. These programs vary widely in their design, inputs, outputs, and applications. In this section, we will primarily focus on two of the most versatile mappers: BWA and Bowtie2, the latter being part of the Tuxedo suite (which also includes, e.g., the transcriptome-aware Tophat2).
Sample Datasets
You have already worked with a paired-end yeast ChIP-seq dataset, which we will continue to use here. The paired-end data should already be located at:
Code Block | ||||
---|---|---|---|---|
| ||||
$WORK/archive/original/2014_05.core_ngs$SCRATCH/YEAST_FASTQ_AFTER_DAY_2 |
We will also use two additional RNA-seq datasets, which are located at:
Code Block |
---|
/corral-repl/utexas/BioITeam/core_ngs_tools/$CLASSDIR/human_stuff |
Set up a new directory in your scratch area called 'fastq_align', and populate it with copies of the following files, derived from the locations given above:
File Name | Description | Sample |
---|---|---|
Sample_Yeast_L005_R1.cat.fastq.gz | Paired-end Illumina, First of pair, FASTQ | Yeast ChIP-seq |
Sample_Yeast_L005_R2.cat.fastq.gz | Paired-end Illumina, Second of pair, FASTQ | Yeast ChIP-seq |
human_rnaseq.fastq.gz | Paired-end Illumina, First of pair only, FASTQ | Human RNA-seq |
human_mirnaseq.fastq.gz | Single-end Illumina, FASTQ | Human microRNA-seq |
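Once the files are staged, it can be worth confirming that all four are actually present before moving on. The following is a minimal sketch, not part of the original instructions: the function name `check_fastq_files` is hypothetical, and the file names are taken from the table above.

```shell
# Report which of the four expected FASTQ files are missing from a directory.
# Prints one missing file name per line; returns non-zero if any are missing.
# (check_fastq_files is a hypothetical helper name used for illustration.)
check_fastq_files() {
    local dir="$1"
    local missing=0
    local f
    for f in Sample_Yeast_L005_R1.cat.fastq.gz \
             Sample_Yeast_L005_R2.cat.fastq.gz \
             human_rnaseq.fastq.gz \
             human_mirnaseq.fastq.gz; do
        if [ ! -e "$dir/$f" ]; then
            echo "$f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_fastq_files "$SCRATCH/fastq_align"` should print nothing if the directory was populated completely.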
First copy the two human datasets to your $SCRATCH/core_ngs/fastq_prep directory.
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/fastq_prep
cp /corral-repl/utexas/BioITeam/core_ngs_tools/$CLASSDIR/human_stuff/*rnaseq.fastq.gz .
Run a quick quality check on the two new data files, as you did earlier on the yeast files, and move all files and directories produced by the FastQC commands into a new subdirectory called 'fastqc_out'.
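The tidy-up step could be sketched as below. This is an illustration under an assumption, not the course's own solution: it assumes FastQC's usual output naming, where results end in `_fastqc.html`, `_fastqc.zip`, or (for unpacked reports) a `_fastqc` directory. The function name `collect_fastqc_outputs` is hypothetical.

```shell
# Move everything FastQC produced in the current directory into fastqc_out/.
# Assumption: outputs are named *_fastqc.html, *_fastqc.zip, or *_fastqc (directory).
# (collect_fastqc_outputs is a hypothetical helper name used for illustration.)
collect_fastqc_outputs() {
    local item
    mkdir -p fastqc_out
    for item in *_fastqc.html *_fastqc.zip *_fastqc; do
        # A pattern that matched nothing stays literal, so skip non-existent names.
        if [ -e "$item" ]; then
            mv "$item" fastqc_out/
        fi
    done
}
```

A typical session would run `fastqc human_rnaseq.fastq.gz human_mirnaseq.fastq.gz` in the fastq_prep directory and then call `collect_fastqc_outputs` from the same directory. The FASTQ files themselves are left in place.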
Create a $SCRATCH/core_ngs/align directory and make a link to the fastq_prep directory.
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/align
cd $SCRATCH/core_ngs/align
ln -s -f ../fastq_prep fq
ls -l
ls fq
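To confirm that the fq link resolves and to get a feel for dataset size, you can count the reads in a compressed FASTQ file. The sketch below relies on the standard four-lines-per-read FASTQ layout; the function name `count_fastq_reads` is hypothetical.

```shell
# Count reads in a gzipped FASTQ file.
# Each FASTQ record occupies exactly 4 lines, so reads = lines / 4.
# (count_fastq_reads is a hypothetical helper name used for illustration.)
count_fastq_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}
```

From the align directory, `count_fastq_reads fq/human_rnaseq.fastq.gz` reads the file through the symbolic link. If the link is broken, gzip will report that the file does not exist.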
Reference Genomes
Before we get to alignment, we need a genome to align to. We will use three different references here:
- the human genome (hg19)
...
- the yeast genome (sacCer3)
...
- and miRBase (v20), human subset
miRBase is a collection of all known microRNAs in all species. We will use the human subset of that database as our alignment reference. This has the advantage of being significantly smaller than the human genome, while containing all the sequences we are actually interested in.
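For illustration, a human subset like this could be pulled out of a multi-species miRBase FASTA file with a small filter. This is a sketch under an assumption, not necessarily how the course reference was built: it assumes human entries are identified by headers beginning with `>hsa-`, the miRBase species prefix for Homo sapiens. The function name `extract_human_mirnas` is hypothetical.

```shell
# Keep only FASTA records whose header starts with ">hsa-" (human entries).
# Reads FASTA on stdin and writes the filtered FASTA to stdout.
# (extract_human_mirnas is a hypothetical helper name used for illustration.)
extract_human_mirnas() {
    # On each header line, decide whether to keep the record;
    # sequence lines inherit the decision made for their header.
    awk '/^>/ { keep = ($0 ~ /^>hsa-/) } keep'
}
```

Usage would look like `extract_human_mirnas < all_species.fa > human_only.fa`, where both file names are placeholders.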
These are the three reference genomes we will be using today, with some information about them (and here is information about many more genomes):
...