Page History

$SCRATCH/core_ngs/references

Table of Contents

Overview

...

This should be second nature by now

Sample

...

datasets

First stage the sample datasets and references we will use.You have already worked with a paired-end yeast ChIP-seq dataset, which we will continue to use here. For the sake of uniformity, however, we will set up a new directory in your scratch area called 'alignment' and fill it with our sequencing data. To set up your scratch area properly and move into it, execute something like:

Code Block

language	bash
title	"Permanent" location for original data

mkdir -p $SCRATCH/core_ngs/alignment
cd/fastq
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/

mkdir fastq

Now you have created the alignment directory, moved into it, and created a subdirectory for our raw fastq files. We will be using four data sets that consist of five files (since the paired-end data set has two separate files for each of the R1 and R2 reads). To copy them over, execute something like:

Code Block

language	bash

cd $SCRATCH/core_ngs/alignment/fastq
cp /corral-repl/utexas/BioITeam/core_ngs_tools/alignment/*fastq.gz .

We first moved ourselves into the fastq directory, then copied over all files that end in "fastq.gz" in the corral directory specified in the second line. These are descriptions of the five files we copied:

 -p $SCRATCH/core_ngs/references
cp $CORENGS/references/* $SCRATCH/core_ngs/references/

These are descriptions of the files we copied:

File Name	Description	Sample
Sample_Yeast_L005_R1.cat.fastq.gz	Paired-end Illumina, First
File Name	Description	Sample
Sample_Yeast_L005_R1.cat.fastq.gz	Paired-end Illumina, First of pair, FASTQ	Yeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gz	Paired-end Illumina, Second of pair, FASTQ	Yeast ChIP-seq
human_rnaseq.fastq.gz	Paired-end Illumina, First of pair only, FASTQ	Human RNA-seq
human_mirnaseq.fastq.gz	Single-end Illumina, FASTQ	Human microRNA-seq
cholera_rnaseq.fastq.gz	Single-end Illumina, FASTQ	V. cholerae RNA-seq

...

Before we get to alignment, we need a genome reference to align to. We will use four different references here:

the yeast genome (sacCer3)
the human genome (hg19)
a Vibrio cholerae genome (0395; our name: vibCho)
the microRNA database mirbase (v20), human subset

This is usually an organism's genome, but can also be any set of names sequences, such as a transcriptome or other set of genes.

Here are the four reference genomes we will be using today, with some information about them. These NOTE: For the sake of simplicity, these are not necessarily the most recent versions of these references - for example, hg19 is the second most recent human genome, with the most recent called hg38. Similarly, the most recent mirbase annotation is v21.These are the four reference genomes we will be using today, with some information about them (and here is (e.g. the newest human reference genome is hg38 and the most recent miRBase annotation is v21. (See here for information about many more genomes.):

Reference	Species	Base Length	Contig Number	Source	Download
hg19	Human	3.1 Gbp	25 (really 93)	UCSC	UCSC GoldenPath
sacCer3	Yeast	12.2 Mbp	17	UCSC	UCSC GoldenPath
mirbase V20v20	Human subset	160 Kbp	1908	MirbasemiRBase	Mirbase miRBase Downloads
vibCho (O395)	V. Vibrio cholerae	~4 Mbp	2	GenBank	GenBank Downloads

Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So So aligners require that references first be indexed to accelerate later retrievallookup. The The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This

Building a reference index involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the to perform the sequence alignment, and subsequent alignments are reported using coordinates referencing the original FASTA reference files.

hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time!). Instead, we will "point" to an existing set of hg19 index files, which are all located at:

Code Block

language	bash
title	BWA hg19 index location

/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19

names and offset positions based on the original FASTA file contig entries.

hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time!). Instead, we will use an existing set of BWA hg19 index files which the BioITeam maintains located atWe can index the references for the yeast genome, the human miRNAs, and the V. cholerae genome, because they are all tiny compared to the human genome. We will grab the FASTA files for yeast and human miRNAs two references and build each index right before we use them. We will also grab the special file that contains the V. cholerae genome sequence and annotations (a .gbk file), and generate the reference FASTA and some other interesting information when we get to that exercise. These references are currently at the following locations:

Code Block

language	bash
title	Yeast and mirbase FASTA locations

/corral-repl/utexas/BioITeam/core_ngs_tools/references/sacCer3.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/hairpin_cDNA_hsa.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/vibCho.O395.gbk

BWA hg19 index location

/work/projects/BioITeam/ref_genome/bwa/bwtsw/hg19

We can index the references for the yeast genome, the human miRNAs, and the V. cholerae genome, because they are all tiny compared to the human genome, so we'll grab the FASTA files for yeast and human miRNAs references and build each index right before we use them. We will also obtain the special GenBank file that contains both the V. cholerae genome sequence and annotations (a .gbk file). These files, which you staged above, areFirst stage all the reference files in your $WORK core_ngs area in a directory called references. We will add further structure to this directory later on in specific exercises, but for now the following will suffice:

Code Block

language	bash
title	Stage Yeast and mirbase FASTA references

mkdir -p $WORK/core_ngs/references
cp $CLASSDIR/references/* $WORK/core_ngs/references
$CLASSDIR = /corral-repl/utexas/BioITeam/core_ngs_tools

...

locations

/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/sacCer3.fa
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/hairpin_cDNA_hsa.fa
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/vibCho.O395.gbk

Exercise #1: BWA global alignment – Yeast ChIP-seq

...

We will perform a global alignment of the paired-end Yeast ChIP-seq sequences using bwa. This workflow generally has the following steps:

Trim the FASTQ sequences down to 50 with fastx_clipper
- this removes most of any 5' adapter contamination without the fuss of specific adapter trimming w/cutadapt
Prepare the sacCer3 reference index for bwa using bwa index
- this is done once, and re-used for later alignments
Perform a global bwa alignment on the R1 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
Perform a global bwa alignment on the R2 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
Perform pairing of the separately aligned reads and report the alignments in SAM format using bwa sampe
Convert the SAM file to a BAM file (samtools view)
Sort the BAM file by genomic location (samtools sort)
Index the BAM file (samtools index)
Gather simple alignment statistics (samtools flagstat and samtools idxstat)

We're going to skip the trimming step for now and see how it goes. We'll perform steps 2 - 5 now and leave samtools for the next course section, exercise since those steps ( 6 - 10 ) are common to nearly all post-alignment workflows.

...

Page tree

Versions Compared

Old Version 128

New Version 129

Key

Overview

Sample

datasets

Exercise #1: BWA global alignment – Yeast ChIP-seq