Page History

...

Code Block

language	bash
title	"Permanent" location for original data

mkdir $SCRATCH/core_ngs/alignment
cd $SCRATCH/core_ngs/alignment
mkdir fastq

Now you have created the alignment directory, moved into it, and created a subdirectory for our raw fastq files. We will be using four data sets that consist of five files (since the paired-end data set has two separate files for each of the R1 and R2 reads). To copy them over, execute something like:

Code Block
cd $SCRATCH/core_ngs/alignment/fastq cp /corral-repl/utexas/BioITeam/core_ngs_tools/alignment/*fastq.gz .

...

File Name	Description	Sample
Sample_Yeast_L005_R1.cat.fastq.gz	Paired-end Illumina, First of pair, FASTQ	Yeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gz	Paired-end Illumina, Second of pair, FASTQ	Yeast ChIP-seq
human_rnaseq.fastq.gz	Paired-end Illumina, First of pair only, FASTQ	Human RNA-seq
human_mirnaseq.fastq.gz	Single-end Illumina, FASTQ	Human microRNA-seq
cholera_rnaseq.fastq.gz	Single-end Illumina, FASTQ	V. cholerae RNA-seq

First copy the two human datasets to your $SCRATCH/core_ngs/fastq_prep directory.

Code Block

language	bash
title	Stage human FASTQ data

cd $SCRATCH/core_ngs/fastq_prep
cp $CLASSDIR/human_stuff/*rnaseq.fastq.gz .

Create a $SCRATCH/core_ngs/align directory and make a link to the fastq_prep directory.

Code Block

language	bash
title	Prepare align directory

mkdir -p $SCRATCH/core_ngs/align
cd $SCRATCH/core_ngs/align
ln -s -f ../fastq_prep fq
ls -l 
ls fq

Reference Genomes

Before we get to alignment, we need a genome to align to. We will use three different references here:

the human genome (hg19)
the yeast genome (sacCer3)
and mirbase (v20), human subset

Reference Genomes

Before we get to alignment, we need a genome to align to. We will use four different references here:

the human genome (hg19)
the yeast genome (sacCer3)
the microRNA database mirbase (v20), human subset
a Vibrio cholerae genome (0395; our name: vibCho)

NOTE: For the sake of simplicity, these NOTE: These are not necessarily the most recent versions of these references - for example, hg19 is the second most recent human genome, with the most recent called hg38. Similarly, the most recent mirbase annotation is v21.

...

Expand

title	If it's simpler and faster, would one ever want to align a miRNA dataset to hg19 rather than mirbase? If so, why?

Due to natural variation, sequencing errors, and processing issues, variation between reference sequence and sample sequence is always possible. Alignment to the human genome allows a putative "microRNA" read the opportunity to find a better alignment in a region of the genome that is not an annotated microRNA. If this occurs, we might think that a read represents a microRNA (since it aligned in the mirbase alignment), when it is actually more likely to have come from a non-miRNA area of the genome. This is a major complication involved when determining, for example, whether a potential miRNA is produced from a repetitive region.
If we suspect our library contained other RNA species, we may want to quantify the level of "contamination". Aligning to the human genome will allow rRNA, tRNA, snoRNA, etc to align. We can then use programs such as bedtools, coupled with appropriate genome annotation files, to quantify these "off-target" hits. This is particularly plausible if, after a miRNA sequencing run, the alignment rate to mirbase is very low.

These are the three four reference genomes we will be using today, with some information about them (and here is information about many more genomes):

Reference	Species	Base Length	Contig Number	Source	Download
hg19	Human	3.1 Gbp	25 (really 93)	UCSC	UCSC GoldenPath
sacCer3	Yeast	12.2 Mbp	17	UCSC	UCSC GoldenPath
mirbase V20	Human	160 Kbp	1908	Mirbase		Human	160 Kbp	1908	Mirbase	Mirbase Downloads
vibCho (O395)	V. cholerae	~4 Mbp	2	GenBank	GenBank Mirbase Downloads

Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So aligners require that references first be indexed to accelerate later retrieval. The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This requires involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing some an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the sequence alignment, and subsequent alignments are reported using language coordinates referencing positions in the input original FASTA filereference files.

hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time!). Instead, all we will "point" to an existing set of hg19 index files, which are all located at:

Code Block

language	bash
title	BWA hg19 index location

/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19

However, we can index the references for the yeast genome, the human miRNAs, and the V. cholerae genome and human miRNAs, because they are much smaller. We will grab the FASTA files for the other two references and build each index right before we use. These two reference FASTA files are located at, because they are all tiny compared to the human genome. We will grab the FASTA files for yeast and human miRNAs two references and build each index right before we use them. We will also grab the special file that contains the V. cholerae genome sequence and annotations (a "gbk" file), and generate the reference FASTA and some other interesting information when we get to that exercise. These references are currently at the following locations:

Code Block

language	bash
title	Yeast and mirbase FASTA locations

/corral-repl/utexas/BioITeam/core_ngs_tools/references/sacCer3.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/hairpin_cDNA_hsa.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/vibCho.O395.gbk

First stage all the yeast and mirbase reference FASTA files in your work archive "core_ngs" area in a directory called references.

We will add further structure to this directory later on in specific exercises, but for now the following will suffice:

Code Block

language	bash
title	Stage FASTA references

mkdir -p $WORK/archivecore_ngs/references/fasta
cp $CLASSDIR/references/*.fa $WORK/archive/references/fasta/

...

Page tree

Versions Compared

Old Version 74

New Version 75

Key

Reference Genomes

Reference Genomes