Page tree

Versions Compared


  • This line was added.
  • This line was removed.
  • Formatting was changed.


Code Block
title"Permanent" location for original data
mkdir $SCRATCH/core_ngs/alignment
cd $SCRATCH/core_ngs/alignment
mkdir fastq

Now you have created the alignment directory, moved into it, and created a subdirectory for our raw fastq files.  We will be using four data sets that consist of five files (since the paired-end data set has two separate files for each of the R1 and R2 reads).  To copy them over, execute something like:

Code Block
cd $SCRATCH/core_ngs/alignment/fastq
cp /corral-repl/utexas/BioITeam/core_ngs_tools/alignment/*fastq.gz .


File NameDescriptionSample Illumina, First of pair, FASTQYeast ChIP-seq Illumina, Second of pair, FASTQYeast ChIP-seq
human_rnaseq.fastq.gzPaired-end Illumina, First of pair only, FASTQHuman RNA-seq
human_mirnaseq.fastq.gzSingle-end Illumina, FASTQHuman microRNA-seq
cholera_rnaseq.fastq.gzSingle-end Illumina, FASTQV. cholerae RNA-seq

First copy the two human datasets to your $SCRATCH/core_ngs/fastq_prep directory.

Code Block
titleStage human FASTQ data
cd $SCRATCH/core_ngs/fastq_prep
cp $CLASSDIR/human_stuff/*rnaseq.fastq.gz .

Create a $SCRATCH/core_ngs/align directory and make a link to the fastq_prep directory.

Code Block
titlePrepare align directory
mkdir -p $SCRATCH/core_ngs/align
cd $SCRATCH/core_ngs/align
ln -s -f ../fastq_prep fq
ls -l 
ls fq

Reference Genomes

Before we get to alignment, we need a genome to align to.  We will use three different references here: 

  • the human genome (hg19)
  • the yeast genome (sacCer3)
  • and mirbase (v20), human subset


Reference Genomes

Before we get to alignment, we need a genome to align to.  We will use four different references here: 

  • the human genome (hg19)
  • the yeast genome (sacCer3)
  • the microRNA database mirbase (v20), human subset
  • a Vibrio cholerae genome (0395; our name: vibCho)

NOTE: For the sake of simplicity, these are not necessarily the most recent versions of these references - for example, hg19 is the second most recent human genome, with the most recent called hg38.  Similarly, the most recent mirbase annotation is v21.


titleIf it's simpler and faster, would one ever want to align a miRNA dataset to hg19 rather than mirbase? If so, why?
  1. Due to natural variation, sequencing errors, and processing issues, variation between reference sequence and sample sequence is always possible. Alignment to the human genome allows a putative "microRNA" read the opportunity to find a better alignment in a region of the genome that is not an annotated microRNA. If this occurs, we might think that a read represents a microRNA (since it aligned in the mirbase alignment), when it is actually more likely to have come from a non-miRNA area of the genome. This is a major complication involved when determining, for example, whether a potential miRNA is produced from a repetitive region.
  2. If we suspect our library contained other RNA species, we may want to quantify the level of "contamination". Aligning to the human genome will allow rRNA, tRNA, snoRNA, etc to align. We can then use programs such as bedtools, coupled with appropriate genome annotation files, to quantify these "off-target" hits. This is particularly plausible if, after a miRNA sequencing run, the alignment rate to mirbase is very low.

These are the three four reference genomes we will be using today, with some information about them (and here is information about many more genomes):

ReferenceSpeciesBase LengthContig NumberSourceDownload
hg19Human3.1 Gbp25 (really 93)UCSCUCSC GoldenPath
sacCer3Yeast12.2 Mbp17UCSCUCSC GoldenPath
mirbase V20Human160 Kbp1908MirbaseHuman160 Kbp1908MirbaseMirbase Downloads
vibCho (O395)V. cholerae~4 Mbp2GenBankGenBank Downloads

Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence.  So aligners require that references first be indexed to accelerate later retrieval.  The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This requires involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing some an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the sequence alignment, and subsequent alignments are reported using language coordinates referencing positions in the input original FASTA filereference files. 

hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time!). Instead, all we will "point" to an existing set of hg19 index files, which are all located at:

Code Block
titleBWA hg19 index location

However, we can index the references for the yeast genome, the human miRNAs, and the V. cholerae genome and human miRNAs, because they are much smaller.  We will grab the FASTA files for the other two references and build each index right before we use. These two reference FASTA files are located at, because they are all tiny compared to the human genome.  We will grab the FASTA files for yeast and human miRNAs two references and build each index right before we use them.  We will also grab the special file that contains the V. cholerae genome sequence and annotations (a "gbk" file), and generate the reference FASTA and some other interesting information when we get to that exercise. These references are currently at the following locations:

Code Block
titleYeast and mirbase FASTA locations

First stage all the yeast and mirbase reference FASTA files in your work archive  "core_ngs" area in a directory called references.


 We will add further structure to this directory later on in specific exercises, but for now the following will suffice:

Code Block
titleStage FASTA references
mkdir -p $WORK/archivecore_ngs/references/fasta
cp $CLASSDIR/references/*.fa $WORK/archive/references/fasta/