$SCRATCH/core_ngs/references
Table of Contents |
---|
Overview
...
This should be second nature by now
Sample
...
datasets
First stage the sample datasets and references we will use.You have already worked with a paired-end yeast ChIP-seq dataset, which we will continue to use here. For the sake of uniformity, however, we will set up a new directory in your scratch area called 'alignment' and fill it with our sequencing data. To set up your scratch area properly and move into it, execute something like:
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/alignment cd/fastq cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/ mkdir fastq |
Now you have created the alignment directory, moved into it, and created a subdirectory for our raw fastq files. We will be using four data sets that consist of five files (since the paired-end data set has two separate files for each of the R1 and R2 reads). To copy them over, execute something like:
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/alignment/fastq
cp /corral-repl/utexas/BioITeam/core_ngs_tools/alignment/*fastq.gz . |
We first moved ourselves into the fastq directory, then copied over all files that end in "fastq.gz" in the corral directory specified in the second line. These are descriptions of the five files we copied:
-p $SCRATCH/core_ngs/references
cp $CORENGS/references/* $SCRATCH/core_ngs/references/ |
These are descriptions of the files we copied:
File Name | Description | Sample |
---|---|---|
Sample_Yeast_L005_R1.cat.fastq.gz | Paired-end Illumina, First | |
File Name | Description | Sample |
Sample_Yeast_L005_R1.cat.fastq.gz | Paired-end Illumina, First of pair, FASTQ | Yeast ChIP-seq |
Sample_Yeast_L005_R2.cat.fastq.gz | Paired-end Illumina, Second of pair, FASTQ | Yeast ChIP-seq |
human_rnaseq.fastq.gz | Paired-end Illumina, First of pair only, FASTQ | Human RNA-seq |
human_mirnaseq.fastq.gz | Single-end Illumina, FASTQ | Human microRNA-seq |
cholera_rnaseq.fastq.gz | Single-end Illumina, FASTQ | V. cholerae RNA-seq |
...
Before we get to alignment, we need a genome reference to align to. We will use four different references here:
- the yeast genome (sacCer3)
- the human genome (hg19)
- a Vibrio cholerae genome (0395; our name: vibCho)
- the microRNA database mirbase (v20), human subset
This is usually an organism's genome, but can also be any set of names sequences, such as a transcriptome or other set of genes.
Here are the four reference genomes we will be using today, with some information about them. These NOTE: For the sake of simplicity, these are not necessarily the most recent versions of these references - for example, hg19 is the second most recent human genome, with the most recent called hg38. Similarly, the most recent mirbase annotation is v21.These are the four reference genomes we will be using today, with some information about them (and here is (e.g. the newest human reference genome is hg38 and the most recent miRBase annotation is v21. (See here for information about many more genomes.):
Reference | Species | Base Length | Contig Number | Source | Download |
---|---|---|---|---|---|
hg19 | Human | 3.1 Gbp | 25 (really 93) | UCSC | UCSC GoldenPath |
sacCer3 | Yeast | 12.2 Mbp | 17 | UCSC | UCSC GoldenPath |
mirbase V20v20 | Human subset | 160 Kbp | 1908 | MirbasemiRBase | Mirbase miRBase Downloads |
vibCho (O395) | V. Vibrio cholerae | ~4 Mbp | 2 | GenBank | GenBank Downloads |
Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So So aligners require that references first be indexed to accelerate later retrievallookup. The The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This
Building a reference index involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the to perform the sequence alignment, and subsequent alignments are reported using coordinates referencing the original FASTA reference files.
hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time!). Instead, we will "point" to an existing set of hg19 index files, which are all located at:
Code Block | ||||
---|---|---|---|---|
| ||||
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19 |
names and offset positions based on the original FASTA file contig entries.
hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time!). Instead, we will use an existing set of BWA hg19 index files which the BioITeam maintains located atWe can index the references for the yeast genome, the human miRNAs, and the V. cholerae genome, because they are all tiny compared to the human genome. We will grab the FASTA files for yeast and human miRNAs two references and build each index right before we use them. We will also grab the special file that contains the V. cholerae genome sequence and annotations (a .gbk file), and generate the reference FASTA and some other interesting information when we get to that exercise. These references are currently at the following locations:
Code Block | ||||
---|---|---|---|---|
| ||||
/corral-repl/utexas/BioITeam/core_ngs_tools/references/sacCer3.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/hairpin_cDNA_hsa.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/vibCho.O395.gbk | ||||
| ||||
/work/projects/BioITeam/ref_genome/bwa/bwtsw/hg19 |
We can index the references for the yeast genome, the human miRNAs, and the V. cholerae genome, because they are all tiny compared to the human genome, so we'll grab the FASTA files for yeast and human miRNAs references and build each index right before we use them. We will also obtain the special GenBank file that contains both the V. cholerae genome sequence and annotations (a .gbk file). These files, which you staged above, areFirst stage all the reference files in your $WORK core_ngs area in a directory called references. We will add further structure to this directory later on in specific exercises, but for now the following will suffice:
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $WORK/core_ngs/references
cp $CLASSDIR/references/* $WORK/core_ngs/references
$CLASSDIR = /corral-repl/utexas/BioITeam/core_ngs_tools |
...
| |
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/sacCer3.fa
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/hairpin_cDNA_hsa.fa
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/vibCho.O395.gbk |
Exercise #1: BWA global alignment – Yeast ChIP-seq
...
We will perform a global alignment of the paired-end Yeast ChIP-seq sequences using bwa. This workflow generally has the following steps:
- Trim the FASTQ sequences down to 50 with fastx_clipper
- this removes most of any 5' adapter contamination without the fuss of specific adapter trimming w/cutadapt
- Prepare the sacCer3 reference index for bwa using bwa index
- this is done once, and re-used for later alignments
- Perform a global bwa alignment on the R1 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
- Perform a global bwa alignment on the R2 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
- Perform pairing of the separately aligned reads and report the alignments in SAM format using bwa sampe
- Convert the SAM file to a BAM file (samtools view)
- Sort the BAM file by genomic location (samtools sort)
- Index the BAM file (samtools index)
- Gather simple alignment statistics (samtools flagstat and samtools idxstat)
We're going to skip the trimming step for now and see how it goes. We'll perform steps 2 - 5 now and leave samtools for the next course section, exercise since those steps ( 6 - 10 ) are common to nearly all post-alignment workflows.
...