...
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir $SCRATCH/core_ngs/alignment
cd $SCRATCH/core_ngs/alignment
mkdir fastq |
Now you have created the alignment directory, moved into it, and created a subdirectory for our raw fastq files. We will be using four data sets that consist of five files (since the paired-end data set has separate files for the R1 and R2 reads). To copy them over, execute something like:
Code Block |
---|
cd $SCRATCH/core_ngs/alignment/fastq
cp /corral-repl/utexas/BioITeam/core_ngs_tools/alignment/*fastq.gz . |
...
File Name | Description | Sample |
---|---|---|
Sample_Yeast_L005_R1.cat.fastq.gz | Paired-end Illumina, First of pair, FASTQ | Yeast ChIP-seq |
Sample_Yeast_L005_R2.cat.fastq.gz | Paired-end Illumina, Second of pair, FASTQ | Yeast ChIP-seq |
human_rnaseq.fastq.gz | Paired-end Illumina, First of pair only, FASTQ | Human RNA-seq |
human_mirnaseq.fastq.gz | Single-end Illumina, FASTQ | Human microRNA-seq |
cholera_rnaseq.fastq.gz | Single-end Illumina, FASTQ | V. cholerae RNA-seq |
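Once the files are copied, it can be useful to check how many reads each one holds. The following is a minimal sketch, assuming the standard four-lines-per-read FASTQ layout; `count_fastq_reads` is a hypothetical helper name (not one of the course tools), and the tiny demo file is made up for illustration:

```shell
# Count the reads in a gzipped FASTQ file.
# Assumes each read occupies exactly 4 lines (name, sequence, '+', qualities).
# count_fastq_reads is a hypothetical helper, not part of the course tools.
count_fastq_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Demonstration on a made-up two-read file in a temporary directory.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo.fastq.gz"
demo_reads=$(count_fastq_reads "$demo_dir/demo.fastq.gz")
echo "demo.fastq.gz contains $demo_reads reads"
```

The same function could be pointed at any of the files in the table above, for example `count_fastq_reads human_rnaseq.fastq.gz`.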
First copy the two human datasets to your $SCRATCH/core_ngs/fastq_prep directory.
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/fastq_prep
cp $CLASSDIR/human_stuff/*rnaseq.fastq.gz . |
Create a $SCRATCH/core_ngs/align directory and make a link to the fastq_prep directory.
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/align
cd $SCRATCH/core_ngs/align
ln -s -f ../fastq_prep fq
ls -l
ls fq |
...
Reference Genomes
Before we get to alignment, we need a genome to align to. We will use four different references here:
- the human genome (hg19)
- the yeast genome (sacCer3)
- the microRNA database mirbase (v20), human subset
- a Vibrio cholerae genome (0395; our name: vibCho)
NOTE: For the sake of simplicity, these are not necessarily the most recent versions of these references - for example, hg19 is the second most recent human genome, with the most recent called hg38. Similarly, the most recent mirbase annotation is v21.
...
These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes):
Reference | Species | Base Length | Contig Number | Source | Download |
---|---|---|---|---|---|
hg19 | Human | 3.1 Gbp | 25 (really 93) | UCSC | UCSC GoldenPath |
sacCer3 | Yeast | 12.2 Mbp | 17 | UCSC | UCSC GoldenPath |
mirbase V20 | Human | 160 Kbp | 1908 | Mirbase | Mirbase Downloads |
vibCho (O395) | V. cholerae | ~4 Mbp | 2 | GenBank | GenBank Downloads |
Searching genomes is computationally hard work and takes a long time if done on un-indexed, linear genomic sequence. So aligners require that references first be indexed to accelerate later retrieval. The aligners we are using each require a different index, but use the same method (the Burrows-Wheeler Transform) to get the job done. This involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used by the aligner when performing the sequence alignment, and subsequent alignments are reported using coordinates referencing positions in the original FASTA reference files.
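Because each chromosome or contig is a separate FASTA entry, a quick look at the header lines shows which names alignment coordinates will refer to. The following is a rough sketch on a made-up toy FASTA file; `fasta_contig_count` and `fasta_contig_names` are hypothetical helper names introduced here for illustration:

```shell
# Count the entries (contigs) in a FASTA file: one '>' header line per entry.
fasta_contig_count() {
    grep -c '^>' "$1"
}

# List the contig names: the header text up to the first space, without '>'.
fasta_contig_names() {
    grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1
}

# Demonstration on a made-up two-contig reference in a temporary directory.
ref_dir=$(mktemp -d)
printf '>chrI toy contig\nACGTACGT\nACGT\n>chrII\nGGCC\n' > "$ref_dir/toy.fa"
toy_count=$(fasta_contig_count "$ref_dir/toy.fa")
toy_names=$(fasta_contig_names "$ref_dir/toy.fa" | paste -sd, -)
echo "toy.fa has $toy_count contigs: $toy_names"
```

Run against a real reference, the count should correspond to the Contig Number column in the table above.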
hg19 is way too big for us to index here, so we're not going to do it (especially not all at the same time!). Instead, we will "point" to an existing set of hg19 index files, which are all located at:
Code Block | ||||
---|---|---|---|---|
| ||||
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19 |
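Before pointing an aligner at a pre-built index, it can help to confirm that the index is complete. The sketch below assumes a BWA index, which `bwa index` writes as five files sharing a common prefix with the extensions `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa`. `check_bwa_index` is a hypothetical helper name, and the prefix used inside the hg19 directory is not stated here, so the demonstration uses made-up placeholder files:

```shell
# Report any missing BWA index files for a given index prefix.
# check_bwa_index is a hypothetical helper; returns 0 only if all files exist.
check_bwa_index() {
    local prefix=$1 ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -e "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext"
            missing=1
        fi
    done
    return $missing
}

# Demonstration with empty placeholder files (not a real index).
idx_dir=$(mktemp -d)
for ext in amb ann bwt pac sa; do
    touch "$idx_dir/toy.fa.$ext"
done
check_bwa_index "$idx_dir/toy.fa" && echo "toy index looks complete"
```

For a real index, the argument would be the directory above followed by whatever prefix the index files there share.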
We can, however, index the references for the yeast genome, the human miRNAs, and the V. cholerae genome, because they are all tiny compared to the human genome. We will grab the FASTA files for the yeast and human miRNA references and build each index right before we use them. We will also grab the special file that contains the V. cholerae genome sequence and annotations (a "gbk" file), and generate the reference FASTA and some other interesting information when we get to that exercise. These references are currently at the following locations:
Code Block | ||||
---|---|---|---|---|
| ||||
/corral-repl/utexas/BioITeam/core_ngs_tools/references/sacCer3.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/hairpin_cDNA_hsa.fa
/corral-repl/utexas/BioITeam/core_ngs_tools/references/vibCho.O395.gbk |
First stage the yeast and mirbase reference FASTA files in your work "core_ngs" area, in a directory called references.
We will add further structure to this directory later on in specific exercises, but for now the following will suffice:
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $WORK/core_ngs/references/fasta
cp $CLASSDIR/references/*.fa $WORK/core_ngs/references/fasta/ |
...