...
While the alignment procedure for prokaryotes is broadly analogous, the reference preparation process is somewhat different, and will involve use of a biologically-oriented scripting library called BioPerl. In this exercise, we will use some RNA-seq data from Vibrio cholerae, published last year on GEO here, and align it to a reference genome.
Overview of Vibrio cholerae alignment workflow with Bowtie2
Alignment of this prokaryotic data follows the workflow below. Here we will concentrate on steps 1 and 2.
- Prepare the vibCho reference index for bowtie2 from a GenBank files record using BioPerl
- Align reads using bowtie2, producing a SAM file
- Convert the SAM file to a BAM file (samtools view)
- Sort the BAM file by genomic location (samtools sort)
- Index the BAM file (samtools index)
- Gather simple alignment statistics (samtools flagstat and samtools idxstat)
Obtaining the GenBank record(s)
V. cholerae has two chromosomes. We download each separately.
- Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
- click on the Send down arrow (top right of page)
- select Complete Record
- select Clipboard as Destination
- click Add to Clipboard
- Perform these steps in your Terminal window
- mkdir $WORK/tmp; cd $WORK/tmp
- type wget then paste the URL from the clipboard
- Repeat steps 1 and 2 fot the 2nd chromosome
- NCBI URL is http://www.ncbi.nlm.nih.gov/nuccore/NC_012583
- you should now have 2 files, NC_012582 and NC_012583
- Combine the 2 files into one using cat
- cat NC_012582 NC_012583 > vibCho.gbk
Converting GenBank records into sequence (FASTA) and annotation (GFF) files
As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations. The The GenBank file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing. Go
Go ahead and look at some of the contents of a GenBank file with the following commands (execute these one at a time):
Code Block | ||
---|---|---|
| ||
cd $WORK/core_ngs/references less vibCho.O395.gbk grep -A 5 ORIGIN vibCho.O395.gbk |
...