Page History

...

While the alignment procedure for prokaryotes is broadly analogous, the reference preparation process is somewhat different, and will involve use of a biologically-oriented scripting library called BioPerl. In this exercise, we will use some RNA-seq data from Vibrio cholerae, published last year on GEO here, and align it to a reference genome.

Overview of Vibrio cholerae alignment workflow with Bowtie2

Alignment of this prokaryotic data follows the workflow below. Here we will concentrate on steps 1 and 2.

Prepare the vibCho reference index for bowtie2 from a GenBank files record using BioPerl
Align reads using bowtie2, producing a SAM file
Convert the SAM file to a BAM file (samtools view)
Sort the BAM file by genomic location (samtools sort)
Index the BAM file (samtools index)
Gather simple alignment statistics (samtools flagstat and samtools idxstat)

Obtaining the GenBank record(s)

V. cholerae has two chromosomes. We download each separately.

Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
- click on the Send down arrow (top right of page)
- select Complete Record
- select Clipboard as Destination
- click Add to Clipboard
Perform these steps in your Terminal window
- mkdir $WORK/tmp; cd $WORK/tmp
- type wget then paste the URL from the clipboard
  - should be http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
Repeat steps 1 and 2 fot the 2nd chromosome
- NCBI URL is http://www.ncbi.nlm.nih.gov/nuccore/NC_012583
- you should now have 2 files, NC_012582 and NC_012583
Combine the 2 files into one using cat
- cat NC_012582 NC_012583 > vibCho.gbk

Converting GenBank records into sequence (FASTA) and annotation (GFF) files

As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations. The The GenBank file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing. Go

Go ahead and look at some of the contents of a GenBank file with the following commands (execute these one at a time):

Code Block

language	bash

cd $WORK/core_ngs/references
less vibCho.O395.gbk
grep -A 5 ORIGIN vibCho.O395.gbk

...

Page tree

Versions Compared

Old Version 105

New Version 106

Key

Overview of Vibrio cholerae alignment workflow with Bowtie2

Obtaining the GenBank record(s)

Converting GenBank records into sequence (FASTA) and annotation (GFF) files