...

While the alignment procedure for prokaryotes is broadly analogous, the reference preparation process is somewhat different and uses a biology-oriented scripting library called BioPerl. In this exercise, we will take some RNA-seq data from Vibrio cholerae, published last year on GEO here, and align it to a reference genome.

Overview of Vibrio cholerae alignment workflow with Bowtie2

Alignment of this prokaryotic data follows the workflow below. Here we will concentrate on steps 1 and 2.

  1. Prepare the vibCho reference index for bowtie2 from a GenBank record using BioPerl
  2. Align reads using bowtie2, producing a SAM file
  3. Convert the SAM file to a BAM file (samtools view) 
  4. Sort the BAM file by genomic location (samtools sort)
  5. Index the BAM file (samtools index)
  6. Gather simple alignment statistics (samtools flagstat and samtools idxstats)
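
The steps above can be sketched as a single shell function. This is only an outline under stated assumptions, not the course's own script: the file names and index prefix are hypothetical placeholders, the reads are assumed to be single-end, samtools 1.x option syntax is assumed, and the BioPerl conversion from GenBank to FASTA in step 1 is not shown (the function starts from an existing FASTA file).

```bash
#!/usr/bin/env bash
# Sketch of workflow steps 1-6. All file names are hypothetical placeholders.
# Set DRY_RUN=1 to print each command instead of running it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

align_workflow() {
    local fasta=$1 reads=$2 prefix=$3

    # Step 1 (index only): build the bowtie2 index from the reference FASTA
    run bowtie2-build "$fasta" "$prefix" || return 1
    # Step 2: align reads (assumed single-end), producing a SAM file
    run bowtie2 -x "$prefix" -U "$reads" -S "$prefix.sam" || return 1
    # Step 3: convert SAM to BAM
    run samtools view -b -o "$prefix.bam" "$prefix.sam" || return 1
    # Step 4: sort the BAM file by genomic location
    run samtools sort -o "$prefix.sorted.bam" "$prefix.bam" || return 1
    # Step 5: index the sorted BAM file
    run samtools index "$prefix.sorted.bam" || return 1
    # Step 6: gather simple alignment statistics
    run samtools flagstat "$prefix.sorted.bam" || return 1
    run samtools idxstats "$prefix.sorted.bam" || return 1
}
```

Running `DRY_RUN=1 align_workflow vibCho.fa reads.fastq vibCho` prints the seven commands without executing them, which is a cheap way to check the file naming before committing to a long alignment.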

Obtaining the GenBank record(s)

V. cholerae has two chromosomes. We download each separately.

  1. Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
    • click on the Send down arrow (top right of page)
    • select Complete Record
    • select Clipboard as Destination
    • click Add to Clipboard
  2. Perform these steps in your Terminal window
  3. Repeat steps 1 and 2 for the 2nd chromosome (NC_012583)
  4. Combine the 2 files into one using cat
    • cat NC_012582  NC_012583 > vibCho.gbk
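
After combining the files, it is worth confirming that both records made it into the output. The sketch below wraps the cat step in a hypothetical helper, `combine_genbank`, and counts the LOCUS lines, since every GenBank record begins with one. It assumes each input file holds exactly one record.

```bash
# Hypothetical helper: concatenate GenBank files and check the record count.
# Usage: combine_genbank OUTPUT INPUT [INPUT...]
combine_genbank() {
    local out=$1
    shift
    cat "$@" > "$out" || return 1

    # Every GenBank record starts with a LOCUS line
    local n
    n=$(grep -c '^LOCUS' "$out")
    echo "$n"

    # Succeed only if there is one record per input file (an assumption)
    [ "$n" -eq "$#" ]
}
```

For the two chromosomes here, `combine_genbank vibCho.gbk NC_012582 NC_012583` should print 2 and return success.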

Converting GenBank records into sequence (FASTA) and annotation (GFF) files

As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations.  The GenBank file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing.

Go ahead and look at some of the contents of a GenBank file with the following commands (execute these one at a time):

Code Block
languagebash
cd $WORK/core_ngs/references
less vibCho.O395.gbk
grep -A 5 ORIGIN vibCho.O395.gbk
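
To illustrate how the sequence portion of a GenBank file can be parsed, here is a minimal awk sketch wrapped in a hypothetical function, `genbank_to_fasta`. It is not the BioPerl conversion this exercise uses and it produces no annotation (GFF) output. It relies only on the flat-file layout seen above: a LOCUS line naming the record, an ORIGIN line opening the sequence block, numbered sequence lines, and a `//` line closing the record.

```bash
# Hypothetical helper: print the sequence of each GenBank record as FASTA.
# Usage: genbank_to_fasta FILE.gbk > FILE.fa
genbank_to_fasta() {
    awk '
        /^LOCUS/  { name = $2 }                          # remember the record name
        /^ORIGIN/ { in_seq = 1; print ">" name; next }   # sequence block starts
        /^\/\//   { in_seq = 0; next }                   # end of record
        in_seq    {                                      # drop the position number,
            seq = ""                                     # join the 10-base chunks
            for (i = 2; i <= NF; i++) seq = seq $i
            print seq
        }
    ' "$1"
}
```

The output keeps the lowercase bases and the line length of the GenBank file, with one FASTA header per record.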

...