Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Converting GenBank files into usable reference FASTA files and genome annotations (GFF files)

BioPerl is available through the module system, so load it up like we have done before, with the caveat that we load regular Perl before we load BioPerl.

Code Block
module load perl
module load bioperl

This makes several scripts directly available to you.  The one we will use is called "bp_seqconvert.pl," and it is used to interconvert file formats between FASTA, GBK, and others.  While we will use it to obtain a FASTA sequence, this same program can also be used to get gene annotations, among other genomic information, from the same GenBank file.

...

As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations.  The "Genbank" file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing.  Go ahead and look at some of the contents of a Genbank file with the following commands:

Code Block
cd $SCRATCH/core_ngs/references
less vibCho.O395.gbk
grep -A 5 ORIGIN vibCho.O395.gbk

As the 'less' command shows, the file begins with a description of the organism and some source information, and the contains annotations for each bacterial gene.  The 'grep' command shows that, indeed, there is sequence information here (flagged by the word ORIGIN) that could be exported into a FASTA file.  There are a couple ways of extracting the information we want, namely the reference genome and the gene annotation information, but a convenient one (that is available through the module system at TACC) is BioPerl.

We load BioPerl like we have loaded other modules, with the caveat that we must load regular Perl before loading BioPerl:

Code Block
module load perl
module load bioperl

These commands make several scripts directly available to you.  The one we will use is called bp_seqconvert.pl, and it is a BioPerl script used to interconvert file formats like FASTA, GBK, and others.  We will use this script to get both a FASTA format for indexing and alignment, and also a GFF file (standing for General Feature Format) that will contain information about all genes (or, more generally, features) in the genome.  To see how to use the script, just execute:

Code Block
bp_seqconvert.pl

Clearly, there are many file formats that we can use this script to convert.  In our case, we are moving "from" genbank "to" fasta, so the commands we would execute to produce and view the FASTA files would look like this:

Code Block
bp_seqconvert.pl --from genbank --to fasta < vibCho.O395.gbk > vibCho.O395.fa
grep ">" vibCho.O395.fa
less vibCho.O395.fa

Now we have a reference sequence file that we can use with the bowtie2 reference builder, and ultimately align sequence data against.  However, recall from when we viewed the Genbank file that there are genome annotations available as well that we would like to extract into GFF format.  However, the bp_seqconvert.pl script is designed to be used to convert sequence formats, not annotation formats.  Fortunately, there is another script called bp_genbank2gff3.pl that can take a Genbank file and produce a GFF3 (the most recent format convention for GFF files) file.  To run it and see the output, run these commands:

Code Block
bp_genbank2gff3.pl --format Genbank vibCho.O395.gbk
mv vibCho.O395.gbk.gff vibCho.O395.gff
less vibCho.O395.gff

After the header lines, each feature in the genome is represented by a line that gives chromosome, start, stop, strand, and other information.  Features are things like "mRNA," "CDS," and "EXON."  As you would expect in a prokaryotic genome it is frequently the case that the gene, mRNA, CDS, and exon annotations are identical, meaning they share coordinate information.  You could parse these files further using commands like grep  and awk  to extract, say, all exons from the full file or to remove the header lines that begin with "#".

Building the bowtie2 vibCho index

To build the reference index for alignment, we actually only need the FASTA file, since annotations are frequently not necessary for alignment purposes. (This is not always true, as extensively spliced transcriptomes requires splice junction annotations to align RNA-seq data properly, but for now we will only use the FASTA file.)  We build the reference files exactly as we did with mirbase, only swapping out the FASTA files as follows:

Code Block
bp_genbank2gff3.pl --format Genbank vibCho.O395.gbk
mv vibCho.O395.gbk.gff vibCho.O395.gff
less vibCho.O395.gff

 

 

Performing the bowtie2 alignment

...