Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Clearly, there are many file formats that we can use this script to convert.  In our case, we are moving from genbank to fasta, so the commands we would execute to produce and view the FASTA files would look like this:

Code Block
languagebash
bp_seqconvert.pl --from genbank --to fasta < vibCho.O395.gbk > vibCho.O395.fa
grep ">" vibCho.O395.fa
less vibCho.O395.fa

...

Recall from when we viewed the GenBank file that there are genome annotations available as well that we would like to extract into GFF format.  However, the bp_seqconvert.pl script is designed to be used to convert sequence formats, not annotation formats. Fortunately, there is another script called bp_genbank2gff3.pl that can take a GenBank file and produce a GFF3 (the most recent format convention for GFF files) file. To run it and see the output, run these commands:

Code Block
languagebash
bp_genbank2gff3.pl --format Genbank vibCho.O395.gbk
mv vibCho.O395.gbk.gff vibCho.O395.gff
less vibCho.O395.gff

After the header lines, each feature in the genome is represented by a line that gives chromosome, start, stop, strand, and other information.  Features are things like "mRNA," "CDS," and "EXON."  As you would expect in a prokaryotic genome it is frequently the case that the gene, mRNA, CDS, and exon annotations are identical, meaning they share coordinate information. You could parse these files further using commands like grep  and awk  to extract, say, all exons from the full file or to remove the header lines that begin with #.

 

Introducing bowtie2

 

Go ahead and load the bowtie2 module so we can examine some help pages and options. To do that, you must first load the perl module, and then the a specific version of bowtie2 

code
Code Block
language
bash
module load perl
module load bowtie/2.2.0

 Now that it's loaded, check out the options. There are a lot of them! In fact for the full range of options and their meaning, Google "Bowtie2 manual" and bring up that page. The Table of Contents is several pages long! Ouch!

 

This is the key to using bowtie2 - it allows you to control almost everything about its behavior, but that also makes it is much more challenging to use than bwa – and it's easier to screw things up too!

Building the bowtie2 vibCho index

Before the alignment, of course, we've got to build a mirbase index using bowtie2-build (go ahead and check out its options). Unlike for the aligner itself, we only need to worry about a few things here:

  • reference_in file is just the FASTA file containing mirbase v20 sequences
  • bt2_index_base is the prefix of where we want the files to go

To build the reference index for alignment, we actually only need the FASTA file, since annotations are often not necessary for alignment. (This is not always true - extensively spliced transcriptomes requires splice junction annotations to align RNA-seq data properly, but for now we will only use the FASTA file.)  We build the reference files exactly as we did with mirbase, only swapping out the FASTA files as follows:

Code Block
languagebash
mkdir -p $WORK/core_ngs/references/bt2/vibCho
mv $WORK/core_ngs/references/vibCho.O395.fa $WORK/core_ngs/references/fasta
cd $WORK/core_ngs/references/bt2/vibCho
ln -s -f ../../fasta/vibCho.O395.fa
ls -la

Now build the index using bowtie2-build like we did for the mirbase index:

Code Block
languagebash
titlePrepare Bowtie2 index files
bowtie2-build vibCho.O395.fa vibCho.O395

...

When the job is complete you should have a cholera_rnaseq.sam file that you can examine using whatever commands you like.  In the past two exercises, we have explored a few different sorts of commands that you might run on this file to get different sorts of information.  Those commands will provide similar information here, since this file is also in SAM/BAM format.  In the last exercise, we will come back to this SAM file (and the others) to explore ways to compress them effectively and to extract basic quality statistics.

Exercise

...

#3: Bowtie2

...

local alignment - Human microRNA-seq

Now we're going to switch over to a different aligner that was originally designed for very short reads and is frequently used for RNA-seq data. Accordingly, we have prepared another test microRNA-seq dataset for you to experiment with (not the same one you used cutadapt on). This data is derived from a human H1 embryonic stem cell (H1-hESC) small RNA dataset generated by the ENCODE Consortium – its about a half million reads.

...

Code Block
languagebash
titlePrepare Bowtie2 index directory for mirbase
mkdir -p $WORK/core_ngs/references/bt2/mirbase.v20
mv $WORK/core_ngs/references/hairpin_cDNA_hsa.fa $WORK/core_ngs/references/fasta
cd $WORK/core_ngs/references/bt2/mirbase.v20
ln -s -f ../../fasta/hairpin_cDNA_hsa.fa
ls -la 

Now build the mirbase index with bowtie2-build like we did for the V. cholerae index:

Code Block
languagebash
titlePrepare Bowtie2 index directory for mirbase
bowtie2-build hairpin_cDNA_hsa.fa hairpin_cDNA_hsa.fa

...

Code Block
-N 1 -L 16 --local

...


Expand
titleIf our reads were longer

Because these are short reads we do not have to adjust parameters like inter-seed distance (-i) or minimum alignment score (--min-score) that are a function of read length. If we were processing longer reads, we might need to use parameters like this, to force bowtie2 to "pretend" the read is a short, constant length:

-i C,1,0
--score-min=C,32,0

Yes, that looks complicated, and it kind of is. It's basically saying "slide the seed down the read one base at a time", and "report alignments as long as they have a minimum alignment score of 32 (16 matching bases x 2 points per match, minimum).

See the bowtie2 manual (after you have had a good stiff drink) for a full explanation.

...