Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleHint:
Code Block
mkdir $SCRATCH/references
cp -r /corral-repl/utexas/BioITeam/core_ngs_tools/references/*.fa $SCRATCH/references

With that, we're ready to get started on the first exercise.

...

Expand
titleHint:
Code Block
module load perl
module load bowtie/2.2.0
bowtie2

Now that it's loaded, check out the options.  There are a LOT of them!  This is the key to using Bowtie2 - it allows you to control almost everything about its behavior, but that also makes it easier to screw things up.  Before getting to using the tool, though, we've got to build a mirbase index.  This requires using the command "bowtie2-build" - go ahead and check out its options.  Unlike for the aligner itself, we only need to worry about a few things here:

Code Block
bowtie2-build <reference_in> <bt2_index_base>

Here, the reference_in file is just our FASTA file containing mirbase sequences, and the bt2_index_base is the prefix of where we want the files to go.  Following what we did earlier for BWA indexing:

Code Block
cd $SCRATCH/references/
mkdir mirbase
mv hairpin_cDNA_hsa.fa mirbase
cd mirbase
bowtie2-build hairpin_cDNA_hsa.fa hairpin_cDNA_hsa.fa

That was very fast!  It's because the mirbase reference genome is so small compared to what programs like this are used to dealing with, which is the human genome (or bigger).  Now, your $SCRATCH/references/mirbase directory should be filled with the following files:

Code Block
hairpin_cDNA_hsa.fa
hairpin_cDNA_hsa.fa.1.bt2
hairpin_cDNA_hsa.fa.2.bt2
hairpin_cDNA_hsa.fa.3.bt2
hairpin_cDNA_hsa.fa.4.bt2
hairpin_cDNA_hsa.fa.rev.1.bt2
hairpin_cDNA_hsa.fa.rev.2.bt2

Now, we're ready to actually try to do the alignment.  Remember, unlike BWA, we actually need to set some options depending on what we're after.  These are the most important options when using Bowtie2:

OptionEffect
-NControls the number of mismatches allowable in the seed of each alignment (default = 0)
-LControls the length of seed substrings generated from each read (default = 22)
--end-to-end or --localControls whether the entire read must align to the reference, or whether soft-clipping the ends is allowed to find internal alignments
-maControls the alignment score contribution of a matching base (0 for --end-to-end, 2 for --local

To decide how we want to go about doing our alignment, check out the file we're aligning with 'less'.

Expand
titleHint:
Code Block
cds
less fastq_align/human_mirnaseq.fastq.gz

Lots of those reads have long strings of A's, which must be an adapter or protocol artifact.  Even though we see how we might be able to fix it using some tools we've talked about, what if we had no idea what the adapter sequence was, or couldn't use cutadapt or other programs to prepare the reads?  In that case, we need a local alignment where the seed length is the small boundary of the acceptable internal alignments. Here, we are interested in finding any sections of any reads that align well to a microRNA.  These sequences are between 16 and 22 bases long, so any good alignment should have at least 16 matching bases, but could have more.  Also, maybe we want to allow a mismatch or two in the seed, since we might be interested in miRNA SNPs.  So, a good set of options might look something like this:

Code Block
-N 1 -L 16 --local

This leaves the default scoring method as "-ma 2", meaning that a 16 base pair alignment will have a score of 32, and so on.  This is VERY different from the alignment scores assigned by other aligners, so it's worth remembering.


As you can tell from looking at the Bowtie2 help message, the syntax looks like this:

 

Code Block
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

As such, our alignment command (now that we have the FASTQ file and the reference sequence ready) could be this (make sure you are located in your scratch directory!):

Code Block
cds
bowtie2 -N 1 -L 16 --local -x references/mirbase/hairpin_cDNA_hsa.fa -U fastq_align/human_mirnaseq.fastq.gz -S alignments/human_mirnaseq.sam

Now, you should have a human_mirnaseq.sam file in your alignments directory, that you can check out using whatever commands you like.  An example alignment looks like this:

Code Block
TUPAC_0037_FC62EE7AAXX:2:1:2607:1430#0/1        0       hsa-mir-302b    50      22      3S20M13S        *       0       0       TACGTGCTTCCATGTTTTANTAGAAAAAAAAAAAAG    ZZFQV]Z[\IacaWc]RZIBVGSHL_b[XQQcXQcc    AS:i:37 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:16G3       YT:Z:UU

Note how the CIGAR string is 3S20M13S, meaning that 13 bases were soft clipped from one end, and 3 from the other.  If we did the same alignment using either --end-to-end mode, or using BWA in the same way as we did in Exercise #1, very little of this file would have aligned.  However, if we had not lowered the seed parameter of Bowtie2 from its default of 22, we would not have found many of the alignments like the one shown above, because the read only matched for 20 bases - a matching 22 base seed does not exist.  Such is the nature of Bowtie2 - it can be a powerful tool to sift out the alignments you want from a messy dataset with limited information, but doing so requires careful tuning of the parameters, which can in itself take a lot of time to perfect.

Exercise #3: BWA-MEM (and Tophat2) - Human mRNA-seq

...