Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titlePrepare Bowtie2 index directory for mirbase
mkdir -p $WORK/archive/references/bt2/mirbase.v20
cd $WORK/archive/references/bt2/mirbase.v20
ln -s -f ../../fasta/hairpin_cDNA_hsa.fa
ls -la
 

Now build the index with bowtie2-build:

...

To decide how we want to go about doing our alignment, check out the file we're aligning with less:

expandcds

less 
fastq_align
fq/human_mirnaseq.fastq.gz
Code Block
language
bash
titleHint:Examine miRNA sequences to be aligned
cd $SCRATCH/core_ngs/align
Code Block

Lots of those reads have long strings of A's, which must be an adapter or protocol artifact.  Even though we see how we might be able to fix it using some tools we've talked about, what if we had no idea what the adapter sequence was, or couldn't use cutadapt or other programs to prepare the reads?

 In that case, we need a local alignment where the seed length is the small boundary of the acceptable internal alignmentssmaller than the expected insert size. Here, we are interested in finding any sections of any reads that align well to a microRNA.  These sequences , which are between 16 and 22 24 bases long, so any good with most 20-22. So an acceptable alignment should have at least 16 matching bases, but could have more.  Also, maybe we

If we're also interested in detecting miRNA SNPs, we might want to allow a mismatch or two in the seed, since we might be interested in miRNA SNPs.  So. So, a good set of options might look something like this:

Code Block
-N 1 -L 16 --local

...

 

Expand
titleIf our reads were longer

Because these are short reads we do not have to adjust parameters like inter-seed distance (-i) or minimum alignment score (--min-score) that are a function of read length. If we were processing longer reads, we might need to use parameters like this, to force bowtie2 to "pretend" the read is a short, constant length:

-i C,1,0
--score-min=C,32,0

Yes, that looks complicated, and it kind of is. See the bowtie2 manual (after you have had a good stiff drink) for a full explanation.

As you can tell from looking at the Bowtie2 bowtie2 help message, the general syntax looks like this:

Code Block
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

As such, our alignment command (now that we have the FASTQ file and the reference sequence ready) could be this (make sure you are located in your scratch directory!):

Putting this all together and linking to the index directory, we have this. Go ahead and execute this from the command line – it will be fast.

Code Block
ln -s -f $WORK/archive/reference/bt2/mirbase.v20 mb20
bowtie2 --local
Code Block
cds
bowtie2 -N 1 -L 16 --local -x references/mirbasemb20/hairpin_cDNA_hsa.fa -U fastq_alignfq/human_mirnaseq.fastq.gz -S alignments/human_mirnaseq.samhuman_mirnaseq.sam
Expand
titleWhat's going on?
  • --local – local alignment mode
  • -L 16 – seed length 16
  • -N 1 – allow 1 mismatch in the seed
  • -x <pfx> – prefix patch of index files
  • -U <fq> – FASTQ file for single-end (Unpaired) alignment
  • -S <out.sam> – tells bowtie2 to report alignments in SAM format to the specified file

Now you should have a humanNow, you should have a human_mirnaseq.sam file in your alignments directory, that you can check out examine using whatever commands you like.  An An example alignment looks like this:this (note this is one alignment record, although it has been broken up below for readability).

Code Block
TUPAC_0037_FC62EE7AAXX:2:1:2607:1430#0/1        0       hsa-mir-302b    50      22      3S20M13S        *       0       0
       TACGTGCTTCCATGTTTTANTAGAAAAAAAAAAAAG    ZZFQV]Z[\IacaWc]RZIBVGSHL_b[XQQcXQcc
    AS:i:37 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:16G3       YT:Z:UU

Notes:Note how

  • Notice the CIGAR string is 3S20M13S, meaning that

...

  • 3 bases were soft clipped from one end (3S), and

...

  • 13 from the other (13S).

...


    • If we did the same alignment using either bowtie2 --end-to-end mode, or using

...

    • bwa aln as in Exercise #1, very little of this file would have aligned

...

    • .
  • The 20M part of the CIGAR string says there was a block of 20 read bases that mapped to the reference.
    • If we had not lowered the seed parameter of Bowtie2 from its default of 22, we would not have found many of the alignments like

...

    • this one that only matched for 20 bases.

Such - a matching 22 base seed does not exist.  Such is the nature of Bowtie2 - bowtie2 – it it can be a powerful tool to sift out the alignments you want from a messy dataset with limited information, but doing so requires careful tuning of the parameters, which can in itself take quite a lot of time to perfectfew trials to figure out.

Exercise #3: BWA-MEM (and Tophat2) - Human mRNA-seq

...