...
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $WORK/archive/references/bt2/mirbase.v20 cd $WORK/archive/references/bt2/mirbase.v20 ln -s -f ../../fasta/hairpin_cDNA_hsa.fa ls -la |
Now build the index with bowtie2-build:
...
To decide how we want to go about doing our alignment, check out the file we're aligning with less:
Code Block | ||||
---|---|---|---|---|
| ||||
| ||||
cd $SCRATCH/core_ngs/align Code Block |
less fastq_alignfq/human_mirnaseq.fastq.gz |
Lots of those reads have long strings of A's, which must be an adapter or protocol artifact. Even though we see how we might be able to fix it using some tools we've talked about, what if we had no idea what the adapter sequence was, or couldn't use cutadapt or other programs to prepare the reads?
In that case, we need a local alignment where the seed length is the small boundary of the acceptable internal alignmentssmaller than the expected insert size. Here, we are interested in finding any sections of any reads that align well to a microRNA. These sequences , which are between 16 and 22 24 bases long, so any good with most 20-22. So an acceptable alignment should have at least 16 matching bases, but could have more. Also, maybe we
If we're also interested in detecting miRNA SNPs, we might want to allow a mismatch or two in the seed, since we might be interested in miRNA SNPs. So. So, a good set of options might look something like this:
Code Block |
---|
-N 1 -L 16 --local |
...
Expand | ||
---|---|---|
| ||
Because these are short reads we do not have to adjust parameters like inter-seed distance (-i) or minimum alignment score (--min-score) that are a function of read length. If we were processing longer reads, we might need to use parameters like this, to force bowtie2 to "pretend" the read is a short, constant length: -i C,1,0 Yes, that looks complicated, and it kind of is. See the bowtie2 manual (after you have had a good stiff drink) for a full explanation. |
As you can tell from looking at the Bowtie2 bowtie2 help message, the general syntax looks like this:
Code Block |
---|
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>] |
As such, our alignment command (now that we have the FASTQ file and the reference sequence ready) could be this (make sure you are located in your scratch directory!):
Putting this all together and linking to the index directory, we have this. Go ahead and execute this from the command line – it will be fast.
Code Block |
---|
ln -s -f $WORK/archive/reference/bt2/mirbase.v20 mb20
bowtie2 --local |
Code Block |
cds bowtie2 -N 1 -L 16 --local -x references/mirbasemb20/hairpin_cDNA_hsa.fa -U fastq_alignfq/human_mirnaseq.fastq.gz -S alignments/human_mirnaseq.samhuman_mirnaseq.sam |
Expand | ||
---|---|---|
| ||
|
Now you should have a humanNow, you should have a human_mirnaseq.sam file in your alignments directory, that you can check out examine using whatever commands you like. An An example alignment looks like this:this (note this is one alignment record, although it has been broken up below for readability).
Code Block |
---|
TUPAC_0037_FC62EE7AAXX:2:1:2607:1430#0/1 0 hsa-mir-302b 50 22 3S20M13S * 0 0 TACGTGCTTCCATGTTTTANTAGAAAAAAAAAAAAG ZZFQV]Z[\IacaWc]RZIBVGSHL_b[XQQcXQcc AS:i:37 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:16G3 YT:Z:UU |
Notes:Note how
- Notice the CIGAR string is 3S20M13S, meaning that
...
- 3 bases were soft clipped from one end (3S), and
...
- 13 from the other (13S).
...
- If we did the same alignment using either bowtie2 --end-to-end mode, or using
...
- bwa aln as in Exercise #1, very little of this file would have aligned
...
- .
- The 20M part of the CIGAR string says there was a block of 20 read bases that mapped to the reference.
- If we had not lowered the seed parameter of Bowtie2 from its default of 22, we would not have found many of the alignments like
...
- this one that only matched for 20 bases.
Such - a matching 22 base seed does not exist. Such is the nature of Bowtie2 - bowtie2 – it it can be a powerful tool to sift out the alignments you want from a messy dataset with limited information, but doing so requires careful tuning of the parameters, which can in itself take quite a lot of time to perfectfew trials to figure out.
Exercise #3: BWA-MEM (and Tophat2) - Human mRNA-seq
...