Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

As you can tell from looking at the Bowtie2 help message, the syntax looks like this:

 

Code Block
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

...

The test file we will be working with is JUST the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end.

Future Directions

 Go ahead and take a look at it, and find out how many reads are in the file.

Expand
titleHint:
Code Block
cds
less fastq_align/human_rnaaseq.fastq.gz
gunzip -c fastq_align/human_rnaseq.fastq.gz | echo $((`wc -l`/4))

Now, try aligning it with BWA like we did in example 1.  This won't take very long, but you'll need to use our pre-indexed hg19 reference.  It is located at:

Code Block
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19.fa

if you look at the contents of the 'bwtsw' directory in the above path, you'll see a set of files that are analogous to the yeast files we created earlier, except that their universal prefix is 'hg19.fa'.  Now, using that index, go ahead and try to do a single-end alignment of the file to the human genome like we did in Exercise #1, and save the contents to intermediate files with the prefix 'human_rnaseq_bwa'.

Expand
titleHint:
Code Block
cds
bwa aln /scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19/hg19.fa fastq_align/human_rnaseq.fastq.gz > alignments/human_rnaseq_bwa.sai
bwa samse /scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19/hg19.fa alignments/human_rnaseq_bwa.sai fastq_align/human_rnaseq.fastq.gz > alignments/human_rnaseq_bwa.sam

Now, use less to take a look at the contents of the sam file, using the space bar to leaf through them.  You'll notice a lot of alignments look basically like this:

Code Block
HWI-ST1097:228:C21WMACXX:8:1316:10989:88190     4       *       0       0       *       *       0       0
       AAATTGCTTCCTGTCCTCATCCTTCCTGTCAGCCATCTTCCTTCGTTTGATCTCAGGGAAGTTCAGGTCTTCCAGCCGCTCTTTGCCACTGATCTCCAGCT
   CCCFFFFFHHHHHIJJJJIJJJJIJJJJHJJJJJJJJJJJJJJIIIJJJIGHHIJIJIJIJHBHIJJIIHIEGHIIHGFFDDEEEDDCDDD@CDEDDDCDD

meaning they are not alignments at all.  Essentially, nothing (with a few exceptions) aligned.  That's because this file was generated exclusively from reads in a larger dataset that cross at least one splice junction (surprise!), meaning that they sequence as it exists in most of the reads does not exist anywhere in the genome, but some subsections of each read do exist somewhere in the genome.  So, we need an aligner that is capable of finding regions of each read (above some length cutoff) that align to the genome.  Based on the following syntax, and using the reference path, we noted earlier, try to use BWA MEM to align the same file, single-end, saving all output files with the prefix 'human_rnaseq_mem':

Code Block
bwa mem ref.fa reads.fq > aln-se.sam
Expand
titleHint:
Code Block
bwa mem /scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19/hg19.fa fastq_align/human_rnaseq.fastq.gz > alignments/human_rnaseq_mem.sam

Now, check the length of the SAM file you generated with 'wc -l'.  Since there is one alignment per line, there must be 586266 alignments, which is more than the number of sequences in the FASTQ file.  This is because many reads will align twice or more, hopefully on either side of a splice junction.  These alignments can still be associated because they will have the same read ID, but are reflected in more than one line.  To get an idea of how often each read aligned, and what the 'real' alignment rate is, use the following commands:

Code Block
cut -f 1 alignments/human_rnaseq_mem.sam | sort | uniq -c | less	#This gives you a view where each read is listed next to the number of entries it has in the SAM file
cut -f 1 human_rnaseq.sam | sort | uniq -c | awk '{print $1}' | sort | uniq -c | less	#This gives essentially a histogram of the number of times each read aligned - a plurality of reads aligned twice, which seems reasonable since these are all reads crossing a junction, but plenty aligned more or less
cut -f 1 human_rnaseq.sam | sort | uniq | wc -l	#This gives a better idea of the alignment rate, which is how many reads aligned at least once to the genome.  Divided by the number of reads in the original file, the real alignment rate is around 64.19 %.
 

This alignment rate is pretty good, but it could get better by playing around with the finer details of BWA MEM.  It is also a bit higher if you use an aligner that is more explicitly concerned with sensitivity to splice sites, namely a program like Tophat2.  All Tophat programs use Bowtie or Bowtie2 as the actual aligner, but split up each read into smaller pieces to align individually, then tries to reconstruct reasonable overall alignments from the pieces.  In fact, these reads can all be aligned by using Tophat2 with the right parameters.