Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The test file we will be working with is JUST just the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end.  Go Go ahead and take a look at it, and find out how many reads are in the file.

code
Expand
titleHint:
cds
less fastq_align/human_rnaaseq.fastq.gz
gunzip -c fastq_align/human_rnaseq.fastq.gz | echo $((`wc -l`/4))

RNA-seq alignment with bwa aln

Now, try aligning it with BWA bwa aln like we did in example 1.  This won't take very long, but you'll need to use our pre-indexed hg19 reference.  It is located at:Example #1, but first link to the hg19 bwa index directory.

Code Block
languagebash
titleLink to BWA hg19 index directory
cd $SCRATCH/core_ngs_align
ln -s -f 
Code Block
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19
ls hg19.fa

You should see a set of files if you look at the contents of the 'bwtsw' directory in the above path, you'll see a set of files that are analogous to the yeast files we created earlier, except that their universal prefix is 'hg19.fa'.  Now, using that index, go .  

Go ahead and try to do a single-end alignment of the file to the human genome using bwa aln like we did in Exercise #1, and save the contents to saving intermediate files with the prefix 'human_rnaseq_bwa'.

expandcds
bwa aln 
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/
hg19/hg19.fa 
fastq_align
fq/human_rnaseq.fastq.gz > 
alignments/
human_rnaseq_bwa.sai
bwa samse 
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19/hg19.fa alignments/human
hg19/hg19.fa human_rnaseq_bwa.sai 
fastq_align
fq/human_rnaseq.fastq.gz > 
alignments/
human_rnaseq_bwa.sam
Code Block
language
bash
titleHint:Commands to bwa aln RNA-seq data
Code Block

Now, Once this is complete use less to take a look at the contents of the sam SAM file, using the space bar to leaf through them.  YouYou'll notice a lot of alignments look basically like this:

Code Block
HWI-ST1097:228:C21WMACXX:8:1316:10989:88190     4       *       0       0       *       *       0       0
       AAATTGCTTCCTGTCCTCATCCTTCCTGTCAGCCATCTTCCTTCGTTTGATCTCAGGGAAGTTCAGGTCTTCCAGCCGCTCTTTGCCACTGATCTCCAGCT AAATTGCTTCCTGTCCTCATCCTTCCTGTCAGCCATCTTCCTTCGTTTGATCTCAGGGAAGTTCAGGTCTTCCAGCCGCTCTTTGCCACTGATCTCCAGCT
   CCCFFFFFHHHHHIJJJJIJJJJIJJJJHJJJJJJJJJJJJJJIIIJJJIGHHIJIJIJIJHBHIJJIIHIEGHIIHGFFDDEEEDDCDDD@CDEDDDCDD

...

 CCCFFFFFHHHHHIJJJJIJJJJIJJJJHJJJJJJJJJJJJJJIIIJJJIGHHIJIJIJIJHBHIJJIIHIEGHIIHGFFDDEEEDDCDDD@CDEDDDCDD

Notice that the contig name (field 3) is just an asterisk ( * ) and the alignment flags value is a 4 (field 2), meaning the read did not align (decimal 4 = hex 0x4 = read did not map).

Essentially, nothing (with a few exceptions) aligned. Why?

Expand
titleAnswer

Because this file was generated exclusively from reads in a larger dataset that cross at least one splice junction

...

. The sequences as they exists in most of the reads

...

do not correspond to a single location in the genome

...

. However subsections of each read do exist somewhere in the genome.

...

So, we need an aligner that is capable aligning different parts of the read to different genomic loci.

Exercise: use bwa mem to align the same data

Based of finding regions of each read (above some length cutoff) that align to the genome.  Based on the following syntax , and using the above reference path, we noted earlier, try to use BWA MEM use bwa mem to align the same file, single-end, saving all output files with the prefix 'human_rnaseq_mem':. Go ahead and just execute on the command line.

Code Block
bwa mem ref<ref.fafa> reads<reads.fqfq> > aln-seoutfile.sam
code
Expand
titleHint:
bwa mem 
/scratch/01063/abattenh/ref_genome/bwa/bwtsw/
hg19/hg19.fa 
fastq_align
fq/human_rnaseq.fastq.gz > 
alignments/
human_rnaseq_mem.sam

Now, check the length of the SAM file you generated with 'wc -l'.  Since there is one alignment per line, there must be 586266 alignments, which is more than the number of sequences in the FASTQ file.  This is because many reads will align twice or more, hopefully on either side of a splice junction.  These alignments can still be associated because they will have the same read ID, but are reflected in more than one line.  To get an idea of how often each read aligned, and what the 'real' alignment rate is, use the following commands:

...