Page History

...

The test file we will be working with is JUST just the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end. Go Go ahead and take a look at it, and find out how many reads are in the file.

code

Expand

title	Hint:

cds
less fastq_align/human_rnaaseq.fastq.gz
gunzip -c fastq_align/human_rnaseq.fastq.gz | echo $((`wc -l`/4))

RNA-seq alignment with bwa aln

Now, try aligning it with BWA bwa aln like we did in example 1. This won't take very long, but you'll need to use our pre-indexed hg19 reference. It is located at:Example #1, but first link to the hg19 bwa index directory.

Code Block

language	bash
title	Link to BWA hg19 index directory

cd $SCRATCH/core_ngs_align
ln -s -f

Code Block

/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19
ls hg19.fa

You should see a set of files if you look at the contents of the 'bwtsw' directory in the above path, you'll see a set of files that are analogous to the yeast files we created earlier, except that their universal prefix is 'hg19.fa'. Now, using that index, go .

Go ahead and try to do a single-end alignment of the file to the human genome using bwa aln like we did in Exercise #1, and save the contents to saving intermediate files with the prefix 'human_rnaseq_bwa'.

expandcds

bwa aln

/scratch/01063/abattenh/ref_genome/bwa/bwtsw/

hg19/hg19.fa

fastq_align

fq/human_rnaseq.fastq.gz >

alignments/

human_rnaseq_bwa.sai
bwa samse

/scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19/hg19.fa alignments/human

hg19/hg19.fa human_rnaseq_bwa.sai

fastq_align

fq/human_rnaseq.fastq.gz >

alignments/

human_rnaseq_bwa.sam

Code Block

language

bash
title	Hint:	Commands to bwa aln RNA-seq data

Code Block

Now, Once this is complete use less to take a look at the contents of the sam SAM file, using the space bar to leaf through them. YouYou'll notice a lot of alignments look basically like this:

Code Block

HWI-ST1097:228:C21WMACXX:8:1316:10989:88190     4       *       0       0       *       *       0       0
       AAATTGCTTCCTGTCCTCATCCTTCCTGTCAGCCATCTTCCTTCGTTTGATCTCAGGGAAGTTCAGGTCTTCCAGCCGCTCTTTGCCACTGATCTCCAGCT AAATTGCTTCCTGTCCTCATCCTTCCTGTCAGCCATCTTCCTTCGTTTGATCTCAGGGAAGTTCAGGTCTTCCAGCCGCTCTTTGCCACTGATCTCCAGCT
   CCCFFFFFHHHHHIJJJJIJJJJIJJJJHJJJJJJJJJJJJJJIIIJJJIGHHIJIJIJIJHBHIJJIIHIEGHIIHGFFDDEEEDDCDDD@CDEDDDCDD

...

 CCCFFFFFHHHHHIJJJJIJJJJIJJJJHJJJJJJJJJJJJJJIIIJJJIGHHIJIJIJIJHBHIJJIIHIEGHIIHGFFDDEEEDDCDDD@CDEDDDCDD

Notice that the contig name (field 3) is just an asterisk ( * ) and the alignment flags value is a 4 (field 2), meaning the read did not align (decimal 4 = hex 0x4 = read did not map).

Essentially, nothing (with a few exceptions) aligned. Why?

Expand

title	Answer

Because this file was generated exclusively from reads in a larger dataset that cross at least one splice junction

...

. The sequences as they exists in most of the reads

...

do not correspond to a single location in the genome

...

. However subsections of each read do exist somewhere in the genome.

...

So, we need an aligner that is capable aligning different parts of the read to different genomic loci.

Exercise: use bwa mem to align the same data

Based of finding regions of each read (above some length cutoff) that align to the genome. Based on the following syntax , and using the above reference path, we noted earlier, try to use BWA MEM use bwa mem to align the same file, single-end, saving all output files with the prefix 'human_rnaseq_mem':. Go ahead and just execute on the command line.

Code Block
bwa mem ref<ref.fafa> reads<reads.fqfq> > aln-seoutfile.sam

code

Expand

title	Hint:

bwa mem

/scratch/01063/abattenh/ref_genome/bwa/bwtsw/

hg19/hg19.fa

fastq_align

fq/human_rnaseq.fastq.gz >

alignments/

human_rnaseq_mem.sam

Now, check the length of the SAM file you generated with 'wc -l'. Since there is one alignment per line, there must be 586266 alignments, which is more than the number of sequences in the FASTQ file. This is because many reads will align twice or more, hopefully on either side of a splice junction. These alignments can still be associated because they will have the same read ID, but are reflected in more than one line. To get an idea of how often each read aligned, and what the 'real' alignment rate is, use the following commands:

...

Page tree

Versions Compared

Old Version 49

New Version 50

Key

RNA-seq alignment with bwa aln