Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

We will also use two additional RNA-seq datasets, which are located at:

Code Block
$CLASSDIR/corral-repl/utexas/BioITeam/core_ngs_tools/human_stuff
File NameDescriptionSample
Sample_Yeast_L005_R1.cat.fastq.gzPaired-end Illumina, First of pair, FASTQYeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gzPaired-end Illumina, Second of pair, FASTQYeast ChIP-seq
human_rnaseq.fastq.gzPaired-end Illumina, First of pair only, FASTQHuman RNA-seq
human_mirnaseq.fastq.gzSingle-end Illumina, FASTQHuman microRNA-seq

...

Exercise: How many contigs are there in the sacCer3 reference?

Expand
titleHint
Code Block
languagebash
grep -P '^>' sacCer3.fa | wc -l

or Or use grep's -c option that says "just count the line matches"

Code Block
languagebash
grep -P -c '^>' sacCer3.fa


Expand
titleAnswer

There are 17 contigs.

Performing the bwa alignment

Now, we're ready to execute the actual alignment, with the goal of initially producing a SAM file from the input FASTQ files and reference. First go to the align directory, and link to the sacCer3 reference directory (this will make our commands more readable).

...

What does bwa aln needs in the way of arguments?

Expand
titleHint
Code Block
languagebash
bwa aln

There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things. 

...

You did it!  You should now have a SAM file that contains the alignments.  ItIt's just a text file, so take a look with head, more, less, tail, or whatever you feel like.  In In the next section, with samtools, you'll learn some additional ways to analyze the data once you create a BAM file.

...

Expand
titleHint

This looks for the pattern  '^HWI' which is the start of every read name (which starts every alignment record).
Remember -c says just count the records, don't display them.

Code Block
languagebash
grep -P -c '^@' yeast_pairedend.sam 

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern, here the header lines starting with @.

Code Block
languagebash
grep -P -v -c '^HWI' yeast_pairedend.sam



Expand
titleAnswer
There are 1184360 1,184,360 alignment records.

Exercise: How many sequences were in the R1 and R2 FASTQ files combined?

...

Expand
titleAnswer
There were a total of 1184360 1,184,360 original sequences

Exercises:

...

Code Block
titlebowtie2 index files for miRNAs
hairpin_cDNA_hsa.fa
hairpin_cDNA_hsa.fa.1.bt2
hairpin_cDNA_hsa.fa.2.bt2
hairpin_cDNA_hsa.fa.3.bt2
hairpin_cDNA_hsa.fa.4.bt2
hairpin_cDNA_hsa.fa.rev.1.bt2
hairpin_cDNA_hsa.fa.rev.2.bt2

Performing the bowtie2 local alignment

Now, we're ready to actually try to do the alignment.  Remember, unlike BWA, we actually need to set some options depending on what we're after. Some of the important options for bowtie2 are:

...

  • This is one alignment record, although it has been broken up below for readability.
  • This read mapped to the mature microRNA sequence hsa-mir-302b, starting at base 50 in that contig.
  • Notice the CIGAR string is 3S20M13S, meaning that 3 bases were soft clipped from one end (3S), and 13 from the other (13S).
    • If we did the same alignment using either bowtie2 --end-to-end mode, or using bwa aln as in Exercise #1, very little of this file would have aligned.
  • The 20M part of the CIGAR string says there was a block of 20 read bases that mapped to the reference.
    • If we had not lowered the seed parameter of bowtie2 from its default of 22, we would not have found many of the alignments like this one that only matched for 20 bases.

...

The test file we will be working with is just the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end. Go ahead and take a look at it, and find out how many reads are in the file.

Expand
titleHint:
Code Block
languagebash
cds
less fastq_align/human_rnaaseq.fastq.gz
gunzip -c fastq_align/human_rnaseq.fastq.gz | echo $((`wc -l`/4))

RNA-seq alignment with bwa aln

...

Code Block
bwa mem <ref.fa> <reads.fq> > outfile.sam
Expand
titleHint:
Code Block
languagebash
bwa mem hg19/hg19.fa fq/human_rnaseq.fastq.gz > human_rnaseq_mem.sam
 

Check the length of the SAM file you generated with wc -l. Since there is one alignment per line, there must be 586266 alignments (minus no more than 100 header lines), which is more than the number of sequences in the FASTQ file. This is bwa mem can report multiple alignment records for the same read, hopefully on either side of a splice junction. These alignments can still be tied together because they have the same read ID.

...