You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 44 Next »

Differential expression with splice variant analysis at the same time: the Tophat/Cufflinks workflow

In this lab, you will explore a fairly typical RNA-seq analysis workflow. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:

i. How the workflow was run and what steps are involved.
ii. What genes and isoforms are significantly differentially expressed ?

Six raw data files were provided as the starting point: c1_r1, c1_r2, c1_r3 from the first biological group, and c2_r1, c2_r2, and c2_r3 from the second.

Due to the size of the data and length of run time, *most of the programs have already been run for this exercise*. The commands run are in different *.commands files. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results.

Introduction

An overview of tophat cufflinks workflow

mention igenomes: http://tophat.cbcb.umd.edu/igenomes.html

The workflow

a) map reads against the Drosophila genome (using tophat) (this has already been done to save time)
b) assemble the putative transcripts (using cufflinks) (this has already been done to save time)
c) merge the assemblies (using cuffmerge) (this has already been done to save time)
d) compute differential expression (using cuffdiff) and find genes and isoforms significantly differentially expressed (by parsing output)
e) inspect the results
f) examine the differential expression results (using unix, IGV and cummeRbund)
Extra: g) compare assembled transcripts to annotated transcripts to identify potentially novel ones (using cuffcmp)

You will almost certainly need to consult the documentation for tophat and cufflinks and cummeRbund:
http://tophat.cbcb.umd.edu/manual.html
http://cufflinks.cbcb.umd.edu/manual.html
http://compbio.mit.edu/cummeRbund/manual.html

a) and b) Running tophat and cufflinks: Look at tophat.commands and cufflinks.commands to see how tophat and cufflinks were run.

tophat [options] <bowtie_index_prefix> <reads1> <reads2>
cufflinks [options] <hits.bam>

On lonestar, to run tophat, cufflinks etc, following modules need to be loaded.

module load boost/1.45.0
module load bowtie
module load tophat
module load cufflinks/2.0.2

Exercise: If I wanted to provide a trancript annotation file (gtf file) in cufflinks command, what would I add to the command?
i. If I wanted cufflinks to not assemble any novel transcripts outside of what is in the gtf file
ii. If I wanted cufflinks to assemble novel as well as annotated transcripts?

-G to use only the annotated transcripts in the gtf file
-g to use the annotated transcripts in the gtf file as a guide, but also assemble novel transcripts


c) Merging assemblies using cuffmerge

find . -name transcripts.gtf > assembly_list.txt
cuffmerge <assembly_list.txt>


d) Finding differentially expressed genes and isoforms using cuffdiff

cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>

Exercise: What does cuffdiff -b do?

-b is for enabling fragment bias correction.

e) Inspect the mapped results

Before you even start, do a sanity check on your data by examining the bam files from the mapping output.

I've included the directory "bwa_genome" containing the output from bwa of the C1_R1_1 and C1_R1_2 files mapped directly to the genome using bwa. You might want to load this into IGV alongside the tophat bam file for the same sample to see the differences between mapping to the transcriptome and mapping to the genome (like you lose a bunch of reads).

Report the total number of input reads and the total number of mapped reads for each sample using "samtools flagstat". There is a big difference between the bam file from tophat and the bam file from bwa. Can you spot it? For C1_R1, compare the percentage mapped to the genome to the percentage mapped via tophat/bowtie to the genome.

Now, load ONE of the bam file outputs from the tophat run (remember, tophat calls bowtie for mapping). These are large files; you can load more than one, just be prepared that it may slow your computer down. If you want to load two, I'd suggest loading one from C1 and one from C2 so you see genuine expression level changes.

Downloading and running IGV
wget http://www.broadinstitute.org/igv/projects/downloads/IGV_2.1.14.zip
unzip IGV_2.1.14.zip
cd IGV_2.1.14
java -Xmx2g -jar igv.jar

f) Examine the differential expression analysis results

Find the cuffdiff output (either by understanding cuffdiff.commands or by looking) and by looking at it and/or reading the documentation find the isoforms and genes that are differentially expressed. Note that cuffdiff has performed a statistical test on the expression values between our two biological groups. It reports the FPKM expression levels for each group, the log2(group 1 FPKM/ group 2 FPKM), and a p-value measure of statistical confidence, among many other helpful data items.

Exercise: Find the 10 most up-regulated genes, by fold change. Look at one example of a up-regulated gene, regucalcin, on IGV. Find the 10 most down-regulated genes, by fold change. Look at one example of a down-regulated gene, on IGV.

One-line command to get 10 most up regulated genes
cat gene_exp.diff |grep 'yes'|sort -k10nr,10|head
One-line command to get 10 most down regulated genes
cat gene_exp.diff |grep 'yes'|sort -k10n,10|head

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

Linux one-liner for sorting cuffdiff output by log2 fold-change values
cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Exercise: Find the 10 most up-regulated isoforms, by fold change. What genes do they belong to?

One-line command to get 10 most up-regulated isoforms
cat isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

Using cummeRbund to inspect differential expression data.

Step 1: Load R and enter R environment

module load R
R

Step 2: Within R environment, load cummeRbund

source("http://bioconductor.org/biocLite.R")
biocLite("cummeRbund")
library(cummeRbund)

Step 3: Use cummeRbund to load and visualize the differential expression results

cuff_data <- readCufflinks('diff_out')
To draw a scatterplot
csScatter(genes(cuff_data), 'C1', 'C2')
To pull out significantly differentially expressed genes and isoforms
gene_diff_data  <- diffData(genes(cuff_data))
sig_gene_data  <- subset(gene_diff_data, (significant ==  'yes'))
nrow(sig_gene_data)
isoform_diff_data <-diffData(isoforms(cuff_data), 'C1', 'C2')
sig_isoform_data <- subset(isoform_diff_data, (significant == 'yes'))
nrow(sig_isoform_data)
To plot gene level and isoform level expression for gene regucalcin
mygene1 <- getGene(cuff_data,'regucalcin')
expressionBarplot(mygene1)
expressionBarplot(isoforms(mygene1))
To plot gene level and isoform level expression for gene Rala
mygene2 <- getGene(cuff_data, 'Rala')
expressionBarplot(mygene2)
expressionBarplot(isoforms(mygene2))

add optional exrcises for plotting volcano plot, boxplot, heatmap,
If you have trouble sourcing cummeRbund, try this:
module swap gcc intel
Reenter R and redo above steps.

  • No labels