Differential expression with splice variant analysis aug2012

Differential expression with splice variant analysis at the same time: the Tophat/Cufflinks workflow

In this lab, you will explore a fairly typical RNA-seq analysis workflow. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:

i. How the workflow was run and what steps are involved.
ii. What genes and isoforms are significantly differentially expressed

Six raw data files were provided as the starting point:
c1_r1, c1_r2, c1_r3 from the first biological condition, and c2_r1, c2_r2, and c2_r3 from the second.

Due to the size of the data and length of run time, *most of the programs have already been run for this exercise*. The commands run are in different *.commands files. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results.

Introduction

An overview of tophat cufflinks workflow

Some Logistics...

This section is all located under $BI/ngs_course/tophat_cufflinks/ So cd into this directory now.

Commands used are in different *.commands files located in $BI/ngs_course/tophat_cufflinks/run_commands

Some output, like bam files can be gotten from the URL http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ and directly loaded into IGV, using Load from URL.

If you generate your own output and would like to view them on IGV, you will need to scp the files from lonestar to your computer.

Reminder on how to scp files from lonestar

On your computer's side:

Go to the directory where you want to copy files to.

 scp my_user_name@lonestar.tacc.utexas.edu:/home/.../stuff.fastq ./

Replace the "/home/..." with the "pwd" information obtained earlier.

This command would transfer "stuff.fastq" from the specified directory on Lonestar to your current directory on your computer.

For help, remember to type the commands and hit enter to see what help they can offer you. You will also almost certainly need to consult the documentation for tophat, cufflinks and cummeRbund:
http://tophat.cbcb.umd.edu/manual.html
http://cufflinks.cbcb.umd.edu/manual.html
http://compbio.mit.edu/cummeRbund/manual.html

The workflow

a) map reads against the Drosophila genome (using tophat) (this has already been done to save time)
b) assemble the putative transcripts (using cufflinks) (this has already been done to save time)
c) merge the assemblies (using cuffmerge) (this has already been done to save time)
d) compute differential expression (using cuffdiff) (this has already been done to save time)
e) inspect the results
f) examine the differential expression results (using unix, IGV and cummeRbund)
Extra: g) compare assembled transcripts to annotated transcripts to identify potentially novel ones (using cuffcmp)

a) and b) Running tophat and cufflinks: Look at run_commands/tophat.commands and run_commands/cufflinks.commands to see how tophat and cufflinks were run.

How do I look into the files?

cat $BI/tophat_cufflinks/run_commands/tophat.commands
cat $BI/tophat_cufflinks/run_commands/cufflinks.commands

On lonestar, to run tophat, cufflinks etc, following modules need to be loaded.

module load boost/1.45.0
module load bowtie
module load tophat
module load cufflinks/2.0.2

General syntax for tophat command

tophat [options] <bowtie_index_prefix> <reads1> <reads2>

Take a minute to look at the output files produced by one tophat run.

The most important files are in bold.

-rwxr-xr-x 1 daras G-803889 323M Aug 16 11:47 accepted_hits.bam
-r-xr-xr-x 1 daras G-803889 237K Aug 16 11:46 accepted_hits.bam.bai
-rwxr-xr-x 1 daras G-803889   52 Aug 16 11:46 deletions.bed
-rwxr-xr-x 1 daras G-803889   54 Aug 16 11:46 insertions.bed
-rwxr-xr-x 1 daras G-803889 2.9M Aug 16 11:46 junctions.bed
-rwxr-xr-x 1 daras G-803889   70 Aug 16 11:46 left_kept_reads.info
drwxr-xr-x 2 daras G-803889  32K Aug 16 11:46 logs
-rwxr-xr-x 1 daras G-803889   70 Aug 16 11:46 right_kept_reads.info
-rwxr-xr-x 1 daras G-803889 9.7K Aug 16 11:46 unmapped_left.fq.z
-rwxr-xr-x 1 daras G-803889 9.9K Aug 16 11:46 unmapped_right.fq.z

General syntax for cufflinks command

cufflinks [options] <hits.bam>

Take a minute to look at the output files produced by one cufflinks run.

The most important files are in bold.

-rwxr-xr-x 1 daras G-803889  14M Aug 16 12:49 transcripts.gtf
-rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking
-rwxr-xr-x 1 daras G-803889    0 Aug 16 12:33 skipped.gtf

Exercise 1: If I wanted to provide a trancript annotation file (gtf file) in cufflinks command, what would I add to the command? Remember that you can type the command and hit enter to get help info about it.
i. If I wanted cufflinks to not assemble any novel transcripts outside of what is in the gtf file
ii. If I wanted cufflinks to assemble novel as well as annotated transcripts?

Solution

-G to use only the annotated transcripts in the gtf file
-g to use the annotated transcripts in the gtf file as a guide, but also assemble novel transcripts

c) Merging assemblies using cuffmerge

find . -name transcripts.gtf > assembly_list.txt
cuffmerge <assembly_list.txt>

Take a minute to look at the output files produced by cuffmerge.

The most important file is:
-r-xr-xr-x 1 daras G-803889 32090408 Aug 16 20:11 merged.gtf*

d) Finding differentially expressed genes and isoforms using cuffdiff

cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>

Exercise 2: What does cuffdiff -b do?

Solution

-b is for enabling fragment bias correction.

We will be inspecting the cuffdiff results further in a little bit.

e) Inspect the mapped results

Before you even start, do a sanity check on your data by examining the bam files from the mapping output.

I've included the directory "bwa_genome" containing the output from bwa of the C1_R1_1 and C1_R1_2 files mapped directly to the genome using bwa. Tophat output is in C1_R1_thout for C1_R1 sample. Report the total number of mapped reads for each sample using "samtools flagstat" for the bwa output (bam file) and the tophat output (bam file). There is a big difference between the bam file from tophat and the bam file from bwa. Can you spot it?

Reminder on how to run samtools flagstat

samtools flagstat C1_R1.bam

You might want to load the bwa bam file on IGV alongside the tophat bam file for the same sample to see the differences between mapping to the transcriptome and mapping to the genome.

Genome track

Use D.melanogaster (dm3)

Data files (use Load from URL option)

Load tophat bam file from URL: [http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/|http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/]C1_R1_accepted_hits.bam
* Load bwa bam file from URL: [http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/|http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/]bwa_genome/C1_R1.bam

Now, lets load another bam file output from the tophat run (remember, tophat calls bowtie for mapping). These are large file, so it may slow your computer down. Since we've already loaded C1_R1, let's load one from the C2 condition, say, C2_R1, to see genuine expression level changes. Go to gene regucalcin (remember, you can just type the gene name to get there) and see if it is differentially expressed between C1_R1 and C2_R1.
Keep this open to come back to, when we further explore differential expression results.

Reminder on how to download and run IGV

Downloading and running IGV

wget http://www.broadinstitute.org/igv/projects/downloads/IGV_2.1.14.zip
unzip IGV_2.1.14.zip
cd IGV_2.1.14
java -Xmx2g -jar igv.jar

f) Examine the differential expression analysis results

The cuffdiff output is in a directory called diff_out. We are going to spend some time parsing through this output. So, copy it over to your scratch directory and move to your SCRATCH directory.

cp -r diff_out $SCRATCH

cd $SCRATCH/diff_out

Note that cuffdiff has performed a statistical test on the expression values between our two biological groups. It reports the FPKM expression levels for each group, the log2(group 1 FPKM/ group 2 FPKM), and a p-value measure of statistical confidence, among many other helpful data items.

Take a minute to look at the output files produced by cuffmerge.

The most important file is:

-rwxr-x--- 1 daras G-801020  1483520 Aug 21 12:20 gene_exp.diff-rwxr-x--- 1 daras G-801020  2691192 Aug 21 12:20 isoform_exp.diff-rwxr-x--- 1 daras G-801020  1729831 Aug 21 12:20 tss_group_exp.diff-rwxr-x--- 1 daras G-801020  1369451 Aug 21 12:20 cds_exp.diff
-rwxr-x--- 1 daras G-801020   919690 Aug 21 12:20 cds.diff


-rwxr-x--- 1 daras G-801020  1628659 Aug 21 12:20 genes.fpkm_tracking
-rwxr-x--- 1 daras G-801020  3277177 Aug 21 12:20 isoforms.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1885773 Aug 21 12:20 tss_groups.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1477492 Aug 21 12:20 cds.fpkm_tracking

-rwxr-x--- 1 daras G-801020  1158560 Aug 21 12:20 promoters.diff
-rwxr-x--- 1 daras G-801020  1349574 Aug 21 12:20 splicing.diff

-rwxr-x--- 1 daras G-801020   919690 Aug 21 12:20 cds.diff

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

Linux one-liner for sorting cuffdiff output by log2 fold-change values

cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Exercise 3: Find the 10 most up-regulated genes, by fold change. Look at one example of a up-regulated gene, regucalcin, on IGV. Find the 10 most down-regulated genes, by fold change. Look at one example of a down-regulated gene, on IGV.

Solution

Top 10 upregulated genes
scf
CG8979
CG4389
crc
KdelR
Vha68-2
CG3835
Df31
by
Dhpr

Top 10 downregulated genes
Git
CG1657
Myo31DF
RpII215
RpL9
Pde8
Dsp1
Bsg
eIF-4E
HDAC6,dah

Hint

One-line command to get 10 most up regulated genes

cat gene_exp.diff |grep 'yes'|sort -k10nr,10|head

One-line command to get 10 most down regulated genes

cat gene_exp.diff |grep 'yes'|sort -k10n,10|head

Exercise 4: Find the 10 most up-regulated isoforms, by fold change. What genes do they belong to?

Solution

simj
CG2177
sPLA2
Nipsnap
Pde8
by
CG15814
Dhpr
eIF-4E
spir

Hint

One-line command to get 10 most up-regulated isoforms

cat isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

Using cummeRbund to inspect differential expression data.

Step 1: Load R and enter R environment

module load R
R

Step 2: Within R environment, set up cummeRbund

source("http://bioconductor.org/biocLite.R")
biocLite("cummeRbund")

Step 3: Load cummeRbund library and read in the differential expression results.

library(cummeRbund)
cuff_data <- readCufflinks('diff_out')

Step 4: Use cummeRbund to visualize the differential expression results

To pull out significantly differentially expressed genes and isoforms

gene_diff_data  <- diffData(genes(cuff_data))
sig_gene_data  <- subset(gene_diff_data, (significant ==  'yes'))
nrow(sig_gene_data)
isoform_diff_data <-diffData(isoforms(cuff_data), 'C1', 'C2')
sig_isoform_data <- subset(isoform_diff_data, (significant == 'yes'))
nrow(sig_isoform_data)

To draw a scatterplot

csScatter(genes(cuff_data), 'C1', 'C2')

To plot gene level and isoform level expression for gene regucalcin

mygene1 <- getGene(cuff_data,'regucalcin')
expressionBarplot(mygene1)
expressionBarplot(isoforms(mygene1))

To plot gene level and isoform level expression for gene Rala

mygene2 <- getGene(cuff_data, 'Rala')
expressionBarplot(mygene2)
expressionBarplot(isoforms(mygene2))

Take cummeRbund for a spin...

CummeRbund is powerful package with many different functions. Above was an illustration of a few of them. Try any of the suggested exercises below to further explore the differential expression results with different cummeRbund functions.

If you would rather just look at the resulting graphs ,they are at the URL: http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ as exercise5_Rplots.pdf, exercise6_Rplots.pdf, and exercise7_Rplots.pdf.

You can refer to the cummeRbund manual for more help and remember that ?functionName will provide syntax information for different functions.
http://compbio.mit.edu/cummeRbund/manual.html

You may need to redo Step 3. everytime you reopen an R session.

Exercise 5: Visualize the distribution of fpkm values across the two different conditions using a boxplot.

Solution

R command to generate box plot of gene level fpkms

csBoxplot(genes(cuff_data))

R command to generate box plot of isoform level fpkms

csBoxplot(isoforms(cuff_data))

Hint

Use csBoxplot function on cuff_data object to generate a boxplot of gene or isoform level fpkms.

Exercise 6: Visualize the significant vs non-significant genes using a volcano plot.

Solution

csVolcano(genes(cuff_data), "C1", "C2")

Hint

Use csVolcano function on cuff_data object to generate a volcano plot.

Exercise 7: Pull out a subset of the genes using a p_value cutoff of 0.01. Generate expression bar plots for just those genes.

Solution

One possible solution

gene_diff_data <- diffData(genes(cuff_data))
sig_gene_data <- subset(gene_diff_data, (ln_fold_change > 1.5))
head(sig_gene_data)
sig_geneids <- c(sig_gene_data[1]$gene_id)

myGenes <- getGenes(cuff_data, sig_geneids)
expressionBarplot(myGenes)

If you have trouble sourcing cummeRbund, try this:
module swap gcc intel
Reenter R and redo above steps.

Space shortcuts

Page tree

Differential expression with splice variant analysis at the same time: the Tophat/Cufflinks workflow

Introduction

Some Logistics...

The workflow

a) and b) Running tophat and cufflinks: Look at run_commands/tophat.commands and run_commands/cufflinks.commands to see how tophat and cufflinks were run.

c) Merging assemblies using cuffmerge

d) Finding differentially expressed genes and isoforms using cuffdiff

e) Inspect the mapped results

Before you even start, do a sanity check on your data by examining the bam files from the mapping output.

f) Examine the differential expression analysis results

Using cummeRbund to inspect differential expression data.

Take cummeRbund for a spin...