View Source

Differential expression with splice variant analysis at the same time: the Tophat/Cufflinks workflow

In this lab, you will explore a fairly typical RNA-seq analysis workflow. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:

i. How the workflow was run and what steps are involved.
ii. What genes and isoforms are significantly differentially expressed

Six raw data files were provided as the starting point:
c1_r1, c1_r2, c1_r3 from the first biological condition, and c2_r1, c2_r2, and c2_r3 from the second.

Due to the size of the data and length of run time, *most of the programs have already been run for this exercise*. The commands run are in different *.commands files. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results.

Introduction

An overview of tophat cufflinks workflow

Some Logistics...

This section is all located under $BI/ngs_course/tophat_cufflinks/ So cd into this directory now.

Commands used are in different *.commands files located in $BI/ngs_course/tophat_cufflinks/run_commands

Some output, like bam files can be gotten from the URL http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ and directly loaded into IGV, using Load from URL.

If you generate your own output and would like to view them on IGV, you will need to scp the files from lonestar to your computer.

On your computer's side:

Go to the directory where you want to copy files to.

 scp my_user_name@lonestar.tacc.utexas.edu:/home/.../stuff.fastq ./

Replace the "/home/..." with the "pwd" information obtained earlier.

This command would transfer "stuff.fastq" from the specified directory on Lonestar to your current directory on your computer.

For help, remember to type the commands and hit enter to see what help they can offer you. You will also almost certainly need to consult the documentation for tophat, cufflinks and cummeRbund:
http://tophat.cbcb.umd.edu/manual.html
http://cufflinks.cbcb.umd.edu/manual.html
http://compbio.mit.edu/cummeRbund/manual.html

The workflow

a) map reads against the Drosophila genome (using tophat) (this has already been done to save time)
b) assemble the putative transcripts (using cufflinks) (this has already been done to save time)
c) merge the assemblies (using cuffmerge) (this has already been done to save time)
d) compute differential expression (using cuffdiff) (this has already been done to save time)
e) inspect the results
f) examine the differential expression results (using unix, IGV and cummeRbund)
Extra: g) compare assembled transcripts to annotated transcripts to identify potentially novel ones (using cuffcmp)

a) and b) Running tophat and cufflinks: Look at run_commands/tophat.commands and run_commands/cufflinks.commands to see how tophat and cufflinks were run.

cat $BI/tophat_cufflinks/run_commands/tophat.commands
cat $BI/tophat_cufflinks/run_commands/cufflinks.commands

On lonestar, to run tophat, cufflinks etc, following modules need to be loaded.

module load boost/1.45.0
module load bowtie
module load tophat
module load cufflinks/2.0.2

tophat [options] <bowtie_index_prefix> <reads1> <reads2>

-rwxr-xr-x 1 daras G-803889 323M Aug 16 11:47 accepted_hits.bam
-r-xr-xr-x 1 daras G-803889 237K Aug 16 11:46 accepted_hits.bam.bai

-rwxr-xr-x 1 daras G-803889   52 Aug 16 11:46 deletions.bed
-rwxr-xr-x 1 daras G-803889   54 Aug 16 11:46 insertions.bed
-rwxr-xr-x 1 daras G-803889 2.9M Aug 16 11:46 junctions.bed
-rwxr-xr-x 1 daras G-803889   70 Aug 16 11:46 left_kept_reads.info
drwxr-xr-x 2 daras G-803889  32K Aug 16 11:46 logs
-rwxr-xr-x 1 daras G-803889   70 Aug 16 11:46 right_kept_reads.info
-rwxr-xr-x 1 daras G-803889 9.7K Aug 16 11:46 unmapped_left.fq.z
-rwxr-xr-x 1 daras G-803889 9.9K Aug 16 11:46 unmapped_right.fq.z

cufflinks [options] <hits.bam>

-rwxr-xr-x 1 daras G-803889  14M Aug 16 12:49 transcripts.gtf

-rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking
-rwxr-xr-x 1 daras G-803889    0 Aug 16 12:33 skipped.gtf

Exercise 1: If I wanted to provide a trancript annotation file (gtf file) in cufflinks command, what would I add to the command? Remember that you can type the command and hit enter to get help info about it.
i. If I wanted cufflinks to not assemble any novel transcripts outside of what is in the gtf file
ii. If I wanted cufflinks to assemble novel as well as annotated transcripts?

-G to use only the annotated transcripts in the gtf file
-g to use the annotated transcripts in the gtf file as a guide, but also assemble novel transcripts

c) Merging assemblies using cuffmerge

find . -name transcripts.gtf > assembly_list.txt
cuffmerge <assembly_list.txt>

The most important file is:
-r-xr-xr-x 1 daras G-803889 32090408 Aug 16 20:11 merged.gtf

d) Finding differentially expressed genes and isoforms using cuffdiff

cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>

Exercise 2: What does cuffdiff -b do?

-b is for enabling fragment bias correction.

We will be inspecting the cuffdiff results further in a little bit.

e) Inspect the mapped results

Before you even start, do a sanity check on your data by examining the bam files from the mapping output.

I've included the directory "bwa_genome" containing the output from bwa of the C1_R1_1 and C1_R1_2 files mapped directly to the genome using bwa. Tophat output is in C1_R1_thout for C1_R1 sample. Report the total number of mapped reads for each sample using "samtools flagstat" for the bwa output (bam file) and the tophat output (bam file). There is a big difference between the bam file from tophat and the bam file from bwa. Can you spot it?

samtools flagstat C1_R1.bam

You might want to load the bwa bam file on IGV alongside the tophat bam file for the same sample to see the differences between mapping to the transcriptome and mapping to the genome.

Genome track

Use D.melanogaster (dm3)

Data files (use Load from URL option)

Load tophat bam file from URL: [http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/|http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/]C1_R1_accepted_hits.bam
* Load bwa bam file from URL: [http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/|http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/]bwa_genome/C1_R1.bam

Now, lets load another bam file output from the tophat run (remember, tophat calls bowtie for mapping). These are large file, so it may slow your computer down. Since we've already loaded C1_R1, let's load one from the C2 condition, say, C2_R1, to see genuine expression level changes. Go to gene regucalcin (remember, you can just type the gene name to get there) and see if it is differentially expressed between C1_R1 and C2_R1.
Keep this open to come back to, when we further explore differential expression results.

wget http://www.broadinstitute.org/igv/projects/downloads/IGV_2.1.14.zip
unzip IGV_2.1.14.zip
cd IGV_2.1.14
java -Xmx2g -jar igv.jar

f) Examine the differential expression analysis results

The cuffdiff output is in a directory called diff_out. We are going to spend some time parsing through this output. So, copy it over to your scratch directory and move to your SCRATCH directory.

cp \-r diff_out $SCRATCH
cd $SCRATCH/diff_out

Note that cuffdiff has performed a statistical test on the expression values between our two biological groups. It reports the FPKM expression levels for each group, the log2(group 1 FPKM/ group 2 FPKM), and a p-value measure of statistical confidence, among many other helpful data items.

The most important file is:

-rwxr-x--- 1 daras G-801020  1483520 Aug 21 12:20 gene_exp.diff-rwxr-x--- 1 daras G-801020  2691192 Aug 21 12:20 isoform_exp.diff-rwxr-x--- 1 daras G-801020  1729831 Aug 21 12:20 tss_group_exp.diff-rwxr-x--- 1 daras G-801020  1369451 Aug 21 12:20 cds_exp.diff
-rwxr-x--- 1 daras G-801020   919690 Aug 21 12:20 cds.diff


-rwxr-x--- 1 daras G-801020  1628659 Aug 21 12:20 genes.fpkm_tracking
-rwxr-x--- 1 daras G-801020  3277177 Aug 21 12:20 isoforms.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1885773 Aug 21 12:20 tss_groups.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1477492 Aug 21 12:20 cds.fpkm_tracking

-rwxr-x--- 1 daras G-801020  1158560 Aug 21 12:20 promoters.diff
-rwxr-x--- 1 daras G-801020  1349574 Aug 21 12:20 splicing.diff

-rwxr-x--- 1 daras G-801020   919690 Aug 21 12:20 cds.diff

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Exercise 3: Find the 10 most up-regulated genes, by fold change. Look at one example of a up-regulated gene, regucalcin, on IGV. Find the 10 most down-regulated genes, by fold change. Look at one example of a down-regulated gene, on IGV.

Top 10 upregulated genes
scf
CG8979
CG4389
crc
KdelR
Vha68-2
CG3835
Df31
by
Dhpr

Top 10 downregulated genes
Git
CG1657
Myo31DF
RpII215
RpL9
Pde8
Dsp1
Bsg
eIF-4E
HDAC6,dah

cat gene_exp.diff |grep 'yes'|sort -k10nr,10|head

cat gene_exp.diff |grep 'yes'|sort -k10n,10|head

Exercise 4: Find the 10 most up-regulated isoforms, by fold change. What genes do they belong to?

simj
CG2177
sPLA2
Nipsnap
Pde8
by
CG15814
Dhpr
eIF-4E
spir

cat isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

Using cummeRbund to inspect differential expression data.

Step 1: Load R and enter R environment

module load R
R

Step 2: Within R environment, set up cummeRbund

source("http://bioconductor.org/biocLite.R")
biocLite("cummeRbund")

Step 3: Load cummeRbund library and read in the differential expression results.

library(cummeRbund)
cuff_data <- readCufflinks('diff_out')

Step 4: Use cummeRbund to visualize the differential expression results

gene_diff_data  <- diffData(genes(cuff_data))
sig_gene_data  <- subset(gene_diff_data, (significant ==  'yes'))
nrow(sig_gene_data)
isoform_diff_data <-diffData(isoforms(cuff_data), 'C1', 'C2')
sig_isoform_data <- subset(isoform_diff_data, (significant == 'yes'))
nrow(sig_isoform_data)

csScatter(genes(cuff_data), 'C1', 'C2')

mygene1 <- getGene(cuff_data,'regucalcin')
expressionBarplot(mygene1)
expressionBarplot(isoforms(mygene1))

mygene2 <- getGene(cuff_data, 'Rala')
expressionBarplot(mygene2)
expressionBarplot(isoforms(mygene2))

Take cummeRbund for a spin...

CummeRbund is powerful package with many different functions. Above was an illustration of a few of them. Try any of the suggested exercises below to further explore the differential expression results with different cummeRbund functions.

If you would rather just look at the resulting graphs ,they are at the URL: http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ as exercise5_Rplots.pdf, exercise6_Rplots.pdf, and exercise7_Rplots.pdf.

You can refer to the cummeRbund manual for more help and remember that ?functionName will provide syntax information for different functions.
http://compbio.mit.edu/cummeRbund/manual.html

You may need to redo Step 3. everytime you reopen an R session.

Exercise 5: Visualize the distribution of fpkm values across the two different conditions using a boxplot.

csBoxplot(genes(cuff_data))

csBoxplot(isoforms(cuff_data))

Use csBoxplot function on cuff_data object to generate a boxplot of gene or isoform level fpkms.

Exercise 6: Visualize the significant vs non-significant genes using a volcano plot.

csVolcano(genes(cuff_data), "C1", "C2")

Use csVolcano function on cuff_data object to generate a volcano plot.

Exercise 7: Pull out a subset of the genes using a p_value cutoff of 0.01. Generate expression bar plots for just those genes.

gene_diff_data  <\- diffData(genes(cuff_data))
sig_gene_data  <\- subset(gene_diff_data, (ln_fold_change > 1.5))
head(sig_gene_data)
sig_geneids <\- c(sig_gene_data\[1\]$gene_id)

myGenes <- getGenes(cuff_data, sig_geneids)
expressionBarplot(myGenes)

If you have trouble sourcing cummeRbund, try this:
module swap gcc intel
Reenter R and redo above steps.