Differential expression with splice variant analysis at the same time: the Tuxedo pipeline
The Tuxedo Pipeline is a suite of tools for RNA-seq analysis, also known as the Tophat/Cufflinks workflow. It can be run in a variety of ways, optionally including de novo splice variant discovery. If an adequate set of splice variants is also available, it can also be run without splice variant detection to perform simple differential gene expression.
Useful RNA-seq resources are summarized on our Resources tool list, Transcriptome analaysis section. The most important of these resources for Tuxedo are:
- the original RNAseq analysis protocol using Tuxedo article in Nature Protocols, and
- the URL for Tuxedo resource bundles for selected organisms (gff annotations, pre-built bowtie references, etc.)
Objectives
In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:
- How the workflow was run and what steps are involved.
- What genes and isoforms are significantly differentially expressed
Six raw data files were provided as the starting point:
- c1_r1, c1_r2, c1_r3 from the first biological condition
- c2_r1, c2_r2, and c2_r3 from the second biological condition
Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in different *.commands files. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results.
Introduction
An overview of tophat cufflinks workflow
This is the full workflow that includes de novo splice variant detection. For simple differential gene expression, Steps 2 (cufflinks) and 3-4 (cuffmerge) can be omitted.
Some Logistics...
This section is all located under $BI/ngs_course/tophat_cufflinks/ So cd into this directory now.
Commands used are in different *.commands files located in $BI/ngs_course/tophat_cufflinks/run_commands
Some output, like bam files can be gotten from the URL http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ and directly loaded into IGV, using Load from URL.
If you generate your own output and would like to view them on IGV, you will need to scp the files from lonestar to your computer.
For help, remember to type the commands and hit enter to see what help they can offer you. You will also almost certainly need to consult the documentation for tophat, cufflinks and cummeRbund:
- http://tophat.cbcb.umd.edu/manual.html
- http://cufflinks.cbcb.umd.edu/manual.html
- http://compbio.mit.edu/cummeRbund/manual.html
The workflow
a) map reads against the Drosophila genome (using tophat) (this has already been done to save time)
b) assemble the transcripts (using cufflinks) (this has already been done to save time)
c) merge the assemblies (using cuffmerge) (this has already been done to save time)
d) compute differential expression (using cuffdiff) (this has already been done to save time)
e) inspect the results
f) examine the differential expression results (using unix, IGV and cummeRbund)
Extra: g) compare assembled transcripts to annotated transcripts to identify potentially novel ones (using cuffcmp)
a) and b) Running tophat and cufflinks
On lonestar, to run tophat, cufflinks etc, following modules need to be loaded.
module load boost/1.45.0 module load bowtie module load tophat module load cufflinks/2.0.2
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
Look at run_commands/tophat.commands to see how it was run.
-rwxr-xr-x 1 daras G-803889 323M Aug 16 11:47 accepted_hits.bam -r-xr-xr-x 1 daras G-803889 237K Aug 16 11:46 accepted_hits.bam.bai -rwxr-xr-x 1 daras G-803889 52 Aug 16 11:46 deletions.bed -rwxr-xr-x 1 daras G-803889 54 Aug 16 11:46 insertions.bed -rwxr-xr-x 1 daras G-803889 2.9M Aug 16 11:46 junctions.bed -rwxr-xr-x 1 daras G-803889 70 Aug 16 11:46 left_kept_reads.info drwxr-xr-x 2 daras G-803889 32K Aug 16 11:46 logs -rwxr-xr-x 1 daras G-803889 70 Aug 16 11:46 right_kept_reads.info -rwxr-xr-x 1 daras G-803889 9.7K Aug 16 11:46 unmapped_left.fq.z -rwxr-xr-x 1 daras G-803889 9.9K Aug 16 11:46 unmapped_right.fq.z
cufflinks [options] <hits.bam>
Look at run_commands/cufflinks.commands to see how tophat and cufflinks were run.
-rwxr-xr-x 1 daras G-803889 14M Aug 16 12:49 transcripts.gtf -rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking -rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking -rwxr-xr-x 1 daras G-803889 0 Aug 16 12:33 skipped.gtf
Exercise 1: If I wanted to provide a trancript annotation file (gtf file) in cufflinks command, what would I add to the command? Remember that you can type the command and hit enter to get help info about it.
i. If I wanted cufflinks to not assemble any novel transcripts outside of what is in the gtf file
ii. If I wanted cufflinks to assemble novel as well as annotated transcripts?
c) Merging assemblies using cuffmerge
find . -name transcripts.gtf > assembly_list.txt cuffmerge <assembly_list.txt>
The most important file is: -r-xr-xr-x 1 daras G-803889 32090408 Aug 16 20:11 merged.gtf
d) Finding differentially expressed genes and isoforms using cuffdiff
cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>
Exercise 2: What does cuffdiff -b do?
We will be inspecting the cuffdiff results further in a little bit.
e) Inspect the mapped results
Before you even start, do a sanity check on your data by examining the bam files from the mapping output.
I've included the directory "bwa_genome" containing the output from bwa of the C1_R1_1 and C1_R1_2 files mapped directly to the genome using bwa. Tophat output is in C1_R1_thout for C1_R1 sample. Report the total number of mapped reads for each sample using "samtools flagstat" for the bwa output (bam file) and the tophat output (bam file). There is a big difference between the bam file from tophat and the bam file from bwa. Can you spot it?
You might want to load the bwa bam file on IGV alongside the tophat bam file for the same sample to see the differences between mapping to the transcriptome and mapping to the genome.
Genome track
- Use D.melanogaster (dm3)
Data files (use Load from URL option)
- Load tophat bam file from URL: http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/C1_R1_accepted_hits.bam
- Load bwa bam file from URL: http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/bwa_genome/C1_R1.bam
- To load another tophat bam file from URL: http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/C2_R1_accepted_hits.bam
If you would like, you can also load another bam file output from the tophat run (remember, tophat calls bowtie for mapping). These are large file, so it may slow your computer down. Since we've already loaded C1_R1, let's load one from the C2 condition, say, C2_R1, to see genuine expression level changes.
Keep this open to come back to, when we further explore differential expression results.
f) Examine the differential expression analysis results
The cuffdiff output is in a directory called diff_out. We are going to spend some time parsing through this output. So, copy it over to your scratch directory and move to your SCRATCH directory.
cp -r diff_out $SCRATCH cds ls diff_out
Note that cuffdiff has performed a statistical test on the expression values between our two biological groups. It reports the FPKM expression levels for each group, the log2(group 1 FPKM/ group 2 FPKM), and a p-value measure of statistical confidence, among many other helpful data items.
-rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff : Differential expression testing for transcripts -rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff : Differential expression testing for genes -rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts -rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff : Differential expression testing for coding sequences -rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking -rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking -rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking -rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking -rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff : Differential splicing tests -rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff : Differential promoter usage -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff : Differential coding output.
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed. Look at one example of a up-regulated gene, regucalcin, on IGV.
Exercise 4: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
Using cummeRbund to inspect differential expression data.
Step 1: Load R and enter R environment
module load R R
Step 2: Within R environment, set up cummeRbund
source("http://bioconductor.org/biocLite.R") biocLite("cummeRbund")
Step 3: Load cummeRbund library and read in the differential expression results.
library(cummeRbund) cuff_data <- readCufflinks('diff_out')
Step 4: Use cummeRbund to visualize the differential expression results
gene_diff_data <- diffData(genes(cuff_data)) sig_gene_data <- subset(gene_diff_data, (significant == 'yes')) nrow(sig_gene_data) isoform_diff_data <-diffData(isoforms(cuff_data)) sig_isoform_data <- subset(isoform_diff_data, (significant == 'yes')) nrow(sig_isoform_data)
csScatter(genes(cuff_data), 'C1', 'C2')
mygene1 <- getGene(cuff_data,'regucalcin') expressionBarplot(mygene1) expressionBarplot(isoforms(mygene1))
mygene2 <- getGene(cuff_data, 'Rala') expressionBarplot(mygene2) expressionBarplot(isoforms(mygene2))
Take cummeRbund for a spin...
CummeRbund is powerful package with many different functions. Above was an illustration of a few of them. Try any of the suggested exercises below to further explore the differential expression results with different cummeRbund functions.
If you would rather just look at the resulting graphs ,they are at the URL: http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ as exercise5_Rplots.pdf, exercise6_Rplots.pdf, and exercise7_Rplots.pdf.
You can refer to the cummeRbund manual for more help and remember that ?functionName will provide syntax information for different functions.
http://compbio.mit.edu/cummeRbund/manual.html
You may need to redo Step 3. everytime you reopen an R session.
Exercise 5: Visualize the distribution of fpkm values across the two different conditions using a boxplot.
Exercise 6: Visualize the significant vs non-significant genes using a volcano plot.
Exercise 7: Pull out a subset of the genes using a ln_fold_change greater than 1.5. Generate expression bar plots for just those genes.
If you have trouble sourcing cummeRbund, try this:
module swap gcc intel
Reenter R and redo above steps.