In this lab, you will explore a fairly typical RNA-seq analysis workflow. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:
i. How the workflow was run and what steps are involved.
ii. What genes and isoforms are significantly differentially expressed
Six raw data files were provided as the starting point:
c1_r1, c1_r2, c1_r3 from the first biological condition, and c2_r1, c2_r2, and c2_r3 from the second.
Due to the size of the data and length of run time, *most of the programs have already been run for this exercise*. The commands run are in different *.commands files. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results.
An overview of tophat cufflinks workflow
This section is all located under $BI/ngs_course/tophat_cufflinks/ So cd into this directory now.
Commands used are in different *.commands files located in $BI/ngs_course/tophat_cufflinks/run_commands
Some output, like bam files can be gotten from the URL http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ and directly loaded into IGV, using Load from URL.
If you generate your own output and would like to view them on IGV, you will need to scp the files from lonestar to your computer.
On your computer's side: Go to the directory where you want to copy files to.
Replace the "/home/..." with the "pwd" information obtained earlier. This command would transfer "stuff.fastq" from the specified directory on Lonestar to your current directory on your computer. |
For help, remember to type the commands and hit enter to see what help they can offer you. You will also almost certainly need to consult the documentation for tophat, cufflinks and cummeRbund:
http://tophat.cbcb.umd.edu/manual.html
http://cufflinks.cbcb.umd.edu/manual.html
http://compbio.mit.edu/cummeRbund/manual.html
a) map reads against the Drosophila genome (using tophat) (this has already been done to save time)
b) assemble the putative transcripts (using cufflinks) (this has already been done to save time)
c) merge the assemblies (using cuffmerge) (this has already been done to save time)
d) compute differential expression (using cuffdiff) (this has already been done to save time)
e) inspect the results
f) examine the differential expression results (using unix, IGV and cummeRbund)
Extra: g) compare assembled transcripts to annotated transcripts to identify potentially novel ones (using cuffcmp)
cat $BI/tophat_cufflinks/run_commands/tophat.commands |
On lonestar, to run tophat, cufflinks etc, following modules need to be loaded.
module load boost/1.45.0 module load bowtie module load tophat module load cufflinks/2.0.2 |
tophat [options] <bowtie_index_prefix> <reads1> <reads2> |
-rwxr-xr-x 1 daras G-803889 323M Aug 16 11:47 accepted_hits.bam -r-xr-xr-x 1 daras G-803889 237K Aug 16 11:46 accepted_hits.bam.bai -rwxr-xr-x 1 daras G-803889 52 Aug 16 11:46 deletions.bed -rwxr-xr-x 1 daras G-803889 54 Aug 16 11:46 insertions.bed -rwxr-xr-x 1 daras G-803889 2.9M Aug 16 11:46 junctions.bed -rwxr-xr-x 1 daras G-803889 70 Aug 16 11:46 left_kept_reads.info drwxr-xr-x 2 daras G-803889 32K Aug 16 11:46 logs -rwxr-xr-x 1 daras G-803889 70 Aug 16 11:46 right_kept_reads.info -rwxr-xr-x 1 daras G-803889 9.7K Aug 16 11:46 unmapped_left.fq.z -rwxr-xr-x 1 daras G-803889 9.9K Aug 16 11:46 unmapped_right.fq.z |
cufflinks [options] <hits.bam> |
-rwxr-xr-x 1 daras G-803889 14M Aug 16 12:49 transcripts.gtf -rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking -rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking -rwxr-xr-x 1 daras G-803889 0 Aug 16 12:33 skipped.gtf |
Exercise 1: If I wanted to provide a trancript annotation file (gtf file) in cufflinks command, what would I add to the command? Remember that you can type the command and hit enter to get help info about it.
i. If I wanted cufflinks to not assemble any novel transcripts outside of what is in the gtf file
ii. If I wanted cufflinks to assemble novel as well as annotated transcripts?
-G to use only the annotated transcripts in the gtf file |
find . -name transcripts.gtf > assembly_list.txt cuffmerge <assembly_list.txt> |
The most important file is: -r-xr-xr-x 1 daras G-803889 32090408 Aug 16 20:11 merged.gtf |
cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam> |
Exercise 2: What does cuffdiff -b do?
-b is for enabling fragment bias correction. |
We will be inspecting the cuffdiff results further in a little bit.
I've included the directory "bwa_genome" containing the output from bwa of the C1_R1_1 and C1_R1_2 files mapped directly to the genome using bwa. Tophat output is in C1_R1_thout for C1_R1 sample. Report the total number of mapped reads for each sample using "samtools flagstat" for the bwa output (bam file) and the tophat output (bam file). There is a big difference between the bam file from tophat and the bam file from bwa. Can you spot it?
|
You might want to load the bwa bam file on IGV alongside the tophat bam file for the same sample to see the differences between mapping to the transcriptome and mapping to the genome.
Genome track
Data files (use Load from URL option)
Now, lets load another bam file output from the tophat run (remember, tophat calls bowtie for mapping). These are large file, so it may slow your computer down. Since we've already loaded C1_R1, let's load one from the C2 condition, say, C2_R1, to see genuine expression level changes. Go to gene regucalcin (remember, you can just type the gene name to get there) and see if it is differentially expressed between C1_R1 and C2_R1.
Keep this open to come back to, when we further explore differential expression results.
|
The cuffdiff output is in a directory called diff_out. We are going to spend some time parsing through this output. So, copy it over to your scratch directory and move to your SCRATCH directory.
cp \-r diff_out $SCRATCH cd $SCRATCH/diff_out |
Note that cuffdiff has performed a statistical test on the expression values between our two biological groups. It reports the FPKM expression levels for each group, the log2(group 1 FPKM/ group 2 FPKM), and a p-value measure of statistical confidence, among many other helpful data items.
The most important file is: -rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff-rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff-rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff-rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff -rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking -rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking -rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking -rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking -rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff -rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff |
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head |
Exercise 3: Find the 10 most up-regulated genes, by fold change. Look at one example of a up-regulated gene, regucalcin, on IGV. Find the 10 most down-regulated genes, by fold change. Look at one example of a down-regulated gene, on IGV.
Top 10 upregulated genes Top 10 downregulated genes |
|
Exercise 4: Find the 10 most up-regulated isoforms, by fold change. What genes do they belong to?
simj |
|
Step 1: Load R and enter R environment
module load R R |
Step 2: Within R environment, set up cummeRbund
source("http://bioconductor.org/biocLite.R") biocLite("cummeRbund") |
Step 3: Load cummeRbund library and read in the differential expression results.
library(cummeRbund) cuff_data <- readCufflinks('diff_out') |
Step 4: Use cummeRbund to visualize the differential expression results
gene_diff_data <- diffData(genes(cuff_data)) sig_gene_data <- subset(gene_diff_data, (significant == 'yes')) nrow(sig_gene_data) isoform_diff_data <-diffData(isoforms(cuff_data), 'C1', 'C2') sig_isoform_data <- subset(isoform_diff_data, (significant == 'yes')) nrow(sig_isoform_data) |
csScatter(genes(cuff_data), 'C1', 'C2') |
mygene1 <- getGene(cuff_data,'regucalcin') expressionBarplot(mygene1) expressionBarplot(isoforms(mygene1)) |
mygene2 <- getGene(cuff_data, 'Rala') expressionBarplot(mygene2) expressionBarplot(isoforms(mygene2)) |
CummeRbund is powerful package with many different functions. Above was an illustration of a few of them. Try any of the suggested exercises below to further explore the differential expression results with different cummeRbund functions.
If you would rather just look at the resulting graphs ,they are at the URL: http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ as exercise5_Rplots.pdf, exercise6_Rplots.pdf, and exercise7_Rplots.pdf.
You can refer to the cummeRbund manual for more help and remember that ?functionName will provide syntax information for different functions.
http://compbio.mit.edu/cummeRbund/manual.html
You may need to redo Step 3. everytime you reopen an R session.
Exercise 5: Visualize the distribution of fpkm values across the two different conditions using a boxplot.
csBoxplot(isoforms(cuff_data)) |
Use csBoxplot function on cuff_data object to generate a boxplot of gene or isoform level fpkms. |
Exercise 6: Visualize the significant vs non-significant genes using a volcano plot.
csVolcano(genes(cuff_data), "C1", "C2") |
Use csVolcano function on cuff_data object to generate a volcano plot. |
Exercise 7: Pull out a subset of the genes using a p_value cutoff of 0.01. Generate expression bar plots for just those genes.
myGenes <- getGenes(cuff_data, sig_geneids) |
If you have trouble sourcing cummeRbund, try this:
module swap gcc intel
Reenter R and redo above steps.