Differential expression with splice variant analysis aug2012

Differential expression with splice variant analysis at the same time: the Tophat/Cufflinks workflow

In this lab, you will explore a fairly typical RNA-seq analysis workflow. Simulated RNA-seq data will be provided to you; the data contains 50 bp paired-end reads that originated from Drosophila and have been generated in silico to replicate real gene count data. The data simulates two biological groups with three biological replicates per group (6 samples total). Your job is to determine what genes, if any, are significantly differentially expressed, to see if there is evidence of alternative splicing between the two samples, and if there is alternative splicing, to determine if that gene has some known disease association.

Your workflow would be:

a) map reads against the Drosophila genome (using tophat) (this has already been done to save time)
b) assemble the putative transcripts (using cufflinks) (this has already been done to save time)
c) view the putative transcripts using IGV
d) compute differential expression (using cuffdiff) and find genes and isoforms significantly differentially expressed (by parsing output)
e) examine some differential splicing
f) compare assembled transcripts to annotated transcripts to identify potentially novel ones (using cuffcmp)
g) look up the gene that shows differential splicing in the UCSC browser and determine what tissue these RNA samples probably came from.

You will almost certainly need to consult the documentation for tophat and cufflinks:
http://tophat.cbcb.umd.edu/manual.html
http://cufflinks.cbcb.umd.edu/manual.html

Six raw data files were provided as the starting point: c1_r1, c1_r2, c1_r3 from the first biological group, and c2_r1, c2_r2, and c2_r3 from the second.

Due to the size of the data and length of run time, *all the programs have already been run for this exercise*. Shell scripts and their log files are provided so you can see the actual commands used and run output. You will be parsing the output, finding answers, and visualizing results.

Download the data: http://web.corral.tacc.utexas.edu/GSAF/lab9.tar.gz, unzip and untar it on your local computer, not Lonestar:

Download completed tophat analyzed data into a new directory

wget http://web.corral.tacc.utexas.edu/GSAF/lab9.tar.gz
gunzip lab9.tar.gz
tar -xvf lab9.tar
cd tophat

Find and inspect your data: 35 points

The file "tophat.sh" was used to run tophat and cufflinks on the input fasta files. The log files in the main tophat directory contain the output from the respective tophat runs, while the file "tophat.sh.log" contains the output from all the cufflinks runs. Use this information to figure out where the output you need has wound up.

Before you even start, do a sanity check on your data by examining the bam files from the mapping output.

I've included the directory "bwa_genome" containing the output from bwa of the mut1.1.fasta file mapped directly to the genome using bwa. You might want to load this into IGV alongside the tophat/bowtie bam file for the same sample to see the differences between mapping to the transcriptome and mapping to the genome (like you lose a bunch of reads).

Report the total number of input reads and the total number of mapped reads for each sample using "samtools flagstat". For mut1.1, compare the percentage mapped to the genome to the percentage mapped via tophat/bowtie to the genome.

For c) figure out where the output .bed and .bam files are for each of the input fasta files.

Once you've figured that out, load all 6 junction.bed files into IGV (you might want to change the names of the files so you can remember which is which). From here, you can zoom in to a region where there is data (look for the peaks in your .bed file input).

Now, load ONE of the bam file outputs from the tophat run (remember, tophat calls bowtie for mapping). These are large files; you can load more than one, just be prepared that it may slow your computer down. If you want to load two, I'd suggest loading one from mut1 and one from mut2 so you see genuine expression level changes.

Reminder on how to download and run IGV

Downloading and running IGV

wget http://www.broadinstitute.org/igv/projects/downloads/IGV_2.1.14.zip
unzip IGV_2.1.14.zip
cd IGV_2.1.14
java -Xmx2g -jar igv.jar

Optional - load data into UCSC genome browser

Now, Google "UCSC genome browser", go there, and get to the human genome view so you can load your data into the UCSC browser.

For now, we'll just load the bed files which are very small. Select, "add custom tracks" and figure out how to add the junction.bed file from the mut1 set. Notice that coordinates are identical between IGV and UCSC - you can copy-and-paste your location from IGV into UCSC - do that and confirm that you see the isoforms from your custom tracks. Now "manage custom tracks" so you can label the track "mut1". While you're there, also load one of the mut2 .bed files.

Differential expression analysis: 50 points

For d), see the file "cuffdiff.sh" and "cuffdiff.sh.log" for the input, command line, and output. Then, find the cuffdiff output (either by understanding cuffdiff.sh or by looking) and by looking at it and/or reading the documentation find the isoforms and genes that are differentially expressed. Note that cuffdiff has performed a statistical test on the expression values between our two biological groups. It reports the FPKM expression levels for each group, the log2(group 1 FPKM/ group 2 FPKM), and a p-value measure of statistical confidence, among many other helpful
data items.

25 points: Find the two genes that are differentially expressed at the gene level, but not at the isoform level. This can be done with a python or perl script, or with a one-line linux command.

Solution

ENSG00000121388
ENSG00000154359

Hint

One-line command to identify DE genes but not isoforms

cat gene_exp.diff isoform_exp.diff | grep yes | awk '{print $2}' | sort | uniq -u

Bonus points: explain why there is a difference.

Solution

The two exons missing in at least some isoforms of mut1 are the left-most two:

chr6:29,569,940-29,570,922
chr6:29,576,972-29,577,216

Hint

Use a visualization tool, and if it’s IGV, using “load from server -> Annotations -> Genes -> Ensemble genes” will make this task much easier.

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

Linux one-liner for sorting cuffdiff output by log2 fold-change values

cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Note that you can sort by a different column, like 12 (p-value) instead of 10; sort can also sort on more than one column.

Now go to the GABBR1 gene in either IGV or UCSC browsers. Hint: you can type a gene name directly into the coordinate box on either browser and if it can find a match it will take you there.

25 points: Identify the two exons of GABBR1 that appear to not be expressed in at least some isoforms of mut1.

Data Interpretation: 15 points

For f), go to USCS and load the OMIM Genes track, then use the browser to link to the OMIM entry related to the GABBR1 gene.

For 15 points, read the OMIM entry sufficiently to report the likely organ we extracted RNA from to generate this data set.

Solution

Brain

Space shortcuts

Page tree

Differential expression with splice variant analysis at the same time: the Tophat/Cufflinks workflow

Your workflow would be:

Find and inspect your data: 35 points

Before you even start, do a sanity check on your data by examining the bam files from the mapping output.

Differential expression analysis: 50 points

Data Interpretation: 15 points