...
RPKM = RPK/total no.of reads in million (total no of reads/ 1000000)
DESeq2 | edgeR | DEXSeq | Cuffdiff | |
---|---|---|---|---|
Normalization | Median scaling size factor | TMM | Median scaling size factor | FPKM , but also has provisions for others |
Distribution | Negative binomial | Negative binomial | Negative binomial | Negative binomial |
DE Test | Negative binomial test | Fisher exact test | Modified T test | T test |
Advantages | Straightforward, fast, DESeq2 allows for complicated study designs, with multiple factors
| Straightforward, fast, good with small number of replicates.
| Good for identifying exon-usage changes | Good for identifying isoform-level changes, splicing changes, promotor changes. Not as straightforward, somewhat of a black box |
R and Bioconductor, very briefly...
...
Code Block | ||
---|---|---|
| ||
login1$ R library("DESeq2") These commands will work for any Bioconductor package! |
DESeq2
Input:
DESeq2 takes as input count data in several forms: a table form, with each column representing a biological replicate/biological condition. DESEQ2 can also read data directly from htseq results, so we can use the 6 files we generated using htseq as input for DESeq2. The count data must be raw counts of sequencing reads, not already normalized data.
Example:
C1_R1 C1_R2 C1_R3 C2_R1 C2_R2 C2_R2
FBgn0000003 0 0 0 0 0 0
FBgn0000008 92 161 76 70 140 88
FBgn0000014 5 1 0 0 4 0
We also need a table that gives the following information:
Sample Name Counts File Condition
DESeq2 (with the use of an additional packages called tximport and readr) can read data directly from kallisto abundance files. We will need to provide the location of the 6 abundance files, and a Sample Table that gives the mapping between sample and condtion.
SampleName Condition
Unfortunately, tximport is only available for the R-3.3.0 and above. Since TACC does not have this version of R, you will not be able to currently run this R script on TACC.
DESeq2 Scripts:
- DESeq2 script to work with kallisto count output is provided below.
Code Block | ||
---|---|---|
| ||
#load libraries library(tximport) library(readr) library("DESeq2") #Import a file called file_list with all the locations of the abundance.tsv files #eg below: #/stor/SCRATCH/sample1/abundances.tsv #/stor/SCRATCH/sample2/abundances.tsv #/stor/SCRATCH/sample3/abundances.tsv #/stor/SCRATCH/sample4/abundances.tsv files<-as.character(read.table("file_list", header=FALSE)$V1) #Import a file called samples with the sample names corresponding to each file in the file_list #eg below: #sample1 #sample2 #sample3 #sample4 samples<-as.character(read.table("samples",header=FALSE)$V1) names(files)<-samples #Import a file called sampletable which is a tab-delimited file that contains each samplename along with the condition #eg below: #samples condition #sample1 alc #sample2 alc #sample3 con #sample4 con sampleTable <-read.table("sampleTable",header=TRUE) sampleTable #Import a file called tx2gene.csv which a csv file that contains the transcript id to gene id mapping #For Drosophila, this is located at: tx2gene.csv tx2gene <- read.csv("tx2gene.csv") #read in kallisto abundance files, summarizing by gene txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv) names(txi) #make a deseq2 object from the kallisto summarized counts ddsMatrix <- DESeqDataSetFromTximport(txi, sampleTable, ~condition) ddsMatrix #estimate size factors, dispersion, normalize and perform negative binomial test to compare across conditions dds<-DESeq(ddsMatrix) #collect results from the statistical testing, order by adjusted pvalue and write into an output file res<-results(dds) resOrdered <- res[order(res$padj),] summary(res) write.csv(resOrdered, "deseq2_kallisto_C1_vs_C2.csv") #generate MA plot jpeg('MAPlot.jpg') plotMA(dds,ylim=c(-2,2),main="DESeq2") dev.off() |
Code Block | ||
---|---|---|
| ||
R CMD BATCH deseq2.kallisto.R |
2. DESeq2 script to work with Htseq count output
Code Block | ||
---|---|---|
| ||
library("DESeq2") #GET HTSEQ COUNTS AND SET UP SAMPLE TABLE directory<-(getwd()) samples <- c("C1_count1.gff", "C1_count2.gff", "C1_count3.gff", "C2_count4.gff", "C2_count5.gff", "C2_count6.gff") conditions <- c("control", "control", "control", "treated","treated","treated") sampleTable<-data.frame(sampleName=samples, fileName=samples, condition=conditions) sampleTable #BUILD A DESEQ2 OBJECT FROM THE HTSEQ COUNTS DATA ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition) colData(ddsHTSeq)$condition<-factor(colData(ddsHTSeq)$condition, levels=c("control", "treated")) #LOOK AT THE DESEQ2 OBJECT WE'VE CREATED BY READING IN HTSEQ COUNT FILES ddsHTSeq #RUN THE STATISTICAL TEST IN ONE GO- NORMALIZATION, ESTIMATE DISPERSION/VARIANCE AND DO TEST FOR DIFFERENTIAL EXPRESSION dds<-DESeq(ddsHTSeq) res<-results(dds) res<-res[order(res$padj),] mcols(res,use.names=TRUE) summary(res) #GENERATE MA PLOT jpeg('MAPlot_htseq.jpg') plotMA(dds,ylim=c(-2,2),main="DESeq2") dev.off() #WRITE RESULTS INTO FILE write.csv(as.data.frame(res),file="deseq2_htseq_C1_vs_C2.csv") |
Code Block | ||
---|---|---|
| ||
R CMD BATCH deseq2.htseq.R |
Lets look at our results
Code Block | ||
---|---|---|
| ||
head results/deseq2_kallisto_C1_vs_C2.csv |
Find the top 10 upregulated genes
Code Block | ||
---|---|---|
| ||
#DESeq2 results sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -r -k3,3|cut -f 1,3|head #Notice the idiosyncracy with sort sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -r -k3,3|grep -v 'e-0'|cut -f 1,3|head |
Find the top 10 downregulated genes
Code Block | ||
---|---|---|
| ||
#DESeq2 results sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -k3,3|cut -f 1,3|head #Notice the idiosyncracy with sort sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -k3,3|grep -v 'e-0'|cut -f 1,3|head |
2. Select DEGs with following cut offs- Fold Change >=2 (or <= -2) (this means log 2 fold change >= 1 or <=-1) and adj p value < 0.05 and count how many DEGs we have
Code Block | ||
---|---|---|
| ||
#DESeq2 results sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|awk '{if ((($3>=1)||($3<=-1))&&($6<$7<=0.05)) print $1,$3,$6$7}'|wc -l |
If you wanted to use DESeq2 for more complicated designs (with multiple factors, multiple levels), you can by adjusting two things: design and contrast.
Advanced options
#One factor ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition) #Two factors (one factor has multiple levels): condition (control, treated), and sequencing type (single, paired, matepair) ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~type + condition) #Fold changes will by default be provided for condition (treated vs control) res <- results(dds) #To view fold changes for sequencing type, use contrast res <- results(dds, contrast = c( "type" , "paired" , "single" )) res <- results(dds, contrast = c( "type" , "matepair" , "single" )) |
DEXSeq
This package is meant for finding differential exon usage between samples from different conditions.
Relative usage of an exon = transcripts from the gene that contain this exon / all transcripts from the gene
For each exon (or part of an exon) and each sample :
- count how many reads map to this exon
- count how many reads map to other exons of the same gene.
- calculate ratio of 1 to 2.
- Look for changes in this ratio across conditions
- Look for statistically significant changes in this ratio across conditions, by using replicates.
This lets you identify changes in alternative splicing, changes in usage of alternative transcript start sites.
Ballgown
Ballgown (a part of the new tuxedo suite) is a popular tool for testing for differential expression. We will cover this along with the rest of the tuxedo suite.
BACK TO COURSE OUTLINE