Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...


Code Block
titleStarting R and loading modules after they are installed
login1$ R

library('DESeq2')
library('readr')
library('tximport')
library('rhdf5')

DESeq2

Input:

DESeq2 takes as input count data in several forms: a table form, with each column representing a biological replicate/biological condition. DESEQ2 can  also read data directly from htseq results, so we can use the 6 files we generated using htseq as input for DESeq2. The count data must be raw counts of sequencing reads, not already normalized data. 

Example:

                  C1_R1        C1_R2        C1_R3               C2_R1         C2_R2       C2_R2

        FBgn0000003    0                       0                       0                       0                       0                    0                  
        FBgn0000008    92                     161                   76                     70                     140                88                
        FBgn0000014    5                       1                       0                       0                       4                    0        

     

DESeq2 (with the use of an additional packages called tximport and readr) can read data directly from kallisto abundance files.  We will need to provide the location of the 6 abundance files, the sample names associated to each file and a Sample Table that gives the mapping between sample and condtion.

SampleName    Condition


DESeq2 Scripts:

  1. DESeq2 script to work with kallisto count output is provided below.

Code Block
titleDESEQ2 KALLISTO script
#load libraries
library(tximport)
library(readr)
library("DESeq2")
library('rhdf5')

#Import a file called file_list with all the locations of the abundance.tsv files
#eg below:
#/stor/SCRATCH/sample1/abundances.tsv
#/stor/SCRATCH/sample2/abundances.tsv
#/stor/SCRATCH/sample3/abundances.tsv
#/stor/SCRATCH/sample4/abundances.tsv
files<-as.character(read.table("file_list", header=FALSE)$V1)
#Import a file called samples with the sample names corresponding to each file in the file_list
#eg below:
#sample1
#sample2
#sample3
#sample4
#look at the data structures
files

samples<-as.character(read.table("samples",header=FALSE)$V1)
names(files)<-samples
#look at the data structures
samples
files
#IMPORTANT: MAKE SURE THE SAMPLES AND FILE_LIST ARE IN THE SAME ORDER- SAMPLES SHOULD MATCH UP WITH FILES
samples==rownames(sampleTable) #should return TRUE for all

#Import a file called sampletable which is a tab-delimited file that contains each samplename along with the condition
#eg below:
#samples        condition
#sample1        alc
#sample2        alc
#sample3        con
#sample4        con
sampleTable <-read.table("sampleTable",header=TRUE)
#look at the data structure


#Import a file called tx2gene.csv which a csv file that contains the transcript id to gene id mapping
#For Drosophila, this is located at: tx2gene.csv
tx2gene <- read.csv("tx2gene.csv")
#look at this data structure
#read in kallisto abundance files, summarizing by gene
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene)
names(txi)


#make a deseq2 object from the kallisto summarized counts
ddsMatrix <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
ddsMatrix


#Optionally, if you want to save this count matrix as a file
write.csv(assay(ddsMatrix), file="genecounts.raw.csv")


#Optionally, if you want to do variance stabilizing transformation or regularized log transformation on this count matrix and then save as a file: This can become input to things like wgcna, pca
vsd<- vst(ddsMatrix)
write.csv(assay(vsd), file="genecounts.variancestabilized.csv")
 
#estimate size factors, dispersion, normalize and perform negative binomial test to compare across conditions
dds<-DESeq(ddsMatrix)
#collect results from the statistical testing, order by adjusted pvalue and write into an output file
res<-results(dds)
resOrdered <- res[order(res$padj),]
summary(res)
write.csv(resOrdered, "deseq2_kallisto_C1_vs_C2.csv")
 
#generate MA plot
jpeg('MAPlot.jpg')
plotMA(dds,ylim=c(-2,2),main="DESeq2")
dev.off()



#save the R data and history
save.image(file="deseq2.kallisto.Rdata")
savehistory(file="deseq2.kallisto.Rhistory")
Code Block
titleRun the R script
R CMD BATCH deseq2.kallisto.R


2. DESeq2 script to work with Htseq count output

Code Block
titleDESEQ2 HTSEQ Script
library("DESeq2")
 
#GET HTSEQ COUNTS AND SET UP SAMPLE TABLE
directory<-(getwd())
samples <- c("C1_count1.gff", "C1_count2.gff", "C1_count3.gff", "C2_count4.gff", "C2_count5.gff", "C2_count6.gff")
conditions <- c("control", "control", "control", "treated","treated","treated")
sampleTable<-data.frame(sampleName=samples, fileName=samples, condition=conditions)
sampleTable


#BUILD A DESEQ2 OBJECT FROM THE HTSEQ COUNTS DATA
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)
colData(ddsHTSeq)$condition<-factor(colData(ddsHTSeq)$condition, levels=c("control", "treated"))
 
 
#LOOK AT THE DESEQ2 OBJECT WE'VE CREATED BY READING IN HTSEQ COUNT FILES
ddsHTSeq
 
#RUN THE STATISTICAL TEST IN ONE GO- NORMALIZATION, ESTIMATE DISPERSION/VARIANCE AND DO TEST FOR DIFFERENTIAL EXPRESSION 
dds<-DESeq(ddsHTSeq)
res<-results(dds)
res<-res[order(res$padj),]
mcols(res,use.names=TRUE)
summary(res)
 
#GENERATE MA PLOT
jpeg('MAPlot_htseq.jpg')
plotMA(dds,ylim=c(-2,2),main="DESeq2") 
dev.off()
 
#WRITE RESULTS INTO FILE
write.csv(as.data.frame(res),file="deseq2_htseq_C1_vs_C2.csv")
Code Block
titleRun the R script
#IF YOU DIDNT WANT TO RUN THE SCRIPT INTERACTIVELY
R CMD BATCH deseq2.htseq.R


Lets look at our results

Code Block
titleLook at results
head results/deseq2_kallisto_C1_vs_C2.csv

Find the top 10 upregulated genes

Code Block
titleFind the top 10 upregulated genes
#DESeq2 results
sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -r -k3,3|cut -f 1,3|head

 
#Notice the idiosyncracy with sort
sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -r -k3,3|grep -v 'e-0'|cut -f 1,3|head

Find the top 10 downregulated genes

Code Block
titleFind the top 10 upregulated genes
#DESeq2 results

sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -k3,3|cut -f 1,3|head
 
#Notice the idiosyncracy with sort
sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|sort -n -k3,3|grep -v 'e-0'|cut -f 1,3|head

2.  Select DEGs with following cut offs-  Fold Change >=2 (or <= -2) (this means log 2 fold change >= 1 or <=-1) and adj p value < 0.05 and count how many DEGs we have

Code Block
titleCount the number of DEG
#DESeq2 results
sed 's/,/\t/g' results/deseq2_kallisto_C1_vs_C2.csv|awk '{if ((($3>=1)||($3<=-1))&&($7<=0.05)) print $1,$3,$7}'|wc -l


If you wanted to use DESeq2 for more complicated designs (with multiple factors, multiple levels), you can by adjusting two things: design and contrast.

Advanced options


#One factor
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~condition)
 
#Two factors (one factor has multiple levels): condition (control, treated), and sequencing type (single, paired, matepair)
ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~type + condition)
 
#Fold changes will by default be provided for condition (treated vs control)
res <- results(dds)
#To view fold changes for sequencing type, use contrast
res <- results(dds, contrast = c("type""paired""single"))
res <- results(dds, contrast = c("type""matepair""single"))


DEXSeq

This package is meant for finding differential exon usage between samples from different conditions.

Relative usage of an exon  =   transcripts from the gene that contain this exon / all transcripts from the gene 

For each exon (or part of an exon) and each sample :

  1. count how many reads map to this exon
  2. count how many reads map to other exons of the same gene.
  3. calculate ratio of 1 to 2. 
  4. Look for changes in this ratio across conditions
  5. Look for statistically significant changes in this ratio across conditions, by using replicates.

This lets you identify changes in alternative splicing, changes in usage of alternative transcript start sites.

Ballgown

Ballgown (a part of the new tuxedo suite) is a popular tool for testing for differential expression.  We will cover this along with the rest of the tuxedo suite. 

BACK TO COURSE OUTLINE