Clustering using WGCNA

Overview: The WGCNA package (in R) uses functions that perform a correlation network analysis of large, high-dimensional data sets (RNAseq datasets). This unbiased approach clusters similarly expressed genes into groups (termed 'modules') which are then correlated with quantitative or categorical traits measured in the experiment. Modules can be further analyzed using GO, KEGG or KOG, VisANT and Cytoscape. This approach goes beyond a simple 'list of genes' and helps tease apart large, complicated RNAseq datasets into functional clusters that are more easy to interpret.

Figure 1: Overview of WGCNA package

Download sample data files and R code here: WGCNA.zip

Explanation of sample dataset: Time series of coral larval development from 4 hours post fertilization (Day 0) to 245 hours post fertilization (Day 12). Multiple other quantitative traits were measured through the time series. Only green and red fluorescence are added as quantitative traits in the sample dataset. Dataset has 48 samples total, four replicates (A-D) over 12 days. The goal is to find genes that correlate with developmental traits through time and differences in gene expression between early larval development and late larval development.

Download PPT: WGCNA2015.pptx

Step 1: upload data into R and reformat for WGCNA

Load data into WGCNA and reformat

#install packages and load libraries
source("http://bioconductor.org/biocLite.R")
biocLite("WGCNA")
install.packages("flashClust") 
library(WGCNA)
library(flashClust)
#STEP 1: uploading data into R and formatting it for WGCNA

#set your current working directory (where all your files are)
setwd("")

#this reads in the normalized counts file from DESeq2
time=read.csv("SampleTimeSeriesRLD.csv")

#Adjust file so it matches format WGCNA needs 
time=as.data.frame(time)
rownames(time) <-time$X
time$X=NULL
datExpr= as.data.frame(t(time[,]))
names(datExpr)= row.names(time)
rownames(datExpr)=names(time)
dim(datExpr)
#run this to check if there are gene outliers
gsg=goodSamplesGenes(datExpr, verbose = 3)
gsg$allOK 
#If the last statement returns TRUE, all genes have passed the cuts. If not, we remove the offending genes and samples from the data with the following:
#if (!gsg$allOK)
#	{if (sum(!gsg$goodGenes)>0)
#		printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse= ", ")));
#		if (sum(!gsg$goodSamples)>0)
#			printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse=", ")))
#		datExpr0= datExpr0[gsg$goodSamples, gsg$goodGenes]
#		}
#upload your trait data
datTraits= read.csv("Traits_23May2015.csv")
head(datTraits)

#form a data frame analogous to expression data that will hold the clinical traits.
rownames(datTraits) <- datTraits$Sample
datTraits$Sample <- NULL
table(rownames(datTraits)==rownames(datExpr)) #should return TRUE if datasets align correctly, otherwise your names are out of order

#expression data is in datExpr, corresponding clinical traits are datTraits

sampleTree2=flashClust(dist(datExpr), method="average")
traitColors= numbers2colors(datTraits, signed= FALSE)
plotDendroAndColors(sampleTree2, traitColors, groupLabels= names(datTraits), main="Sample Dendrogram and Trait heatmap")


#output data from R
save(datExpr, datTraits, file="SamplesAndTraits.RData")

Figure 2: Clustering of samples and traits. Day 0-12 are categorical traits (1 or 0). Hour post fertilization (HPF), RedFluoro, and Green Fluoro are quantitative traits measured for each sample.

At this point you will need to identify sample outliers and choose a soft threshold power. These are easy to do and are well documented in the online tutorials. Scripts for choosing a soft threshold are commented out in the attached R file. It's important to choose the correct soft threshold for your dataset.

Figure 3: Soft Thresholding: from this plot, we would choose a power of 12 since it's the lowest power for which the scale free topology index reaches 0.90 (red line)

Step 2: Construct a gene co-expression network and identify modules

Figure 4: Clustering dendrogram of all genes, with dissimilarities based on topological overlap. Each vertical line represents a single gene. Assigned module colors below.

Step 3: Relate modules to external traits

Figure 5: Module-Trait relationships. Color scale (red-blue) represents the strength of the correlation between the module and the trait. For example, the turquoise module is highly significantly correlated with HPF, RedFluoro and GreenFluoro. Each box gives a correlation value (R^2) followed by p-value (in parenthesis).

Step 4: Characterize modules and relationships with traits

Figure 6: Heatmap of genes within the 'dark orange' module (from entire dataset, not sample dataset) showing up (red) or down (green) regulation through larval development.

Space shortcuts

Page tree