In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and mutant strains of the bacterium Listeria monocytogenes.
Download data files
Copy the data files for this example into your $SCRATCH space:
cds cp -r $BI/ngs_course/listeria_RNA_seq/data listeria_RNA_seq
File Name |
Description |
Sample |
---|---|---|
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
ΔsigB mutant, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 2 |
|
Single-end Illumina 36-bp reads |
ΔsigB mutant, biological replicate 2 |
|
Reference Genome sequence (FASTA) |
Listeria monocytogenes strain 10403S |
|
Reference Genome features (GFF) |
Listeria monocytogenes strain 10403S |
This data was submitted to the Short Sequence Read Archive (SRA) to accompany this paper:
- Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. Pubmed
You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753
Installing Bioconductor modules for R
Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is likely to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!") To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we were using, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$
to >
when you are in the R shell, to help clue you in to this fact.
R is the favorite language of pirates.
R is a very common scripting language used in statistics. There are whole courses on using R going on in other SSI classrooms as we speak! Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA ...). R also has advanced functionality for producing plots and graphs as output. We'll take advantage of all of this here. You are well on your way to becoming denizens of the polyglot bioinformatics community now.
Regrettably, R
is a bit of it's own bizarro world, as far as how its commands work. (Futhermore, Googling "R" to get help can be very frustrating.) The conventions of most other programming and scripting languages seem to have been re-invented by someone who wanted to do everything their own way in R. Just like we wrote shell scripts in bash
, you can write R
scripts that carry out complicated analyses.
Basic rules of R:
- Don't forget: it's
q()
to quit. - For help, type
?command
. Try?read.table
. Theq
key gets you out of help, just like for aman
page. - The left arrow
<-
(less-than-dash) is the same as an equals sign=
. You can use them interchangeably. - The prompt we will sometimes be showing for R is
>
. Don't type this for a command. It is like thelogin1$
at the beginning of the bash prompt when you log in to Lonestar. It just means that you are in theR
shell. - You can type the name of a variable to have its value displayed. Like this...
> x <- 10 + 5 + 6 > x [1] 21
Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can do things for you like convert sequences, but where it really shines is in doing statistical tests (where is it second-to-none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.
Here's how you install two modules that we will need for this exercise:
login1$ module load R login1$ R R version 2.14.0 (2011-10-31) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > source("http://bioconductor.org/biocLite.R") ... > biocLite("DESeq") ... > biocLite("edgeR") ... > q() Save workspace image? [y/n/c]: n
When you start R later, you can load these modules with just these commands:
login1$ R > library("DESeq") > library("edgeR")
These commands will work for any Bioconductor modules!
Align the reads
For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)
We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson. Be careful we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!
You will need to first build the index file, just once and in "interactive mode" is fine. Then, you will need to submit a commands
file with four lines to the TACC queue.
Please give the final output files the names: SRR034450.sam
, SRR034451.sam
, SRR034452.sam
, SRR034453.sam
.
- What are the dispersion estimates?
- What was the predominant effect of the mutation on gene expression in this Listeria strain?
Analyze differential gene expression (edgeR)
edgeR is another R package that you can use to do a similar analysis.
These commands use the negative binomial model, calculate the false discovery rate (FDR ~ adjusted p-value), and make a plot similar to the one from DESeq.
login1$ R ... > library("edgeR") > counts = read.delim("gene_counts.tab", header=F, row.names=1) > colnames(counts) = c("wt1", "mut1", "wt2", "mut2") > head(counts) > group <- factor(c("wt","mut","wt","mut")) > dge = DGEList(counts=counts,group=group) > dge <- estimateCommonDisp(dge) > dge <- estimateTagwiseDisp(dge) > et <- exactTest(dge) > etp <- topTags(et, n=100000) > etp$table$logFC = -etp$table$logFC > pdf("edgeR-MA-plot.pdf") > plot( etp$table$logCPM, etp$table$logFC, xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3, col = ifelse( etp$table$FDR < .1, "red", "black" ) ) > dev.off() > write.csv(etp$table, "edgeR-wt-vs-mut.csv")
Note that the "FC" fold change calculated is initially the reverse of that for the DESeq example for the output here. It is wt relative to mut. To fix this, we put a negative in there for the log fold change.
Exercises
- Compare the expression changes predicted by DESeq and edgeR to each other.
- Does edgeR or DESeq predict more significant changes?
Additional Points
- In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit, FAR, or cutadapt before aligning.
- You can get a lot more information from RNAseq data than you could from a microarray experiment. You can map transcriptional start sites, areas of unexpected transcription, splice sites, etc. - all because you have full sequence information that we have barely used in this example.