Differential gene expression analysis

Overview

In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and a mutant strain of the bacterium Listeria monocytogenes.

Learning Objectives

Review mapping reads with an example of how to use qsub to map many data sets in parallel on TACC.
Review samtools and SAM/BAM conversion.
How to use bedtools/HTseq to count reads overlapping genes.
Basic use of the R shell and installing BioConductor modules.
Use edgeR/DESeq to perform statistical analyses of differential gene expression.

Preliminary

Download data files

Copy the data files for this example into your $SCRATCH space:

cds
cp -r $BI/ngs_course/listeria_RNA_seq/data listeria_RNA_seq

File Name	Description	Sample
`SRR034450.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 1
`SRR034451.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 1
`SRR034452.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 2
`SRR034453.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 2
`NC_017544.1.fasta`	Reference Genome sequence (FASTA)	Listeria monocytogenes strain 10403S
`NC_017544.1.gff`	Reference Genome features (GFF)	Listeria monocytogenes strain 10403S

This data was submitted to the Sequence Read Archive (SRA) to accompany this paper:

Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. Pubmed

You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753

If you want to skip the read alignment step...

To get right to the new stuff, you can copy the mapped read BAM files and the reference sequence files that you will need using these commands:

cds
cp -r $BI/ngs_course/listeria_RNA_seq/mapped_data listeria_RNA_seq

Then, skip the mapping and the SAM/BAM conversion, sorting, and indexing steps below.

Install Bioconductor modules for R

Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is probably going to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!") To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we normally use, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$ to > when you are in the R shell, to help clue you in to this fact. The R shell is inside the bash shell. So when you quit R, you will be back where you were in the bash shell.

R is the favorite language of pirates.

R is a very common scripting language used in statistics. There are whole courses on using R going on in other SSI classrooms as we speak! Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA ...). R also has advanced functionality for producing plots and graphs as output. We'll take advantage of all of this here. You are well on your way to becoming denizens of the polyglot bioinformatics community now.

Regrettably, R is a bit of its own bizarro world as far as how its commands work. (Furthermore, Googling "R" to get help can be very frustrating.) The conventions of most other programming and scripting languages seem to have been re-invented by someone who wanted to do everything their own way in R. Just like we wrote shell scripts in bash, you can write R scripts that carry out complicated analyses.

Do not copy the > characters in the R examples.

They are the R prompt to remind you which commands are to be run inside the R shell!

Basic rules of R:

Don't forget: it's q() to quit.
For help, type ?command. Try ?read.table. The q key gets you out of help, just like for a man page.
The left arrow <- (less-than-dash) assigns a value to a variable, just like an equals sign =. For simple assignments like the ones in this exercise, you can use them interchangeably.
The prompt we will sometimes be showing for R is >. Don't type this for a command. It is like the login1$ at the beginning of the bash prompt when you log in to Lonestar. It just means that you are in the R shell.
You can type the name of a variable to have its value displayed. Like this...
```
> x <- 10 + 5 + 6
> x
[1] 21
```

Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can theoretically do things for you like convert sequences (none of us use it for that), but where it really shines is in doing statistical tests (where it is second-to-none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.

Here's how you install two modules that we will need for this exercise:

The install commands may take several minutes to complete. You can read ahead while they run.

Starting R and loading the modules for this tutorial

login1$ module load R
login1$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("http://bioconductor.org/biocLite.R")
...
> biocLite("DESeq")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n

When you start R later, you will not need to reinstall the modules. You can load them with just these commands:

login1$ R
> library("DESeq")
> library("edgeR")

These commands will work for any Bioconductor module!

Create BAM file of mapped reads

Map reads using Bowtie

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson.

Be careful: we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!

You will first need to build the index file. You only need to do this once, and running it in "interactive mode" is fine (it's fast, so you don't need an idev shell). Then, you will need to submit a commands file with four lines to the TACC queue.

Please give the final output files the names: SRR034450.sam, SRR034451.sam, SRR034452.sam, SRR034453.sam.

I just want a little hint

Remember, bowtie-build once then bowtie for each separate sample.

Just give me the answer...

module load bowtie
bowtie-build NC_017544.1.fasta NC_017544.1

Now create a commands file that looks like this:

bowtie -p 3 -S NC_017544.1 SRR034450.fastq SRR034450.sam
bowtie -p 3 -S NC_017544.1 SRR034451.fastq SRR034451.sam
bowtie -p 3 -S NC_017544.1 SRR034452.fastq SRR034452.sam
bowtie -p 3 -S NC_017544.1 SRR034453.fastq SRR034453.sam
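
Rather than typing the four lines by hand, you could generate the commands file with a short loop. This is only a sketch: it assumes the file is called commands (the name passed to launcher_creator.py with -c) and it writes one bowtie line per sample. Nothing is run here except echo.

```shell
# Write one bowtie command per sample into the file "commands".
# Nothing is executed here; the launcher runs these lines later.
samples="SRR034450 SRR034451 SRR034452 SRR034453"

: > commands                      # start with an empty file
for s in $samples; do
    echo "bowtie -p 3 -S NC_017544.1 $s.fastq $s.sam" >> commands
done

cat commands
```

The same loop pattern works for any commands file where only the sample name changes from line to line.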

Create the launcher script and run it:

module load python
launcher_creator.py -n bowtie -q development -c commands -t 0:30:00
qsub launcher.sge

Convert alignments to BAM

Edit your commands file so that you convert all of these files from SAM to sorted and indexed BAM.

Linux expert tip: you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line with &&.

Note the use of the variable $FILE, which means that is the only part of the line that we have to change. This is a mini-use of shell scripting.

FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034451 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034452 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034453 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam

Re-create the launcher script and submit this new job to the queue. Be sure you have samtools loaded as the node that your job launches on will inherit your current environment, including whatever modules you have loaded:

module load samtools
launcher_creator.py -n samtools -e you@somewhere.com
qsub launcher.sge
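
Once the job finishes, it is worth confirming that every sample produced a sorted BAM file and its index before moving on. A minimal sketch, assuming the output names from the commands above (SAMPLE.bam and SAMPLE.bam.bai); check_bams is a hypothetical helper name, not part of samtools or the launcher.

```shell
# check_bams: report any sample whose sorted BAM or index is missing or empty.
# Returns 0 if every file is present and non-empty, 1 otherwise.
check_bams() {
    rc=0
    for s in "$@"; do
        for f in "$s.bam" "$s.bam.bai"; do
            if [ ! -s "$f" ]; then
                echo "missing or empty: $f"
                rc=1
            fi
        done
    done
    return $rc
}

if check_bams SRR034450 SRR034451 SRR034452 SRR034453; then
    echo "all BAM files present"
fi
```

If a file is reported missing, look at the job's output and error files before resubmitting.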

Optional Exercise

Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.

Count reads mapping to genes

bedtools

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and for the specific multicov command that we are going to use:

module load bedtools
bedtools
bedtools multicov

The multicov command takes a feature file (GFF) and counts how many reads from each of several input BAM files fall in each feature. By default it counts how many reads overlap the feature on either strand, but it can be made strand-specific with the -s option. Note: Remember that the chromosome names in your GFF file should match the way the chromosomes are named in the reference FASTA file used in the mapping step. For example, if the FASTA file contains chr1, chrX, etc., the GFF file must also call the chromosomes chr1, chrX, and so on.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.

grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_017544.1.gff > NC_017544.1.genes.gff

What is this doing? It keeps only the lines that start (^) with "NC_017544", followed by any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene". Use head to see the before and after.

head -n 50 NC_017544.1.gff
head -n 50 NC_017544.1.genes.gff
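
To see the filter in isolation, here is a small sketch that runs the same grep pattern on three made-up GFF lines. The file names toy.gff and toy.genes.gff and all of the coordinates and IDs are invented for illustration; only the pattern comes from the command above.

```shell
# Toy GFF lines, invented for illustration (tab-separated, 9 columns).
printf 'NC_017544\tGenBank\tgene\t318\t1673\t.\t+\t.\tID=gene0;locus_tag=TOY_0001\n'  > toy.gff
printf 'NC_017544\tGenBank\tCDS\t318\t1673\t.\t+\t0\tID=cds0;locus_tag=TOY_0001\n'   >> toy.gff
printf 'NC_017544\tGenBank\tgene\t1867\t3012\t.\t+\t.\tID=gene1;locus_tag=TOY_0002\n' >> toy.gff

# Same pattern as above: only the two "gene" rows should survive.
grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' toy.gff > toy.genes.gff

# Show the feature type and attribute columns of what was kept.
cut -f 3,9 toy.genes.gff
```

The CDS line is dropped because its third column does not start with "gene".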

In order to use the bedtools command on our data, submit this commands file to the TACC queue:

bedtools multicov -s -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff
head gene_counts.gff
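
A quick sanity check on the result is to total the reads counted for each sample. This sketch assumes the layout that multicov produces here: the nine GFF columns followed by one count column per BAM file, in the order given to -bams. The toy rows and the file name toy_counts.gff are invented.

```shell
# Toy multicov output: 9 GFF columns followed by 4 count columns (invented values).
printf 'NC_017544\tGenBank\tgene\t318\t1673\t.\t+\t.\tlocus_tag=TOY_0001\t10\t20\t30\t40\n'  > toy_counts.gff
printf 'NC_017544\tGenBank\tgene\t1867\t3012\t.\t+\t.\tlocus_tag=TOY_0002\t1\t2\t3\t4\n'    >> toy_counts.gff

# Sum columns 10-13 (one per BAM file) and print the four totals on one line.
awk -F '\t' '{ for (i = 10; i <= 13; i++) total[i] += $i }
     END { print total[10], total[11], total[12], total[13] }' toy_counts.gff
```

On the real data, replace toy_counts.gff with gene_counts.gff. Totals that differ wildly between samples, or a column of zeros, suggest a problem in an earlier step.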

Optional: HTseq

HTseq is another tool to count reads. bedtools has many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads, and it does little else. HTseq is very slow, and you need to run multiple command lines to do the same job that bedtools multicov did in one. Why learn it, then? You may care about how reads that only partially overlap a gene, or that overlap more than one gene, are counted. Please take a look at this page, and if this more sophisticated counting method looks useful to you, use HTseq. Otherwise, use bedtools.

grep "^NC_017544" NC_017544.1.gff > count_ref.gff
samtools view SRR034450.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count1.gff
samtools view SRR034451.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
samtools view SRR034452.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
samtools view SRR034453.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff
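
Unlike bedtools multicov, this leaves you with one count file per sample. One way to combine them into a single table is sketched below, assuming each htseq-count output has two tab-separated columns (feature ID, count), lists the features in the same order because the same GFF was used, and ends with summary rows such as no_feature (written with a leading "__" by newer versions). The toy files and their contents are invented.

```shell
# Toy htseq-count outputs (invented): feature ID, tab, count; summary row at the end.
printf 'gene0\t10\ngene1\t1\nno_feature\t5\n' > toy1.txt
printf 'gene0\t20\ngene1\t2\nno_feature\t6\n' > toy2.txt
printf 'gene0\t30\ngene1\t3\nno_feature\t7\n' > toy3.txt
printf 'gene0\t40\ngene1\t4\nno_feature\t8\n' > toy4.txt

# Paste the files side by side, keep one ID column plus the four count columns,
# and drop the summary rows (with or without a leading "__").
paste toy1.txt toy2.txt toy3.txt toy4.txt \
  | cut -f 1,2,4,6,8 \
  | grep -v -E '^(__)?(no_feature|ambiguous|too_low_aQual|not_aligned|alignment_not_unique)' \
  > toy_htseq_counts.tab

cat toy_htseq_counts.tab
```

The result has the same shape as the table of IDs and counts used in the analysis steps: one row per gene, one count column per sample.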

Analyze differential gene expression

DESeq

DESeq Manual and Instructions

Our data is cluttered with a lot of extra columns and one column stuffed with tag=value information (including the gene names that we want!). Let's clean it up a bit before loading it into R, which likes to work on simple tables. GFF files are tab-delimited.

We can do this cleanup many ways, but a quick one is to use the Unix stream editor sed. This command replaces the entire beginning of the line up to and including locus_tag= with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl or Python script that would check to be sure that each line had the locus_tag (they do), among other things.

Reformatting gene_counts.gff

head gene_counts.gff
sed 's/^.*locus_tag=//' gene_counts.gff > gene_counts.tab

After it has run, take a peek at the new file:

head gene_counts.tab
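
Here is what the sed substitution does to a single line, as a small sketch. The input line is invented but shaped like the multicov output; note that the result is only clean because locus_tag is the last attribute in that column, which is an assumption about the file.

```shell
# One toy line shaped like bedtools multicov output (invented values).
printf 'NC_017544\tGenBank\tgene\t318\t1673\t.\t+\t.\tID=gene0;locus_tag=TOY_0001\t10\t20\t30\t40\n' > toy_line.gff

# Delete everything from the start of the line through "locus_tag=".
# What remains is the locus tag followed by the four tab-separated counts.
sed 's/^.*locus_tag=//' toy_line.gff
```

If other attributes followed locus_tag on the same line, they would be left in place and the table would need further cleanup.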

Be very careful how you copy and paste from the example below.

Do not copy the > characters. Some commands are spread across multiple lines. The > are missing at the beginning of the lines after the first one in these cases. So this:

> y <- c(
    1:10
  )
> y

Is the same as:

> y <- c(1:10)
> y

It's ok to copy across the multiple lines and paste into R as long as you get all the way to the closing parenthesis.

The commands for this example are also described in the DESeq vignette (PDF).

Using DESeq

login1$ R
...
> library("DESeq")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> head(counts)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> my.design <- data.frame(
  row.names = colnames( counts ),
  condition = c( "wt", "mut", "wt", "mut"),
  libType = c( "single-end", "single-end", "single-end", "single-end" )
)
> conds <- factor(my.design$condition)

> cds <- newCountDataSet( counts, conds )
> cds

> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )

> cds <- estimateDispersions( cds )

> pdf("DESeq-dispersion_estimates.pdf")
> plot(
  rowMeans( counts( cds, normalized=TRUE ) ),
  fitInfo(cds)$perGeneDispEsts,
  pch = '.', log="xy"
  )
> xg <- 10^seq( -.5, 5, length.out=300 )
> lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
> dev.off()

> result <- nbinomTest( cds, "wt", "mut" )
> head(result)

> result = result[order(result$pval), ]
> head(result)

> write.csv(result, "DESeq-wt-vs-mut.csv")

> pdf("DESeq-MA-plot.pdf")
> plot(
  result$baseMean,
  result$log2FoldChange,
  log="x", pch=20, cex=.3,
  col = ifelse( result$padj < .1, "red", "black" ) )
> dev.off()

> q()
Save workspace image? [y/n/c]: n
login1$ head DESeq-wt-vs-mut.csv

DESeq-wt-vs-mut.csv is a comma-delimited file that could be reloaded into R or viewed in Excel.

You should copy the two *.pdf files that were created back to your local computer to view them.

Exercises

What are the numbers returned by sizeFactors( cds )?
Answer...

They are, roughly speaking, the relative average coverage of each data set. Specifically, DESeq takes, for each gene, the ratio of its count in one data set to its geometric mean count across all data sets. The size factor of a data set is the median of these ratios over all genes. Dividing the counts by the size factors puts the data sets on a common scale.

What are the dispersion estimates?
Answer...

The model describes the counts for each gene with a negative binomial distribution (= overdispersed Poisson distribution). The dispersion is the per-gene parameter that captures how much more the counts vary between replicates than a Poisson distribution would predict. In this model, the lower the counts are, the more dispersion relative to the mean is expected (red line in graph). Thus, a larger fold change is required for a lowly expressed gene to be called significant.

What was the predominant effect of the mutation on gene expression in this Listeria strain?

Optional: edgeR

edgeR is another R package that you can use to do a similar analysis.

edgeR Manual and Instructions

These commands use the negative binomial model, calculate the false discovery rate (FDR ~ adjusted p-value), and make a plot similar to the one from DESeq.

Using edgeR

login1$ R
...
> library("edgeR")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> group <- factor(c("wt","mut","wt","mut"))
> dge = DGEList(counts=counts,group=group)
> dge <- estimateCommonDisp(dge)
> dge <- estimateTagwiseDisp(dge)
> et <- exactTest(dge)
> etp <- topTags(et, n=100000)
> etp$table$logFC = -etp$table$logFC

> pdf("edgeR-MA-plot.pdf")
> plot(
  etp$table$logCPM,
  etp$table$logFC,
  xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
  col = ifelse( etp$table$FDR < .1, "red", "black" ) )
> dev.off()

> write.csv(etp$table, "edgeR-wt-vs-mut.csv")

Note that the log fold change ("logFC") that edgeR calculates here is initially the reverse of the one in the DESeq example: it is wt relative to mut. To make the two outputs comparable, we negate the log fold change.

Exercises

Compare the expression changes predicted by DESeq and edgeR to each other.
Does edgeR or DESeq predict more significant changes?
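
One way to start on this comparison from the bash shell is to pull the significant genes out of each CSV file and compare the two lists. This is only a sketch on invented data: it assumes the column layout that write.csv gives for these result tables (gene ID in column 2 and padj in column 9 for DESeq; gene ID in column 1 and FDR in column 5 for edgeR), so check the header lines of your own files first. sig_genes is a hypothetical helper name, and 0.1 is the same cutoff used to color the MA plots.

```shell
# Toy CSVs shaped like the write.csv output (all values invented).
cat > toy_deseq.csv <<'EOF'
"","id","baseMean","baseMeanA","baseMeanB","foldChange","log2FoldChange","pval","padj"
"1","geneA",100,150,50,0.33,-1.58,0.0001,0.001
"2","geneB",80,80,80,1,0,0.9,1
"3","geneC",5,9,1,0.11,-3.17,0.01,0.05
"4","geneD",0,0,0,NA,NA,NA,NA
EOF
cat > toy_edger.csv <<'EOF'
"","logFC","logCPM","PValue","FDR"
"geneA",-1.6,8.1,0.0002,0.002
"geneB",0.1,7.9,0.8,0.9
"geneC",-3.0,3.2,0.2,0.4
EOF

# sig_genes FILE ID_COLUMN P_COLUMN: print sorted IDs with adjusted p-value < 0.1.
# Skips the header line and rows where the p-value is NA.
sig_genes() {
    awk -F ',' -v id="$2" -v p="$3" \
        'NR > 1 && $p != "NA" && $p + 0 < 0.1 { gsub(/"/, "", $id); print $id }' "$1" | sort
}

sig_genes toy_deseq.csv 2 9 > deseq_sig.txt
sig_genes toy_edger.csv 1 5 > edger_sig.txt

echo "called by both:"; comm -12 deseq_sig.txt edger_sig.txt
echo "DESeq only:";     comm -23 deseq_sig.txt edger_sig.txt
echo "edgeR only:";     comm -13 deseq_sig.txt edger_sig.txt
```

With the real files, use DESeq-wt-vs-mut.csv and edgeR-wt-vs-mut.csv in place of the toy CSVs; wc -l on the two list files then answers which method calls more genes significant at this cutoff.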

Additional Points

In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit, FAR, or cutadapt before aligning.
You can get a lot more information from RNAseq data than you could from a microarray experiment. You can map transcriptional start sites, areas of unexpected transcription, splice sites, etc. - all because you have full sequence information that we have barely used in this example.

From here...

Visualize mapped reads in BAM files using IGV to manually check some of the gene counts.
Look at the more sophisticated "Tuxedo" suite of RNAseq tools, which performs many functions that are especially useful in eukaryotic genomes.
