Analyzing RNA-Seq data for differential gene expression

In this exercise, we will analyze RNA-seq data for gene expression levels in a wild-type and mutant strain of Listeria.

Download data files

The data files for this example are in the path:

/corral-repl/utexas/BioITeam/ngs_course/listeria_RNA_seq/data

File Name	Description	Sample
`SRR034450.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 1
`SRR034451.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 1
`SRR034452.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 2
`SRR034453.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 2
`NC_017544.1.gbk`	Reference Genome	Listeria monocytogenes strain 10403S

This data is from this paper:

Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. Pubmed

You can view it in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753

Install software

Bowtie

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers).

module load bowtie

Bioconductor modules for R statistics package

Download and install R

login1$ module load R
login1$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("http://bioconductor.org/biocLite.R")
> biocLite("DESeq")
...
> biocLite("edgeR")
...

When you load R subsequent times, you can load these libraries with just these commands:

library("DESeq")
library("edgeR")

library(edgeR)
library(DESEQ)

Commands

Remove adaptor sequences from reads

You will need a FASTA file of adaptor sequences.

[https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavors Illumina Adapters Sequences (GSAF)] | [%ATTACHURL%/gsaf_illumina_adapters.fasta Download FASTA]

For each input file you will need to run this command (single-end data): %BR%
<code>$far --source datasetX.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger</code>

There is an option to process paired-end data like this: %BR%
<code>$far --source datasetX_R1.fastq --source2 datasetX_R2.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger</code>

---+++ Align reads to reference genome

---++++ Using bowtie2

First, index your genome so bowtie2 can map read to it: %BR%
<code>$bowtie2-build REL606.fna REL606</code> %BR%

Then, align each data set: %BR%
<code>$bowtie2 -x REL606 -U datasetX.fastq --phred33 -S REL606.sam</code> %BR%

Optionally, add the <code>--local</code> flag if your reads do not map end-to-end.

---++++ Using BWA
First, index your genome so BWA can map read to it: %BR%
<code>$bwa index REL606.fna</code> %BR%

Then, align each data set: %BR%
<code>$bwa aln REL606.fna dataset1.fastq > datasetX.sai </code> %BR%

And convert to SAM format (assumes single-end data):
<code>$bwa samse REL606.fna datasetX.sai datasetX.fastq > datasetX.sam </code> %BR%

---++ Count reads mapping to genes

<code>breseq RNASEQ -f REL606.fna -r REL606.gbk -o datasetX.count.tab datasetX.sam</code> %BR%

---++ Convert alignments to BAM

And convert to BAM format (assumes single-end data):

login1$ samtools faidx REL606.fna
login1$ samtools import REL606.fna datasetX.sam datasetX.unsorted.bam
login1$ samtools sort datasetX.unsorted.bam datasetX
login1$ samtools index datasetX.bam

Exercise: Is this a strand-specific RNA-seq library? Try using IGV to view some of the data.

hg2. Analyze differential gene expression

Use bedtools to count reads in features.
Converting mapped reads to feature counts.

hg3. Using DESeq

[http://bioconductor.org/packages/devel/bioc/html/DESeq.html Manual and Instructions]

DESeq installation

login1$ R
...
>source("http://bioconductor.org/biocLite.R")
>biocLite("DESeq")

Using DESeq

login1$ R
...
> library("DESeq")
> combined = read.csv("combined.csv", header=T, row.names=1)
> design <- data.frame(
  row.names = colnames( combined ),
  condition = c( "Anc", "Anc", "EL", "EL", "EW", "EW"),
  libType = c( "single-end", "single-end", "single-end",
  "single-end", "single-end", "single-end" ) )
> design
> conds <- factor(design$condition)
> cds <- newCountDataSet( combined, conds )
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
> res <- nbinomTest( cds, "EL", "EW" )
> write.csv(res, "EL-vs-EW.csv")

Space shortcuts

Page tree

Analyzing RNA-Seq data for differential gene expression

Download data files

Install software

Bowtie

Bioconductor modules for R statistics package

Commands

Remove adaptor sequences from reads

Space shortcuts

Page tree

Differential gene expression analysis

Analyzing RNA-Seq data for differential gene expression

Download data files

Install software

Bowtie

Bioconductor modules for R statistics package

Commands

Remove adaptor sequences from reads