Analyzing RNA-Seq data for differential gene expression
In this exercise, we will analyze RNA-seq data for gene expression levels in a wild-type and mutant strain of Listeria.
Download data files
The data files for this example are in the path:
/corral-repl/utexas/BioITeam/ngs_course/listeria_RNA_seq/data
File Name |
Description |
Sample |
---|---|---|
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
ΔsigB mutant, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 2 |
|
Single-end Illumina 36-bp reads |
ΔsigB mutant, biological replicate 2 |
|
Reference Genome |
Listeria monocytogenes strain 10403S |
This data is from this paper:
Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. Pubmed
You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753
Install software
Bowtie
For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers).
module load bowtie
Bioconductor modules for R statistics package
login1$ module load R login1$ R R version 2.14.0 (2011-10-31) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > source("http://bioconductor.org/biocLite.R") > biocLite("DESeq") ... > biocLite("edgeR") ...
When you load R subsequent times, you can load these libraries with just these commands:
library("DESeq") library("edgeR")
- library(edgeR)
- library(DESEQ)
Commands
Remove adaptor sequences from reads
You will need a FASTA file of adaptor sequences.
- [https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavorsIllumina Adapters Sequences (GSAF)] | [%ATTACHURL%/gsaf_illumina_adapters.fastaDownload FASTA]
For each input file you will need to run this command (single-end data): %BR%
<code>$far --source datasetX.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger</code>
There is an option to process paired-end data like this: %BR%
<code>$far --source datasetX_R1.fastq --source2 datasetX_R2.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger</code>
---+++ Align reads to reference genome
---++++ Using bowtie2
First, index your genome so bowtie2 can map read to it: %BR%
<code>$bowtie2-build REL606.fna REL606</code> %BR%
Then, align each data set: %BR%
<code>$bowtie2 -x REL606 -U datasetX.fastq --phred33 -S REL606.sam</code> %BR%
Optionally, add the <code>--local</code> flag if your reads do not map end-to-end.
---++++ Using BWA
First, index your genome so BWA can map read to it: %BR%
<code>$bwa index REL606.fna</code> %BR%
Then, align each data set: %BR%
<code>$bwa aln REL606.fna dataset1.fastq > datasetX.sai </code> %BR%
And convert to SAM format (assumes single-end data):
<code>$bwa samse REL606.fna datasetX.sai datasetX.fastq > datasetX.sam </code> %BR%
---++ Count reads mapping to genes
<code>breseq RNASEQ -f REL606.fna -r REL606.gbk -o datasetX.count.tab datasetX.sam</code> %BR%
---++ Convert alignments to BAM
And convert to BAM format (assumes single-end data):
login1$ samtools faidx REL606.fna login1$ samtools import REL606.fna datasetX.sam datasetX.unsorted.bam login1$ samtools sort datasetX.unsorted.bam datasetX login1$ samtools index datasetX.bam
Exercise: Is this a strand-specific RNA-seq library? Try using IGV to view some of the data.
hg2. Analyze differential gene expression
- Use bedtools to count reads in features.
- Converting mapped reads to feature counts.
hg3. Using DESeq
login1$ R ... >source("http://bioconductor.org/biocLite.R") >biocLite("DESeq")
login1$ R ... > library("DESeq") > combined = read.csv("combined.csv", header=T, row.names=1) > design <- data.frame( row.names = colnames( combined ), condition = c( "Anc", "Anc", "EL", "EL", "EW", "EW"), libType = c( "single-end", "single-end", "single-end", "single-end", "single-end", "single-end" ) ) > design > conds <- factor(design$condition) > cds <- newCountDataSet( combined, conds ) > cds <- estimateSizeFactors( cds ) > sizeFactors( cds ) > res <- nbinomTest( cds, "EL", "EW" ) > write.csv(res, "EL-vs-EW.csv")