Exercise #2: CLIP-Seq

Objectives

In this exercise, you will use a recently developed R package called RIPSeeker to identify binding regions in a small CLIP-seq dataset. This dataset has been MASSIVELY down-sampled, since otherwise staging and processing the data would be difficult. As a result, this dataset should produce only ONE peak in the entire human genome. Let's try to find it!

Primary Data

This data is derived from a human (HeLa) Ago2 CLIP-seq experiment (Ameyar-zazoua M et al., Nat Struct Mol Biol. 2012;19(10):998-1004), and since you already copied it over at the start of the last exercise, it should be at the path:

ls $SCRATCH/ripseq_exercises/CLIP

When you look in the "CLIP" subdirectory, you should see these files:

input.chr19.bam
ip.chr19.bam
results
ripseeker.cmds
ripseeker_script.R

Results is a directory containing pre-made, well, results. The two BAM files are our test datasets, and they are a sub-sample of reads from only chr19. The ripseeker.cmds and ripseeker_script.R files are both commands files, which we will examine later. Before that, we need to make some adjustments so we can install some new libraries. First, execute these commands:

cd $SCRATCH/ripseq_exercises/CLIP
cp /scratch/02423/nsabell/RIPSeeker_1.4.0.tar.gz .
cp /scratch/02423/nsabell/ripseeker_script.R .
mkdir lib
chmod 777 lib

RIPSeeker

RIPSeeker is a Bioconductor R package, and we are going to run the job using the "Rscript" utility using the script and commands file that exist in the directory you just staged. Even on TINY files like the ones we are using, it will take a while to run, so we're going to kick it off now, and then look at what we're doing while it's processing.

The actual R commands are in a script called ripseeker_script.R, which is then called by a commands file using Rscript, which is a utility to run R scripts from the command line (or, in this case, a batch job) when it can't be interactive. So, to execute the script and its files, run these commands and use qstat to make sure the job is running.

cd $SCRATCH/ripseq_exercises/CLIP/
$BI/bin/launcher_creator.py -n ripseeker -t 01:00:00 -j ripseeker.cmds -q development -a CCBB
qsub ripseeker.sge
qstat

The output (when it is done) should deposit two files in the directory you submitted the job from. Those files are:

RIPregions.gff3
seekOutput.RData

The important file is RIPregions.gff3, which is a coordinate file of the regions predicted to be enriched in the IP. Because of the extremely slimmed-down dataset we used, there is only one peak, but this pipeline (with more reads that are actually real) would produce a file that is more interesting. The seekOutput.Rdata file is HUGE, but it can be loaded directly back into R. This is useful if you want to take the coordinate intervals in the GFF file (for example) and do something else with them.

Now, let's take a closer look at what we just did. Check out the "ripseeker_script.R" file by using cat or less:

install.packages(lib="./lib", pkgs="./RIPSeeker_1.4.0.tar.gz", repos=NULL)
library(RIPSeeker, lib.loc="./lib")
bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")

cNAME="input"
binSize=200
genomeBuild = "hg19"
outDir <- file.path(getwd())
reverseComplement = FALSE
uniqueHit = FALSE
assignMultihits = FALSE
rerunWithDisambiguatedMultihits = FALSE

seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

Let's go through this section by section. First, we need to get the RIPSeeker packages from Bioconductor and load them into R.

> install.packages(lib="./lib", pkgs="./RIPSeeker_1.4.0.tar.gz", repos=NULL)
> library(RIPSeeker, lib.loc="./lib")

Then, we need to read in the BAM files we're going to use with their paths and use the RIPSeeker (or rather GenomicRanges) function "combineAlignGals" to generate a "gapped alignment" object in R using hg19 genomic and contig coordinates.

> bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
> ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
> ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")

Now, we're ready to execute the primary RIPSeeker function, appropriately called ripSeek(). Since it takes so many options, we're going to declare a bunch of variables before executing the command, and then have the command refer back to those already-declared variables. This is what it would look like:

> binSize=200  #Set the bin size, or have it be found automatically (between 200 and 1200 bp)
> cNAME="input"  #Which files are input
> genomeBuild = "hg19"   #Genome to use
> outDir <- file.path(getwd())  #Output file prefix (here, the current directory)
> reverseComplement = FALSE   #Do not reverse complement BAM entries
> uniqueHit = FALSE   #If TRUE, use only unique hits to train initial model
> assignMultihits = FALSE  #If TRUE and uniqueHit is TRUE, assign multi-hits to regions based on initial model
> rerunWithDisambiguatedMultihits = FALSE   #If TRUE and prior two options are TRUE, re-train model after multi-hit assignment
 
> seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

So, now that you've generated all this data, go ahead and look at what the single peak is!

Click here to expand...

cat RIPregions.gff3
chr19	rtracklayer	sequence_feature	47545201	47545400	.	*	.	count=829; fpk=4145; logOddScore=65.8422503917559; pval=0; pvalAdj=0; CTL_count=1491; CTL_fpk=7455; CTL_logOddScore=-61.5676075214777; eFDR=0

Go here to look at it in the genome browser - see anything interesting?

We can also open R interactively to look at the contents of the big "seekOutput.RData" file we made. We need to use the R set up by the BioITeam, rather than the one in the module system, so we can use a version 3.x. These are a possible sequence of commands:

cd $SCRATCH/ripseq_exercises/CLIP/
$BI/bin/R
 
#Now we are in R
 
library(RIPSeeker, lib.loc="./lib")
rip_data = load("seekOutput.RData")
rip
# [1] "mainSeekOutputRIP" "RIPGRList"         "mainSeekOutputCTL"
# [4] "annotatedRIPGR"

Here, "rip_data" is a list of four objects that we just loaded into R, and each contains different sub-objects and information.

Object	Component Objects	Information Content
mainSeekOutputRIP	nbhGRList alignGal alignGalFiltered	Information such as the HMM posterior probabilities and filtered alignment objects for the RIP sample
RIPGRList	a GRangesList	A list (in GRanges object form) of all peak regions called by RIPSeeker
mainSeekOutputCTL	nbhGRList alignGal alignGalFiltered	Information such as the HMM posterior probabilities and filtered alignment objects for the CTL sample
annotatedRIPGR	sigGRangesAnnotated enrichedGO	Annotations for each of the regions flagged in RIPGRList, as well as calling of significantly enriched GO terms for those regions; this only works if "ChIPpeakAnno" is loaded and you set the functions biomaRt, biomaRt_dataset, and goAnno (see below)

Alternatives

We could have added these lines to the ripseeker_script.R file:

> biocLite("ChIPpeakAnno")
> library(ChIPpeakAnno)
 
> biomart="ensembl"   #Source of genome annotations
> biomaRt_dataset="hsapiens_gene_ensembl"   #Specific dataset from biomaRt source
> goAnno="org.Hs.eg.db"   #Annotation database

Which would automatically annotate the resulting peak coordinates with their gene names. We didn't do this because the package that is used, ChIPpeakAnno, has quite a few dependencies and takes a while to download (maybe ~15 minutes) and run.

Space shortcuts

Page tree

Objectives

Primary Data

RIPSeeker

Alternatives