ChIP-Seq Peak Calling Pipeline

This pipeline identifies regions of significant protein binding ("peaks") relative to a reference genome. Minimum charge per project: 12 hours ($876 internal, $1116 external).

1. Quality Assessment

Data quality is assessed with FastQC, and the reports are aggregated with MultiQC. The results are evaluated before downstream analysis proceeds.

Deliverables:
- Reports generated by FastQC and MultiQC.
Tools Used:
- FastQC: (Andrews 2010) used to generate quality summaries of the data:
  - Per base sequence quality report: useful for deciding whether trimming is necessary.
  - Sequence duplication levels: evaluation of library complexity.
  - Overrepresented sequences: evaluation of adapter contamination.
- MultiQC: (https://multiqc.info/) used to aggregate FastQC, alignment, and other reports.
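
As a rough illustration of this step, the sketch below builds FastQC and MultiQC command lines in Python. The function name `qc_commands`, the `qc` output directory, and the example file names are hypothetical; the options the pipeline actually uses are not documented here.

```python
from pathlib import Path


def qc_commands(fastq_files, out_dir="qc"):
    """Return FastQC and MultiQC command lines as argument lists.

    Assumption: all reports are written to a single directory (out_dir).
    """
    out = str(Path(out_dir))
    # One FastQC call covering every input file; -o sets the report directory.
    fastqc_cmd = ["fastqc", "-o", out, *[str(f) for f in fastq_files]]
    # MultiQC scans the directory for reports it recognises and writes
    # one aggregated summary into the same directory.
    multiqc_cmd = ["multiqc", out, "-o", out]
    return [fastqc_cmd, multiqc_cmd]


if __name__ == "__main__":
    for cmd in qc_commands(["chip_rep1.fastq.gz", "input_rep1.fastq.gz"]):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. FastQC expects the output directory to exist already, so it would need to be created first.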

2. Mapping

Reads are mapped to the reference genome using BWA.

Deliverables:
- Mapping results as BAM files, plus mapping statistics.
Tools Used:
- BWA: (Li 2013) primary aligner used to generate read alignments.
- Samtools: (Li 2009) used to prepare BAMs and generate mapping statistics.
- In-house statistics generation scripts
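
A minimal sketch of how the alignment and statistics steps could be chained from Python is shown below. It assumes the `bwa mem` algorithm (the text only says "BWA"), coordinate sorting with `samtools sort`, and `samtools flagstat` for statistics; the function names, thread count, and file paths are hypothetical, and the in-house scripts are not represented.

```python
import subprocess


def mapping_commands(reference, reads, bam, threads=4):
    """Build the alignment, sort, index, and statistics command lines."""
    # Assumption: bwa mem; `reads` holds one (single-end) or two (paired-end) FASTQ paths.
    bwa = ["bwa", "mem", "-t", str(threads), reference, *reads]
    # "-" tells samtools sort to read the alignments from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", bam, "-"]
    index = ["samtools", "index", bam]
    flagstat = ["samtools", "flagstat", bam]
    return bwa, sort, index, flagstat


def run_mapping(reference, reads, bam, threads=4):
    """Run the commands and return the flagstat report as text."""
    bwa, sort, index, flagstat = mapping_commands(reference, reads, bam, threads)
    # Stream SAM output from bwa straight into samtools sort,
    # so no intermediate SAM file is written.
    with subprocess.Popen(bwa, stdout=subprocess.PIPE) as aligner:
        subprocess.run(sort, stdin=aligner.stdout, check=True)
    if aligner.returncode != 0:
        raise RuntimeError("bwa mem exited with an error")
    subprocess.run(index, check=True)
    stats = subprocess.run(flagstat, check=True, capture_output=True, text=True)
    return stats.stdout
```

For example, `run_mapping("genome.fa", ["chip_R1.fastq.gz"], "chip.sorted.bam")` would produce a sorted, indexed BAM and return the mapping summary, provided the reference has already been indexed with `bwa index`.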

3. Peak Calling

Normalized ChIP-seq read counts are compared against a background control (input or mock ChIP) to identify regions of binding enrichment.

Deliverables:
- Peak calls as narrowPeak (BED6+4) files, containing p-value, q-value, and fold enrichment scores for each peak.
- Per-base normalized signal files as bigWigs.
Tools Used:
- MACS2: (Zhang 2008) used to identify and score peak regions.
- bedtools: (Quinlan 2010) used for optional blacklist filtering.
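
To show what the narrowPeak deliverable contains, here is a small reader sketched in Python. It relies on the standard narrowPeak column layout, in which MACS2 writes fold enrichment in column 7, -log10(p-value) in column 8, -log10(q-value) in column 9, and the summit offset from the peak start in column 10. The `Peak` type and `parse_narrowpeak` function are illustrative names, not part of the pipeline.

```python
from typing import Iterable, Iterator, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    fold_enrichment: float  # column 7 (signalValue)
    neg_log10_p: float      # column 8
    neg_log10_q: float      # column 9
    summit_offset: int      # column 10, relative to start

    @property
    def q_value(self) -> float:
        """Convert the stored -log10(q) back to a q-value."""
        return 10 ** -self.neg_log10_q


def parse_narrowpeak(lines: Iterable[str]) -> Iterator[Peak]:
    """Yield one Peak per data line, skipping blank and header lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        f = line.rstrip("\n").split("\t")
        yield Peak(f[0], int(f[1]), int(f[2]), f[3],
                   float(f[6]), float(f[7]), float(f[8]), int(f[9]))
```

A file can be read with `with open(path) as fh: peaks = list(parse_narrowpeak(fh))`, after which each peak's scores are available by name.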

4. Significance Threshold Analysis

Statistical analysis and informed heuristics are used to determine appropriate significance threshold(s) for selecting peaks for downstream analysis.

Deliverables:
- Summary file outlining peak counts at selected levels (high, medium, and low stringency).
- Master file containing peak counts over a wide range of q-values and fold enrichment values.
- Plots of peak count versus q-value and fold enrichment.
Tools Used:
- R and in-house scripts used to produce peak count statistics and plots.
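
The peak count table described above can be approximated with a few lines of code. The pipeline itself uses R and in-house scripts; the Python sketch below is only an illustration, and the function name `count_peaks` and the cutoff values in the usage note are hypothetical. It assumes each peak is given as a (-log10 q-value, fold enrichment) pair, as stored in narrowPeak files.

```python
import math


def count_peaks(peaks, q_cutoffs, fold_cutoffs):
    """Count peaks passing every (q-value, fold enrichment) cutoff pair.

    peaks: list of (neg_log10_q, fold_enrichment) tuples.
    Returns a dict mapping (q_cutoff, fold_cutoff) to a peak count.
    """
    table = {}
    for q in q_cutoffs:
        # A peak passes when q-value <= q, i.e. -log10(q-value) >= -log10(q).
        min_score = -math.log10(q)
        for fold in fold_cutoffs:
            table[(q, fold)] = sum(
                1 for score, fe in peaks if score >= min_score and fe >= fold
            )
    return table
```

Calling `count_peaks(peaks, [0.05, 0.01, 1e-4], [2, 5, 10])` would give a small grid from which stringency levels could be picked; how the high, medium, and low levels are actually chosen is not specified here.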

5. Downstream Regulation Analysis (optional)

Identification of genes potentially regulated by transcription factor (TF) binding. Tools and deliverables are determined in consultation with the customer.
