DNAseq Variant Calling Pipeline

Identification and annotation of SNPs and/or somatic mutations relative to a reference genome. 10-hour minimum per project ($730 internal, $930 external).

1. Quality Assessment

Data quality assessed with FastQC and SAMStat; quality assessment results are evaluated before any downstream analysis.

Deliverables:
- reports generated by FastQC and SAMStat
- hybrid selection metrics calculated with Picard, also available

Tools Used:
- FastQC: (Andrews 2010) used to generate quality summaries of data:
  - Per base sequence quality report: useful for deciding whether trimming is necessary.
  - Sequence duplication levels: evaluation of library complexity.
  - Overrepresented sequences: evaluation of adapter contamination.
- SAMStat: (Lassmann et al. 2011) provides summary statistics at both fastq and SAM/BAM alignment levels.
- Picard CalculateHsMetrics: (http://broadinstitute.github.io/picard) evaluates hybrid selection protocols (target coverage and AT/GC dropout levels).
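
As a rough illustration of what the per base sequence quality report summarizes, the Python sketch below computes the mean Phred score at each read position. It assumes Phred+33 encoding and four-line fastq records; the function name is hypothetical and this is not FastQC's own implementation.

```python
import io
from collections import defaultdict

def mean_quality_per_position(fastq_handle, offset=33):
    """Return a list of mean Phred scores, one per read position."""
    totals = defaultdict(int)   # position -> sum of Phred scores
    counts = defaultdict(int)   # position -> number of reads covering it
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 3:   # the quality string is the 4th line of each record
            continue
        for pos, char in enumerate(line.strip()):
            totals[pos] += ord(char) - offset   # assumed Phred+33 encoding
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]

# Two toy reads of different lengths (illustrative input only).
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACG\n+\nI5!\n"
)
print(mean_quality_per_position(example))  # [40.0, 30.0, 20.0, 40.0]
```

A steady drop in the mean toward the 3' end of reads is the usual signal that trimming may be worthwhile.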

2. Mapping

Mapping to genome reference using BWA-mem (alternative algorithms available on request).

Deliverables:
- bam files from the initial alignment (BWA-mem by default; other algorithms available if desired)
- bam files resulting from further processing using GATK

Tools Used:
- BWA-mem: (Li 2013) primary aligner used to generate first pass read alignments (BWA-aln and BWA-sampe also available if desired, as are bowtie/bowtie2).
- GATK: (McKenna et al. 2010, Van der Auwera et al. 2013) IndelRealigner and BaseRecalibrator applied to correct indel-based misalignments and increase accuracy/dispersion of individual base quality scores.
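
The first-pass alignment step might be scripted along the following lines. This is a minimal sketch, not the pipeline's actual code: the function name, file names, sample name, and thread count are hypothetical, and the read group is included on the assumption that downstream GATK steps need one.

```python
import shlex

def bwa_mem_command(reference, reads_1, reads_2, sample, threads=4):
    """Build a `bwa mem` argument list for one paired-end sample."""
    # bwa expects a literal backslash-t between read group fields and
    # converts it to a tab when writing the SAM header.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        "bwa", "mem",
        "-t", str(threads),   # number of threads
        "-R", read_group,     # read group header line
        reference, reads_1, reads_2,
    ]

# Hypothetical file names; bwa mem writes SAM to standard output.
cmd = bwa_mem_command("ref.fa", "S1_R1.fastq.gz", "S1_R2.fastq.gz", "S1")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run` with standard output redirected to a file, or piped into a sorting step to produce the bam deliverable.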

3a. Variant Calling Option 1: GATK

Genome Analysis Toolkit (GATK) used to call SNPs and indels according to best practices recommended by the Broad Institute.

Deliverables:
- individual sample vcf files output by HaplotypeCaller
- merged vcf file, jointly regenotyped by GenotypeGVCFs and recalibrated by VariantRecalibrator

Tools Used (GATK):
- HaplotypeCaller: reassembles "active regions" and applies PairHMM algorithm to select most likely genotype
- GenotypeGVCFs: jointly re-genotypes, re-annotates and merges individual sample gVCFs from HaplotypeCaller into single aggregated vcf file
- VariantRecalibrator: recalibrates variant call probabilities based on call annotations
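
The per-sample calling and joint genotyping steps could be driven roughly as sketched below. The sketch assumes a GATK 3.x-style invocation (consistent with the use of IndelRealigner above); newer GATK releases use a different command syntax. The jar path, file names, and function names are hypothetical.

```python
def haplotype_caller_command(jar, reference, bam, gvcf):
    """Per-sample calling in reference-confidence (gVCF) mode."""
    return [
        "java", "-jar", jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", gvcf,
    ]

def genotype_gvcfs_command(jar, reference, gvcfs, merged_vcf):
    """Joint genotyping of all per-sample gVCFs into one vcf."""
    cmd = ["java", "-jar", jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]   # one --variant argument per sample
    return cmd + ["-o", merged_vcf]

# Hypothetical sample names and paths, for illustration only.
samples = ["S1", "S2"]
jar = "GenomeAnalysisTK.jar"
per_sample = [
    haplotype_caller_command(jar, "ref.fa", f"{s}.recal.bam", f"{s}.g.vcf")
    for s in samples
]
joint = genotype_gvcfs_command(jar, "ref.fa", [f"{s}.g.vcf" for s in samples], "merged.vcf")
print(" ".join(joint))
```

Calling each sample separately and genotyping jointly afterwards means new samples can be added later without repeating the per-sample step.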

3b. Variant Calling Option 2: Somatic Mutation Identification

MuTect and MutSig from the Broad Institute are available for calling somatic mutations; other methods may be available upon request.

Deliverables:
- MuTect and MutSig output files.

Tools Used:
- MuTect: (Cibulskis et al. 2013) identifies somatic point mutations based on two Bayesian classifiers:
  - LOD for observed tumor data given mutant site compared to observed tumor data given reference site,
  - LOD for observed normal data given reference site compared to observed normal data given mutant site.
- MutSig: (Lawrence et al. 2013) assesses significance of mutation calls using null model based on background mutation processes.
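
The two LOD scores used by MuTect can be illustrated with a short Python sketch. It follows the per-base likelihood model described in the MuTect paper as we understand it, with the tumor allele fraction estimated as the fraction of reads supporting the mutant allele. The function names and toy pileups are hypothetical, base error rates are assumed to be greater than zero, and MuTect's additional filters are not reproduced here.

```python
import math

def base_likelihood(base, error, ref, alt, f):
    """P(observed base | error rate, ref allele, alt allele, allele fraction f)."""
    if base == ref:
        return f * error / 3 + (1 - f) * (1 - error)
    if base == alt:
        return f * (1 - error) + (1 - f) * error / 3
    return error / 3

def log10_likelihood(pileup, ref, alt, f):
    """Sum of log10 likelihoods over (base, error) pairs at one site."""
    return sum(math.log10(base_likelihood(b, e, ref, alt, f)) for b, e in pileup)

def tumor_lod(pileup, ref, alt):
    """Tumor data: mutant model (estimated f) versus reference model (f = 0)."""
    f = sum(1 for b, _ in pileup if b == alt) / len(pileup)
    return log10_likelihood(pileup, ref, alt, f) - log10_likelihood(pileup, ref, alt, 0.0)

def normal_lod(pileup, ref, alt):
    """Normal data: reference model (f = 0) versus heterozygous model (f = 0.5)."""
    return log10_likelihood(pileup, ref, alt, 0.0) - log10_likelihood(pileup, ref, alt, 0.5)

# Toy pileups, illustration only: every base has error rate 0.001 (Q30).
tumor = [("A", 0.001)] * 20 + [("G", 0.001)] * 10
normal = [("A", 0.001)] * 30
print(round(tumor_lod(tumor, "A", "G"), 2), round(normal_lod(normal, "A", "G"), 2))
```

A site is a somatic candidate when both scores are high: the tumor reads favor the mutant model and the normal reads favor the reference model. The paper reports thresholds of 6.3 and 2.2 respectively; treat those values as an assumption to check against the MuTect version in use.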

4. Annotation

Further annotation of variant calls may be provided using ANNOVAR.

Deliverables:
- ANNOVAR output in tabular format (plain text, csv, or Excel, as desired).

Tools Used:
- ANNOVAR: (Wang et al. 2010) provides functional annotation of genetic variation encompassing multiple modalities (e.g., gene and region annotation and/or filtration based on established data sets).
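
Tabular ANNOVAR output can be filtered and converted to csv with a few lines of Python. The sketch below assumes tab-delimited input with a `Func.refGene` column, as produced by refGene-based gene annotation; the function names are hypothetical and the rows are made up for illustration.

```python
import csv
import io

def exonic_rows(tsv_handle, func_column="Func.refGene"):
    """Return rows whose functional class includes 'exonic'."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    # A variant can carry several classes separated by ';'.
    return [row for row in reader if "exonic" in row[func_column].split(";")]

def write_csv(rows, out_handle):
    """Write the selected rows as comma-separated text with a header."""
    if not rows:
        return
    writer = csv.DictWriter(out_handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Made-up rows for illustration only; not real annotation results.
toy_table = io.StringIO(
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\n"
    "chr1\t100\t100\tA\tG\texonic\tGENE_A\n"
    "chr1\t200\t200\tC\tT\tintronic\tGENE_A\n"
    "chr2\t300\t300\tG\tA\texonic;splicing\tGENE_B\n"
)
kept = exonic_rows(toy_table)
out = io.StringIO()
write_csv(kept, out)
print(out.getvalue())
```

In practice the handles would be opened files, for example `open("annotated.txt", newline="")` for input and `open("exonic.csv", "w", newline="")` for output (both file names hypothetical).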
