
Identification and annotation of SNPs and/or somatic mutations relative to a reference genome. 10-hour minimum ($470 internal, $600 external) per project.

1. Quality Assessment

Data quality is assessed with FastQC and SAMStat; the results are evaluated before any downstream analysis.

  • Deliverables:
    • reports generated by FastQC and SAMStat
    • hybrid selection metrics calculated with Picard (also available)
  • Tools Used:
    • FastQC: (Andrews 2010) used to generate quality summaries of data:
      • Per base sequence quality report: useful for deciding whether trimming is necessary.
      • Sequence duplication levels: evaluation of library complexity.
      • Overrepresented sequences: evaluation of adapter contamination.
    • SAMStat: (Lassmann et al. 2011) provides summary statistics at both fastq and SAM/BAM alignment levels.
    • Picard CalculateHsMetrics: (http://broadinstitute.github.io/picard) evaluates hybrid selection protocols (target coverage and AT/GC dropout levels).
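
To illustrate what a per base sequence quality report summarizes, the sketch below computes the mean Phred score at each read position. It is not FastQC's implementation; the function name `per_base_mean_quality` is hypothetical, and the code assumes four-line fastq records with Phred+33 encoded qualities.

```python
import io
from statistics import mean


def per_base_mean_quality(fastq_handle, phred_offset=33):
    """Return the mean Phred quality score at each read position.

    Assumes each record spans exactly four lines (no wrapped sequences)
    and that qualities are encoded with the given ASCII offset.
    """
    columns = []  # columns[i] collects every score observed at position i
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 3:  # the quality string is the 4th line of a record
            continue
        for position, char in enumerate(line.rstrip("\n")):
            if position == len(columns):
                columns.append([])
            columns[position].append(ord(char) - phred_offset)
    return [mean(scores) for scores in columns]


# Two made-up reads; quality drops toward the 3' end of the second read.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII5!\n"
)
print(per_base_mean_quality(example))
```

A steady decline in these per-position means toward the end of the reads is the kind of pattern that would suggest trimming.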

2. Mapping

Mapping to genome reference using BWA-mem (alternative algorithms available on request).

  • Deliverables:
    • bam files from the initial alignment (BWA-mem by default, though other algorithms are available if desired)
    • bam files resulting from further processing using GATK
  • Tools Used:
    • BWA-mem: (Li 2013) primary aligner used to generate first pass read alignments (BWA-aln and BWA-sampe also available if desired, as are bowtie/bowtie2).
    • GATK: (McKenna et al. 2010, Van der Auwera et al. 2013) IndelRealigner and BaseRecalibrator applied to correct indel-based misalignments and to improve the accuracy and dispersion of individual base quality scores.
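
As a rough illustration of the first-pass alignment step, the sketch below assembles a `bwa mem` command line for one paired-end sample. The helper name `bwa_mem_command`, the file names, and the sample name are hypothetical, and the read group here carries only an ID and sample tag; a real run may need further read group fields and different options.

```python
import shlex


def bwa_mem_command(reference, reads_1, reads_2, sample, threads=4):
    """Build the argument list for a paired-end bwa mem alignment.

    -t sets the thread count and -R the read group header line.
    bwa expects the tab separators in -R written as a literal backslash-t,
    hence the raw string.
    """
    read_group = rf"@RG\tID:{sample}\tSM:{sample}"
    return [
        "bwa", "mem",
        "-t", str(threads),
        "-R", read_group,
        reference, reads_1, reads_2,
    ]


# Hypothetical inputs, for illustration only.
command = bwa_mem_command("ref.fa", "sample1_R1.fastq", "sample1_R2.fastq", "sample1")
print(shlex.join(command))
```

bwa mem writes SAM to standard output, so in practice the list would be passed to something like `subprocess.run` with the output redirected to a file or piped into a sorting step.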

3a. Variant Calling Option 1: GATK

The Genome Analysis Toolkit (GATK) is used to call SNPs and indels according to the best practices recommended by the Broad Institute.

  • Deliverables:
    • individual sample vcf files output by HaplotypeCaller
    • regenotyped and recalibrated merged vcf file output by GenotypeGVCFs
  • Tools Used (GATK):
    • HaplotypeCaller: reassembles "active regions" and applies PairHMM algorithm to select most likely genotype
    • GenotypeGVCFs: jointly re-genotypes, re-annotates and merges individual sample gVCFs from HaplotypeCaller into single aggregated vcf file
    • VariantRecalibrator: recalibrates variant call probabilities based on call annotations
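
The genotype selection step can be sketched in simplified form. The code below is not HaplotypeCaller or PairHMM; it assumes the per-read likelihoods for each allele are already available (in GATK these would come from PairHMM) and applies the standard diploid model in which each read is equally likely to originate from either chromosome copy. The function names and the example numbers are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def genotype_log10_likelihoods(read_allele_likelihoods):
    """Return log10 P(reads | genotype) for every diploid genotype.

    read_allele_likelihoods holds one tuple per read:
    (P(read | allele 0), P(read | allele 1), ...).
    """
    n_alleles = len(read_allele_likelihoods[0])
    result = {}
    for a, b in combinations_with_replacement(range(n_alleles), 2):
        log10_lik = 0.0
        for per_allele in read_allele_likelihoods:
            # A read comes from either chromosome copy with probability 1/2.
            log10_lik += math.log10(0.5 * per_allele[a] + 0.5 * per_allele[b])
        result[(a, b)] = log10_lik
    return result


def most_likely_genotype(read_allele_likelihoods):
    """Return the genotype (pair of allele indices) with the highest likelihood."""
    likelihoods = genotype_log10_likelihoods(read_allele_likelihoods)
    return max(likelihoods, key=likelihoods.get)


# Made-up likelihoods: three reads favor the reference allele (index 0),
# three favor the alternate allele (index 1).
reads = [(0.99, 0.01)] * 3 + [(0.01, 0.99)] * 3
print(most_likely_genotype(reads))
```

With evenly split support, the heterozygous genotype `(0, 1)` has the highest likelihood; joint genotyping across samples and variant recalibration are separate later steps that this sketch does not cover.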

3b. Variant Calling Option 2: Somatic Mutation Identification

MuTect and MutSig from the Broad Institute are available for calling somatic mutations; other methods may be available upon request.

  • Deliverables:
    • MuTect and MutSig output files.
  • Tools Used:
    • MuTect: (Cibulskis et al. 2013) identifies somatic point mutations based on two Bayesian classifiers, each a log odds (LOD) score:
      • tumor LOD: likelihood of the observed tumor data given a mutant site compared to the likelihood given a reference site,
      • normal LOD: likelihood of the observed normal data given a reference site compared to the likelihood given a mutant site.
    • MutSig: (Lawrence et al. 2013) assesses significance of mutation calls using a null model based on background mutation processes.
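
The two LOD scores can be sketched as follows. This is a simplified illustration loosely based on the per-base likelihood model described in Cibulskis et al. 2013, not the MuTect implementation: it omits all of MuTect's filters and decision thresholds, estimates the tumor allele fraction as the observed fraction of alternate bases, and assumes a heterozygous (0.5) allele fraction for a mutant site in the normal. Function names and example data are hypothetical.

```python
import math


def log10_likelihood(bases, error_probs, ref, alt, allele_fraction):
    """log10 likelihood of the observed bases given an alt allele fraction."""
    total = 0.0
    for base, e in zip(bases, error_probs):
        if base == ref:
            p = allele_fraction * e / 3 + (1 - allele_fraction) * (1 - e)
        elif base == alt:
            p = allele_fraction * (1 - e) + (1 - allele_fraction) * e / 3
        else:
            p = e / 3  # a third base is explained only by sequencing error
        total += math.log10(p)
    return total


def tumor_lod(bases, error_probs, ref, alt):
    """Mutant-site model versus reference-site model for the tumor data."""
    f = sum(base == alt for base in bases) / len(bases)  # assumes non-empty pileup
    mutant = log10_likelihood(bases, error_probs, ref, alt, f)
    reference = log10_likelihood(bases, error_probs, ref, alt, 0.0)
    return mutant - reference


def normal_lod(bases, error_probs, ref, alt):
    """Reference-site model versus heterozygous mutant model for the normal data."""
    reference = log10_likelihood(bases, error_probs, ref, alt, 0.0)
    mutant = log10_likelihood(bases, error_probs, ref, alt, 0.5)
    return reference - mutant


# Made-up pileups: 6 of 20 tumor reads carry the alt base; the normal is all reference.
tumor_bases = ["C"] * 14 + ["T"] * 6
normal_bases = ["C"] * 20
errors = [0.001] * 20  # per-base error probability (Phred 30)
print(tumor_lod(tumor_bases, errors, "C", "T"))
print(normal_lod(normal_bases, errors, "C", "T"))
```

A candidate site looks somatic when both scores are high: the tumor data strongly favor a mutation, and the normal data strongly favor the reference.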

4. Annotation

Further annotation of variant calls may be provided using ANNOVAR.

  • Deliverables:
    • ANNOVAR output in tabular format (plain text, CSV, or Excel, as desired).
  • Tools Used:
    • ANNOVAR: (Wang et al. 2010) provides functional annotation of genetic variation encompassing multiple modalities (e.g., gene and region annotation and/or filtration based on established data sets).
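
Converting a tab-delimited annotation table to CSV is straightforward; a minimal sketch using Python's standard `csv` module follows. The helper name `tsv_to_csv` and the column names in the example are hypothetical and are not taken from actual ANNOVAR output.

```python
import csv
import io


def tsv_to_csv(tsv_handle, csv_handle):
    """Copy a tab-delimited table to CSV and return the number of rows written.

    Fields that contain commas are quoted automatically by csv.writer.
    """
    reader = csv.reader(tsv_handle, delimiter="\t")
    writer = csv.writer(csv_handle, lineterminator="\n")
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


# Made-up table with one field that itself contains a comma.
source = io.StringIO("Chr\tStart\tGene\n1\t12345\tGENE1,GENE2\n")
target = io.StringIO()
tsv_to_csv(source, target)
print(target.getvalue())
```

With real files, the same function can be given handles opened with `open(path, newline="")` for both input and output.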

 
