Page tree
Skip to end of metadata
Go to start of metadata

Annotating Variants: Introduction

As we've already seen, determining the presence or absence of a variant from NGS data is not trivial. It is software dependent and has inherent trade-offs between sensitivity and specificity. Inevitably, the number of putative variants in a real data set is very large; for example the first samples from the 1000 Genomes project typically found 0.1% variants (3 million variants), approximately 10% of which had never been previously observed (300,000 novel variants per individual.) False positive discovery rates are also typically very high at this stage.

Auxiliary data is often used to reduce putative variants without compromising sensitivity. Examples of auxiliary data include other samples within a cohort, existing SNP databases, gene or other feature annotations, and sample-specific information such as pedigree:

  • By comparing genotypes across a set of samples and defining one as "reference" (or "wild type") enables other samples to be properly genotyped (i.e. 0/0 for hom. WT, 0/1 for het, 1/1 for hom. alt)
  • Existing SNP databases such as dbSNP or the vcf files from the 1000 genomes project may be used to reject "common" variants under the assertion that "common" means "non-disease causing".
  • Gene annotations allow for codon analysis to determine whether mutations are synonymous, non-synonymous, nonsense, mis-sense, or create early stop codons.
  • Pedigree information is particularly effective in mendelian autosomal recessive diseases; filtering for heterozygous mutations in parents and unaffected siblings which are homozygous in the proband usually yields a very small set of candidate variants.

Variant annotation tools perform the function of combining the raw putative variant calls with auxiliary data to add meaning ("annotation") to the variants. In many cases, the variant detection tool itself will add certain elements of annotation, such as a definition of the variant, a genotype call, a measure of likelihood, a haplotype score, and other measures of the raw data useful to reduce false positives. In other cases, the annotator will only require a vcf file combined with other auxiliary data.

Because these tools draw in information from may disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools.

Annovar - one of the most powerful yet simple to run variant annotators available

Annovar is a variant annotator. Given a vcf file from an unknown sample and a host of existing data about genes, other known SNPs, gene variants, etc., Annovar will place the discovered variants in context.

Annovar comes pre-packaged with human auxiliary data which is updated by the authors on a regular basis. It is a well-constructed package in that there is one core program which can perform a variety of different types of annotation AND download the reference databases required.

The authors have also included a wrapper script which runs a fairly comprehensive set of annotations automatically. You may be asking yourself "where can I find this awesome program?", but hopefully by now your assumption is that it is either on TACC or in the BioITeam folder. Generally speaking "programs" that consist of a series of scripts without many complex dependencies can easily be installed in the $BI folders. While the most popular programs will eventually be turned into modules. Despite its power, you can find this program in the $BI folders.

This next exercise will give you some idea of how Annovar works; we've taken the liberty of writing the bash script around the existing wrapper (a wrapper within a wrapper - a common trick) to even further simplify the process for this course.

Get some data:

First we want to move to a new location on $SCRATCH

Solution to make a new directory on scratch. You should know this answer.
mkdir $SCRATCH/GVA_Annovar
cd $SCRATCH/GVA_Annovar
Use only 1 of the following copy commands to get some data
 # if you have already done the trios tutorial
cp $SCRATCH/GVA_Human_trios/raw_files/N*.vcf .  
# if you have not done the trios tutorial
cp $BI/ngs_course/human_variation/N*.vcf .
Copy the necessary database containing indexes that annovar will use
# note this can not be done on an idev node. The command is expected to take several minutes, and generate no output.
cp -r $BI/ref_genome/annovar/hg19_annovar_db .

Setting up the commands

The BioITeam has set up 2 helpful wrapper commands for using annovar. The first ( calls annovar and summarizes the results. The second ( does some file conversions to prepare for annovar and then calls the script. 

 Click here for a hint

The which command is used to give the location of a program or script that is in your $PATH.

Here is an easy one-liner to cat the contents of a script (note ` is a back-tick, not apostrophe)
cat `which`  # the command within the `` is evaluated first and the output placed within the `` marks to be evaluated by the outer command
 Figure out what is doing using the back tick trick you just learned

Again note the ` characters are "backtick", not apostrophes

Print out the text of the perl script
more `which`  

The more command was used to make it slightly easier to read and to remind you that there are multiple options of how to look a things (less, head, tail, etc), eventually you will develop your own habits but the more you are exposed to the fewer instances of angry eureka moments you will have. When working with unknown files or file types, cat should typically be avoided unless you run a wc -l command else you risk crashing your terminal window.

This script is written in perl (a programing language that while powerful, is not very favored among most computational biologists who typically prefer: Python, R, or Bash shell scripts) as a method of standardizing input to call a series of external commands. Such wrappers (or wrappers within wrappers) make your life much easier. breseq itself has wrappers for using bowtie2 and samtools which you ran separately on other data. As you increase your understanding of scripts and the command line by looking at what others have done you may begin to make your own wrappers and small scripts, but such wrappers are a great example of the BioITeam and other community resources can provide you with. While I have told you no one will ever care as much about your data as you do, quality of life issues regarding repetitive treatment of similar inputs and outputs may be easily solved by someone else.

Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and gatk output in the $BI/ngs_course/human_variation/ directory. (You may recognize these as the same individuals that we worked with on the Trios tutorial. Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file. As you become more proficient with the command line, it is likely you will use various piping techniques to generate commands file. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file:

Create the "commands" file to run annovar on six vcf files - use Perl to extract the sample name and mapper so the log files have meaningful, but shorter, names
ls *.vcf | \
  perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print " $_ hg19_annovar_db >$1.$2.log 2>&1 &\n";' > commands

Note how the print statement includes redirections for generating log files, and the entire output is redirected to a file named commands.

A helpful note to make your life better.

"commands" files are 1 way to both create a record of what it is that you are doing as well as an easy way to execute multiple commands simultaneously. In a previous tutorial (the advanced breseq tutorial) we showed you how to do this by adding an & as the last character of the line to force the command to execute in the background. Tomorrow we will go over the more common method of submitting "commands" files to the que to run as jobs rather than being spoiled with interacting with everything on idev nodes. In computational biology this instructor has often found that once I learn to do something a particular way, it takes an incredible amount of inertia to change how I do something even when i know it would help to do so. In an effort to learn from me making things harder on myself than need be, I urge you to STRONGLY consider not naming every list of commands you want to run "commands" but rather something that is actually descriptive, even "commands_DATE" would be helpful. As you start submitting your own jobs you will quickly be able to build bad habits, try to be aware of and avoid this one if possible.
 investigate the commands file to to determine what the piping command actually did (click for answer)
Think about the commands that let you look at what the contents of a file are
cat commands

Executing the commands

Remember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

Create the submission script and submit
chmod +x commands

The easiest way to check if this is working is to use ls and see an explosion of new files. This will take quite a bit of time to complete running. You can check that things are running using the ps command, and looking at the new .log files which are created. As the commands take some time to run, pre-computed versions of these outputs are available so you can begin evaluating the results if you don't want to wait for them to finish. 

Accessing pre-computed results

Use this code block to copy the completed results to immediately begin evaluating the results. Additionally, when your results have completed running (likely >90 minutes as annovar not configured to use multiple processes), compare your results to these provided results.
mkdir provided_results
cd provided_results
cp $BI/ngs_course/human_variation/*chrom20.samtools* .
cp $BI/ngs_course/human_variation/*chrom20.GATK* .

ANNOVAR output

Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you still have a long way to go - compare this to RUNES or CarpeNovo).  It provides all these output files:

Example ANNOVAR output on the NA12878 vcf file

The exome_summary.csv is probably the most useful files because it brings together nearly all the useful information.  Here are the fields in that file (see these docs for more information, or thAnnovar filter descriptions page here):

Funcexonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic
GeneThe common gene name
ExonicFuncframeshift insertion/deletion/block subst, stopgain, stoploss, nonframeshift ins/del/block stubst., nonsynonymous SNV, synonymous SNV, or Unknown
AAChange (in gene coordinates)
Conserved (i.e. SNP is in a conserved region)based on the UCSC 46-way conservation model
SegDup (snp is in a segmental dup. region)
ESP5400_ALLAlternate Allele Frequency in 3510 NHLBI ESP European American Samples

Alternative Allele Frequency in 1000 genomes pilot project 2012 Feb release (minor allele could be reference or alternative allele).

dbSNP132The id# in dbSNP if it exists
AVSIFTThe AVSIFT score of how deleterious the variant might be
LJB_PhyloPConservation score provided by dbNSFP which is re-scaled from original phylop score. The new score ranges from 0-1 with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95. If the score > 0.95, the prediction is "conservative". if the score <0.95, the prediction is "non-conservative". 
LJB_SIFTSIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence.  Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated. 

Functional prediction score for non-syn variants from Polyphen2 provided by dbNSFP  (higher score represents functionally more deleterious). A score greater than 0.85 corresponds to prediciton of "probably damaging". The prediciton is "possbily damaging" if score is between 0.85 and 0.15, and "benign" if score is below 0.15.

LJB_LRTFunctional prediction score for non-syn variants from LRT provided by dbNSFP (higher score represents functionally more deleterious. It ranges from 0 to 1. This score needs to be combined with other information prediction. If a threshold has to be picked up under some situation, 0.995 can be used as starting point. 

Functional prediction score for non-syn variants from Mutation Taster provided by dbNSFP  (higher score represents functionally more deleterious). The score ranges from 0 to 1. Similar to LRT, the prediction is not entirely depending on the score alone. But if a threshold has to be picked, 0.5 is the recommended as the starting point.  

LJB_GERP++higher scores are more deleterious
RefReference base
ObsObserved base-pair or variant
SNP Quality value
filter information
(ALL the VCF info is here!!)
GT:PL:GQ for each file!

Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis!  This is one of the many reasons I like Annovar.

Scavenger hunts!

 Find the gene with two frameshift deletions in NA12878. See what you can come up with as answer on your own and then click here for an answer and an expansion of how to generate more meaningful representations of that data

The final answer is "DEFB126"

grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv  # this will print all the lines which contains the text "frameshift"
# From the output you can key into the first few columns having the information you are interested in: location-classification, gene, mutation type each separated by commas. This should lead you to think about adding the awk command to print only some columns.
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}'  # the -F"," syntax forces it to split on commas
# you will likely notice this data is easier to visualize, and in this case you can probably see what gene is represented multiple times, but why stop there ... lets add the uniq -c command to the pipes to have linux count for us
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c 
# for the number of mutations we have this is sufficient, but for increased numbers of mutations where you may be interested in displaying them in a particular order. This can be done by adding the sort command

grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r  # the -r option on the sort command  sorts in reverse order
 Compare the output of GATK and SAMtools for the NA12878 sample to see if you can find any differences
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r
grep "frameshift" NA12878.chrom20.samtools.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r

GATK detects 2 frameshift insertions in DNMT3B, and 1 each in PRNP and LAMA5. SAMtools detects 1 frameshift deletion in each of NTSR1 and CTSA that GATK does not detect.

 Test "genetic drift" vs. "functional selection" - e.g. is the distribution of variants different among non-coding regions, synonymous changes in coding regions, and non-synonymous changes in coding regions?

Compare the output of these three commands:

grep intergenic NA12878.chrom20.samtools.vcf.genome_summary.csv | awk 'BEGIN {FS=","} {print $26"\t"$27}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w synonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w nonsynonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20

Do you notice a pattern?

What's the right statistical test to determine whether non-synonymous mutations might be under different selective pressure than intergenic or synonymous mutations from this data?

 Thinking back to our first example where we looked for a gene with 2 frameshift deletions, can you think of a way to look at all 3 patients' GATK analysis results at once, and look for the total number of mutations detected (sorted by gene) displayed in descending order?
The final form of the command that you had was:
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r
Try to come up with a solution on your own before checking your answer here
cat *GATK.vcf.exome_summary.csv | grep "exonic" | awk -F"," '{print $2,"\t"$3}' | sort | uniq -c | sort -nr

Some questions for thought:

  1. Why might this type of analysis be useful?
  2. What other greps would make this more useful?

Other variant annotators:

Return to GVA2020

  • No labels