...

This next exercise will give you some idea of how Annovar works. We've taken the liberty of writing the bash script annovar_pipe.sh around the existing summarize_annovar.pl wrapper (a wrapper within a wrapper, which is a common trick) to simplify the process even further for this course.

Running Annovar

Get some data:

First we want to move to a new location on $SCRATCH:

...

Note that the block above does not show how to make the edits, or how to save and close the slurm file. The needed edits are:

| Line number | As is | To be |
|---|---|---|
| 16 | #SBATCH -J jobName | #SBATCH -J spades |
| 17 | #SBATCH -n 1 | #SBATCH -n 6 |
| 22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress> |
| 23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all |
| 29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=annovar_commands |

The changes to lines 22 and 23 are optional, but they will give you an idea of the types of email you can expect from TACC if you choose to use these options. If you do edit them, make sure each of these 2 lines starts with a single # symbol afterwards; a line that still starts with ## is treated as a comment and ignored.
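If you would rather script these edits than make them by hand in an editor, the sketch below shows one possible way with sed. It is only an illustration: the temporary file is a hypothetical stand-in that contains just the five lines being changed (not the real slurm file), and the substitutions match on the text of each line rather than on its line number.

```shell
# Hypothetical stand-in for the slurm file: only the five lines to be edited.
slurm_file=$(mktemp)
cat > "$slurm_file" <<'EOF'
#SBATCH -J jobName
#SBATCH -n 1
##SBATCH --mail-user=ADD
##SBATCH --mail-type=all
export LAUNCHER_JOB_FILE=commands
EOF

# Apply the edits in place; -i.bak keeps a backup copy next to the file.
# Replace <YourEmailAddress> with your own address before using the mail lines.
sed -i.bak \
    -e 's/^#SBATCH -J jobName$/#SBATCH -J spades/' \
    -e 's/^#SBATCH -n 1$/#SBATCH -n 6/' \
    -e 's/^##SBATCH --mail-user=ADD$/#SBATCH --mail-user=<YourEmailAddress>/' \
    -e 's/^##SBATCH --mail-type=all$/#SBATCH --mail-type=all/' \
    -e 's/^export LAUNCHER_JOB_FILE=commands$/export LAUNCHER_JOB_FILE=annovar_commands/' \
    "$slurm_file"

# List the active directives: lines that begin with exactly one # before SBATCH.
grep -n '^#SBATCH' "$slurm_file"
```

The final grep is a quick check for the single-# requirement: any mail line that still begins with ## will not appear in its output.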

...

Again, use ctrl-o and ctrl-x to save the file and exit.


Analyzing the results

Accessing pre-computed results

...

Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis!  This is one of the many reasons I like Annovar.
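To see where the annotation columns end and the original VCF fields begin, you can look up the position of the "LJB_GERP++" field in the header line. The sketch below is a minimal illustration using a hypothetical, heavily shortened header (the real header has many more columns, and the names other than LJB_GERP++ are placeholders); it also assumes no header field contains an embedded comma.

```shell
# Hypothetical, shortened header line used only for illustration.
header='Func,Gene,ExonicFunc,LJB_GERP++,Otherinfo'

# Put one field per line, then find the line number of the exact field name.
# -F treats the name as a fixed string (so "+" is literal), -x requires a whole-line match.
col=$(printf '%s\n' "$header" | tr ',' '\n' | grep -n -x -F 'LJB_GERP++' | cut -d: -f1)

echo "LJB_GERP++ is column $col; the original VCF fields start at column $((col + 1))"
```

On the real file, the header would come from the first line of NA12878.chrom20.GATK.vcf.exome_summary.csv instead of the hard-coded string.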

Scavenger hunts! and command line building

Find the gene with two frameshift deletions in NA12878. See what you can come up with on your own first; the answer below also expands on how to generate more meaningful representations of the data.

The final answer is "DEFB126"

```bash
# Print every line that contains the text "frameshift".
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv

# The first few comma-separated columns hold the information you are interested in:
# location classification, gene, and mutation type. That suggests adding awk to print
# only some columns; the -F"," syntax forces it to split on commas.
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}'

# This is easier to read, and you can probably already see which gene appears more than once.
# But why stop there? Add uniq -c to the pipe so Linux counts for us.
# Note that uniq -c only merges identical lines that are next to each other.
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c

# For the number of mutations we have here, that is sufficient. With more mutations you will
# want a defined order: sort first so identical lines are adjacent, then sort the counts.
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | sort | uniq -c | sort -rn  # -r reverses the order, -n compares the counts as numbers
```

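One detail of the counting step is worth seeing on its own: uniq -c only collapses identical lines that are adjacent, so on unordered input it can under-count. The sketch below uses made-up gene names (GENE_A, GENE_B, GENE_C are placeholders, not Annovar output) to show the difference between counting with and without a preceding sort.

```shell
# Made-up input: the same names appear several times, but never on adjacent lines.
genes=$(printf '%s\n' GENE_A GENE_B GENE_A GENE_C GENE_B GENE_A)

# Without sorting, uniq -c sees no adjacent repeats and reports every line with a count of 1.
unsorted=$(printf '%s\n' "$genes" | uniq -c)

# Sorting first groups identical names, so the counts are correct;
# the final sort -rn puts the most frequent name on top.
sorted=$(printf '%s\n' "$genes" | sort | uniq -c | sort -rn)

echo "Without sort:"
printf '%s\n' "$unsorted"
echo "With sort:"
printf '%s\n' "$sorted"
```

With the sort in place, GENE_A is reported with a count of 3; without it, each of the six input lines is listed separately.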

...