...

Code Block
languagebash
titleUse only 1 of the following copy commands to get some data
collapsetrue
# if you have already done the trios tutorial
cp $SCRATCH/GVA_Human_trios/raw_files/N*.vcf .

# if you have not done the trios tutorial
cp $BI/ngs_course/human_variation/N*.vcf .
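
Either way, you should end up with .vcf files for the 3 individuals from each of the 2 variant callers (samtools and gatk) that are analyzed below. A quick optional check, assuming the N*.vcf wildcard matched only these files:

Code Block
languagebash
titleOptional: check that the expected .vcf files arrived
ls N*.vcf             # 3 individuals x 2 variant callers
ls N*.vcf | wc -l     # expect: 6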

Get access to annovar

Unfortunately, we have finally found a program that conda won't install for us. In a related matter, if you look at the annovar page itself, access to the newest version of annovar is behind a registration wall, which is uncommon though not without precedent. Here we will therefore work with an older version of annovar, though if you use it in your own work you are strongly encouraged to register for and work with the newest version.

The final complication is that while the BioITeam has everything required to run annovar, stampede2 compute nodes can't access those locations. This means we need to copy a database of annotations, and several scripts, to our scratch directory in order to run the analysis.

Code Block
languagebash
titleCopy the necessary database containing indexes that annovar will use
# note this cannot be done on an idev node. The command is expected to take several minutes and generate no output.
cp -r $BI/ref_genome/annovar/hg19_annovar_db .

This folder is very large, so it is expected to take several minutes to copy.
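
If you would like to confirm the copy completed, du can report the size of the folder (an optional check; hg19_annovar_db is the directory created by the copy command above):

Code Block
languagebash
titleOptional: confirm the database copied and check its size
du -sh hg19_annovar_db          # total size of the annotation database
ls hg19_annovar_db | wc -l      # number of annotation/index files present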

Code Block
languagebash
titleannovar commands
cp /corral-repl/utexas/BioITeam/bin/annotate_variation.pl .
cp /corral-repl/utexas/BioITeam/bin/convert2annovar.pl .
cp /corral-repl/utexas/BioITeam/bin/summarize_annovar.pl .
cp /corral-repl/utexas/BioITeam/bin/annovarPipe.sh .
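
Since the commands file generated below will invoke annovarPipe.sh directly, a quick optional check that the 4 scripts arrived and carry execute permissions can save a failed run (the chmod is harmless if they are already executable):

Code Block
languagebash
titleOptional: verify the scripts copied and are executable
ls -l annotate_variation.pl convert2annovar.pl summarize_annovar.pl annovarPipe.sh
chmod +x annotate_variation.pl convert2annovar.pl summarize_annovar.pl annovarPipe.sh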

...

Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and gatk output in the $BI/ngs_course/human_variation/ directory. (You may recognize these as the same individuals that we worked with in the Trios tutorial.) Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file using perl. As you become more proficient with the command line, it is likely you will use various piping techniques such as these to generate commands files. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file:

Code Block
titleCreate the "commands" file to run annovar on six vcf files - use Perl to extract the sample name and mapper so the log files have meaningful, but shorter, names
ls *.vcf | \
  perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovarPipe.sh $_ hg19_annovar_db >$1.$2.log 2>&1 \n";' > annovar_commands

Note how the print statement includes redirections for generating log files, and the entire output is redirected to a file named annovar_commands.

Expand
titleinvestigate the commands file to determine what the piping command actually did (click for answer)
Code Block
languagebash
titleThink about the commands that let you look at the contents of a file
collapsetrue
cat annovar_commands
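
For reference, each line of annovar_commands should look something like the following (the exact .vcf file names depend on what the ls command found; NA12878.chr20.samtools.vcf is a hypothetical example):

Code Block
languagebash
titleExample of a single generated command line (file name is hypothetical)
annovarPipe.sh NA12878.chr20.samtools.vcf hg19_annovar_db >NA12878.sam.log 2>&1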


Submitting the job

As you may have guessed when we started creating a commands file, this analysis is headed for the job queue system. As we have done elsewhere: copy the launcher file, then make the edits relevant to the analysis we are attempting to perform.


Code Block
languagebash
titleCopy the launcher file, modify it, and submit it to the queue system
cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm annovar.slurm
nano annovar.slurm
sbatch annovar.slurm

Note that the above block does not include how to make the edits, nor how to save and close the slurm file. The needed edits are:

Line number | As is | To be
16 | #SBATCH -J jobName | #SBATCH -J annovar
17 | #SBATCH -n 1 | #SBATCH -n 6
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress>
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all
27 | conda activate GVA2021 | (line removed)
31 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=annovar_commands

The changes to lines 22 and 23 are optional, but will give you an idea of what types of email you can expect from TACC if you choose to use these options. Just be sure that these 2 lines start with a single # symbol after you edit them.

Line 27, for the first time, does not include a conda activation command, as we are not working within a conda environment for this analysis. By the same token, it would be acceptable to list any environment, since none is actually used.

A 12 hour run is requested because, while I have been able to verify the analysis works with a subset of the data, I have not yet been able to get the full analysis to complete. I am assuming that it will complete in less than 12 hours, and will update this page when I have verified this.
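
Put together, the edited lines of annovar.slurm should read approximately as follows (a sketch only: the line numbers refer to the table above, and every other line of the copied launcher file is left unchanged):

Code Block
languagebash
titleSketch of the edited lines in annovar.slurm
# line 16: name the job after the analysis
#SBATCH -J annovar
# line 17: one task per vcf file
#SBATCH -n 6
# lines 22-23: optional email notifications (note the single leading # after editing)
#SBATCH --mail-user=<YourEmailAddress>
#SBATCH --mail-type=all
# line 27: the conda activate line is removed entirely
# line 31: point the launcher at our generated commands file
export LAUNCHER_JOB_FILE=annovar_commands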

Again, use ctrl-o and ctrl-x to save the file and exit. The easiest way to check that things are working is to use ls and watch for an explosion of new files. The run will take quite a bit of time to complete. You can check that the job is still queued or running with showq -u, and follow its progress through the new .log files as they are created. As the commands take some time to run, pre-computed versions of these outputs are available so you can begin evaluating the results if you don't want to wait for them to finish.
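
Some concrete ways to check on the run (the log file name here is hypothetical; yours will follow the <sample>.<mapper>.log pattern created by the Perl one-liner above):

Code Block
languagebash
titleOptional ways to monitor the run
showq -u                 # is the job still queued or running?
ls -lt | head            # newest output files appear at the top as annovar works
tail NA12878.sam.log     # hypothetical log name; watch one sample's progress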


Accessing pre-computed results

...

The exome_summary.csv is probably the most useful file because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):


Field | Description
Func | exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic
Gene | The common gene name
ExonicFunc | frameshift insertion/deletion/block subst., stopgain, stoploss, nonframeshift ins/del/block subst., nonsynonymous SNV, synonymous SNV, or Unknown
AAChange | Amino acid change (in gene coordinates)
Conserved | Whether the SNP is in a conserved region, based on the UCSC 46-way conservation model
SegDup | Whether the SNP is in a segmental duplication region
ESP5400_ALL | Alternate allele frequency in 3510 NHLBI ESP European American samples
1000g2010nov_ALL | Alternative allele frequency in 1000 genomes pilot project 2012 Feb release (the minor allele could be the reference or alternative allele)
dbSNP132 | The id# in dbSNP, if it exists
AVSIFT | The AVSIFT score of how deleterious the variant might be
LJB_PhyloP | Conservation score provided by dbNSFP, re-scaled from the original PhyloP score. The new score ranges from 0 to 1, with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95: if the score is > 0.95 the prediction is "conservative", if the score is < 0.95 the prediction is "non-conservative"
LJB_PhyloP_Pred | The corresponding categorical prediction
LJB_SIFT | SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated
LJB_SIFT_Pred | The corresponding categorical prediction
LJB_PolyPhen2 | Functional prediction score for non-syn variants from PolyPhen2 provided by dbNSFP (a higher score represents a functionally more deleterious variant). A score greater than 0.85 corresponds to a prediction of "probably damaging"; the prediction is "possibly damaging" if the score is between 0.85 and 0.15, and "benign" if the score is below 0.15
LJB_PolyPhen2_Pred | The corresponding categorical prediction
LJB_LRT | Functional prediction score for non-syn variants from LRT provided by dbNSFP (a higher score represents a functionally more deleterious variant). It ranges from 0 to 1. This score needs to be combined with other information when making a prediction; if a threshold has to be picked, 0.995 can be used as a starting point
LJB_LRT_Pred | The corresponding categorical prediction
LRT_MutationTaster | Functional prediction score for non-syn variants from MutationTaster provided by dbNSFP (a higher score represents a functionally more deleterious variant). The score ranges from 0 to 1. Similar to LRT, the prediction does not depend entirely on the score alone, but if a threshold has to be picked, 0.5 is recommended as the starting point
LRT_MutationTaster_Pred | The corresponding categorical prediction
LJB_GERP++ | Higher scores are more deleterious
Chr | Chromosome
Start | Start position
End | End position
Ref | Reference base
Obs | Observed base-pair or variant
(remaining columns) | The SNP quality value, filter information, and GT:PL:GQ for each file (ALL the VCF info is here!!)


Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis!  This is one of the many reasons I like Annovar.
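
As a hypothetical first step toward that functional analysis, even simple command-line filters go a long way on this csv (the exact file name in your directory may carry a sample-specific prefix):

Code Block
languagebash
titleExample: quick first looks at an exome summary csv (file name may differ)
head -1 exome_summary.csv                        # header row maps field names to column positions
grep -c "nonsynonymous SNV" exome_summary.csv    # count lines containing nonsynonymous SNVs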

...

Return to GVA2021