...

Code Block
languagebash
titleUse only 1 of the following copy commands to get some data
collapsetrue
# if you have already done the trios tutorial
cp $SCRATCH/GVA_Human_trios/raw_files/N*.vcf .

# if you have not done the trios tutorial
cp $BI/ngs_course/human_variation/N*.vcf .
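
Either way, you should end up with .vcf files for the 3 individuals from each of the 2 variant callers (samtools and gatk) that are analyzed below. A quick optional check, assuming the N*.vcf wildcard matched only these files:

Code Block
languagebash
titleOptional: check that the expected .vcf files arrived
ls N*.vcf             # 3 individuals x 2 variant callers
ls N*.vcf | wc -l     # expect: 6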

Get access to annovar

Unfortunately, we have finally found a program that conda won't install for us. In a related matter, if you look at the annovar page itself, access to the newest version of annovar is behind a registration wall, which is uncommon though not without precedent. Here we will therefore work with an older version of annovar, though if you use it in your own work you are strongly encouraged to register for and work with the newest version.

The final complication is that while the BioITeam has everything required to run annovar, stampede2 compute nodes can't access those locations. This means we need to copy a database of annotations, and several scripts, to our scratch directory in order to run the analysis.

Code Block
languagebash
titleCopy the necessary database containing indexes that annovar will use
# note this cannot be done on an idev node. The command is expected to take several minutes and generate no output.
cp -r $BI/ref_genome/annovar/hg19_annovar_db .

This folder is very large, so it is expected to take several minutes to copy.
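
If you would like to confirm the copy completed, du can report the size of the folder (an optional check; hg19_annovar_db is the directory created by the copy command above):

Code Block
languagebash
titleOptional: confirm the database copied and check its size
du -sh hg19_annovar_db          # total size of the annotation database
ls hg19_annovar_db | wc -l      # number of annotation/index files present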

Code Block
languagebash
titleannovar commands
cp /corral-repl/utexas/BioITeam/bin/annotate_variation.pl .
cp /corral-repl/utexas/BioITeam/bin/convert2annovar.pl .
cp /corral-repl/utexas/BioITeam/bin/summarize_annovar.pl .
cp /corral-repl/utexas/BioITeam/bin/annovarPipe.sh .
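
Since the commands file generated below will invoke annovarPipe.sh directly, a quick optional check that the 4 scripts arrived and carry execute permissions can save a failed run (the chmod is harmless if they are already executable):

Code Block
languagebash
titleOptional: verify the scripts copied and are executable
ls -l annotate_variation.pl convert2annovar.pl summarize_annovar.pl annovarPipe.sh
chmod +x annotate_variation.pl convert2annovar.pl summarize_annovar.pl annovarPipe.sh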

...

Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and gatk output in the $BI/ngs_course/human_variation/ directory. (You may recognize these as the same individuals that we worked with in the Trios tutorial.) Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file using perl. As you become more proficient with the command line, it is likely you will use various piping techniques such as these to generate commands files. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file:

Code Block
titleCreate the "commands" file to run annovar on six vcf files - use Perl to extract the sample name and mapper so the log files have meaningful, but shorter, names
ls *.vcf | \
  perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovarPipe.sh $_ hg19_annovar_db >$1.$2.log 2>&1 \n";' > annovar_commands

Note how the print statement includes redirections for generating log files, and the entire output is redirected to a file named annovar_commands.

Expand
titleinvestigate the commands file to determine what the piping command actually did (click for answer)
Code Block
languagebash
titleThink about the commands that let you look at the contents of a file
collapsetrue
cat annovar_commands
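
For reference, each line of annovar_commands should look something like the following (the exact .vcf file names depend on what the ls command found; NA12878.chr20.samtools.vcf is a hypothetical example):

Code Block
languagebash
titleExample of a single generated command line (file name is hypothetical)
annovarPipe.sh NA12878.chr20.samtools.vcf hg19_annovar_db >NA12878.sam.log 2>&1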


Submitting the job

As you may have guessed when we started creating a commands file, this analysis is headed for the job queue system. As we have done elsewhere: copy the launcher file, then make the edits relevant to the analysis we are attempting to perform.


Code Block
languagebash
titleCopy the launcher file, modify it, and submit it to the queue system
cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm annovar.slurm
nano annovar.slurm
sbatch annovar.slurm

Note that the above block does not include how to make the edits, nor how to save and close the slurm file. The needed edits are:

Line number | As is | To be
16 | #SBATCH -J jobName | #SBATCH -J annovar
17 | #SBATCH -n 1 | #SBATCH -n 6
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress>
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all
27 | conda activate GVA2021 | (line removed)
31 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=annovar_commands

The changes to lines 22 and 23 are optional, but will give you an idea of what types of email you can expect from TACC if you choose to use these options. Just be sure that these 2 lines start with a single # symbol after you edit them.

Line 27, for the first time, does not include a conda activation command, as we are not working within a conda environment for this analysis. By the same token, it would be acceptable to list any environment, since none is actually used.

A 12 hour run is requested because, while I have been able to verify the analysis works with a subset of the data, I have not yet been able to get the full analysis to complete. I am assuming that it will complete in less than 12 hours, and will update this page when I have verified this.
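
Put together, the edited lines of annovar.slurm should read approximately as follows (a sketch only: the line numbers refer to the table above, and every other line of the copied launcher file is left unchanged):

Code Block
languagebash
titleSketch of the edited lines in annovar.slurm
# line 16: name the job after the analysis
#SBATCH -J annovar
# line 17: one task per vcf file
#SBATCH -n 6
# lines 22-23: optional email notifications (note the single leading # after editing)
#SBATCH --mail-user=<YourEmailAddress>
#SBATCH --mail-type=all
# line 27: the conda activate line is removed entirely
# line 31: point the launcher at our generated commands file
export LAUNCHER_JOB_FILE=annovar_commands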

Again, use ctrl-o and ctrl-x to save the file and exit. The easiest way to check that things are working is to use ls and watch for an explosion of new files. The run will take quite a bit of time to complete. You can check that the job is still queued or running with showq -u, and follow its progress through the new .log files as they are created. As the commands take some time to run, pre-computed versions of these outputs are available so you can begin evaluating the results if you don't want to wait for them to finish.
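
Some concrete ways to check on the run (the log file name here is hypothetical; yours will follow the <sample>.<mapper>.log pattern created by the Perl one-liner above):

Code Block
languagebash
titleOptional ways to monitor the run
showq -u                 # is the job still queued or running?
ls -lt | head            # newest output files appear at the top as annovar works
tail NA12878.sam.log     # hypothetical log name; watch one sample's progress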


Accessing pre-computed results

...

The exome_summary.csv is probably the most useful file because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):


Field | Description
Func | exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic
Gene | The common gene name
ExonicFunc | frameshift insertion/deletion/block subst., stopgain, stoploss, nonframeshift ins/del/block subst., nonsynonymous SNV, synonymous SNV, or Unknown
AAChange | Amino acid change (in gene coordinates)
Conserved | Whether the SNP is in a conserved region, based on the UCSC 46-way conservation model
SegDup | Whether the SNP is in a segmental duplication region
ESP5400_ALL | Alternate allele frequency in 3510 NHLBI ESP European American samples
1000g2010nov_ALL | Alternative allele frequency in 1000 genomes pilot project 2012 Feb release (the minor allele could be the reference or alternative allele)
dbSNP132 | The id# in dbSNP, if it exists
AVSIFT | The AVSIFT score of how deleterious the variant might be
LJB_PhyloP | Conservation score provided by dbNSFP, re-scaled from the original PhyloP score. The new score ranges from 0 to 1, with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95: if the score is > 0.95 the prediction is "conservative", if the score is < 0.95 the prediction is "non-conservative"
LJB_PhyloP_Pred | The corresponding categorical prediction
LJB_SIFT | SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated
LJB_SIFT_Pred | The corresponding categorical prediction
LJB_PolyPhen2 | Functional prediction score for non-syn variants from PolyPhen2 provided by dbNSFP (a higher score represents a functionally more deleterious variant). A score greater than 0.85 corresponds to a prediction of "probably damaging"; the prediction is "possibly damaging" if the score is between 0.85 and 0.15, and "benign" if the score is below 0.15
LJB_PolyPhen2_Pred | The corresponding categorical prediction
LJB_LRT | Functional prediction score for non-syn variants from LRT provided by dbNSFP (a higher score represents a functionally more deleterious variant). It ranges from 0 to 1. This score needs to be combined with other information when making a prediction; if a threshold has to be picked, 0.995 can be used as a starting point
LJB_LRT_Pred | The corresponding categorical prediction
LRT_MutationTaster | Functional prediction score for non-syn variants from MutationTaster provided by dbNSFP (a higher score represents a functionally more deleterious variant). The score ranges from 0 to 1. Similar to LRT, the prediction does not depend entirely on the score alone, but if a threshold has to be picked, 0.5 is recommended as the starting point
LRT_MutationTaster_Pred | The corresponding categorical prediction
LJB_GERP++ | Higher scores are more deleterious
Chr | Chromosome
Start | Start position
End | End position
Ref | Reference base
Obs | Observed base-pair or variant
(remaining columns) | The SNP quality value, filter information, and GT:PL:GQ for each file (ALL the VCF info is here!!)


Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis!  This is one of the many reasons I like Annovar.
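
As a hypothetical first step toward that functional analysis, even simple command-line filters go a long way on this csv (the exact file name in your directory may carry a sample-specific prefix):

Code Block
languagebash
titleExample: quick first looks at an exome summary csv (file name may differ)
head -1 exome_summary.csv                        # header row maps field names to column positions
grep -c "nonsynonymous SNV" exome_summary.csv    # count lines containing nonsynonymous SNVs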

...

Return to GVA2021