You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 6 Next »

Calling variants - a trivial use of an Interactive Session

We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel for this mode of computing. Almost everything you see here can easily be bundled up into batch scripts and run in that mode.

Introduction

Variant calling, at first glance, is pretty simple: Map sequence reads to an appropriate reference, emitting BAM files; Generate pileups and look for evidence of structural variation; Correct for false-positives due to the sequencing technology. Common file formats used in variant detection are:

  • BAM files containing alignments against a reference genome
  • Reference FASTA files containing genome sequence
  • VCF files to represent SNPs and small indels
  • BED files for specifying regions of the genome

Setting up

Go to the Terminal shell window in which you have launched an idev session:

  1. mkdir $WORK/variants && cd $WORK/variants
  2. Copy your own BWA alignment file into place: cp $WORK/bwa-align/hs37d5_allseqs_bwa.bam .
  3. Load up the modules we will need for this session
    1. module load samtools && export PATH=$PATH:$TACC_SAMTOOLS_DIR/bcftools

Now, we're ready for some variant-hunting!

Alignment statistics

Before diving into interpretation of bioinformatics results, it's nice to get some summary statistics. SAMtools has a tool 'flagstat' that makes it easy to do this for BAM files. Run samtools flagstat hs37d5_allseqs_bwa.bam and you should get:

4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992174 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
39866 + 0 properly paired (0.88%:nan%)
3636642 + 0 with itself and mate mapped
355532 + 0 singletons (7.82%:nan%)
45710 + 0 with mate mapped to a different chr
16050 + 0 with mate mapped to a different chr (mapQ>=5)

As an aside, the reads whose mates map to alternate chromosomes may be revealing structural rearrangement. Most likely not, but among these reads is where you would look for evidence of such a thing.

Basic variant calling

Variant calling is basically a three-step process:

  1. First, samtools mpileup command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.
  2. Next, bcftools with a few options added uses the prior probability distribution and the data to calculate an actual genotype for the variants detected.
  3. Finally, vcfutils.pl (or equivalent) is used to filter down the list of candidates according to some set of objective criteria.

Here's a basic set of commands to generate a BCF of genotypes.

First, be sure the BWA file is sorted and indexed

samtools sort hs37d5_allseqs_bwa.bam hs37d5_allseqs_bwa-sorted
samtools index hs37d5_allseqs_bwa-sorted.bam

Then, call variants and write to a VCF file

samtools mpileup -uf /corral-repl/utexas/BioITeam/tacc_ngs/human_variation/ref/hs37d5.fa \
   hs37d5_allseqs_bwa-sorted.bam \
   | bcftools view -vcg - > hs37d5_allseqs_bwa.raw.vcf

Open the 'hs37d5_allseqs_bwa.raw.vcf' file in a text editor and take a look around. Notice all the descriptive metadata in the comments (lines that start with the # character) - VCF is a very good format for bioinformatics!

Question: What version of the VCF standard is this file

The first line of the file is ##fileformat=VCFv4.1 and so this is a VCF 4.1 file

Find the variant in chromosome 20 at position 1592284 in the VCF file (It is probably the first variant record in the file).

If you are using 'less' to page through the VCF, you can hit the '?' key and type in 257836 and it will take you to the line where this polymorphism is found.

Question: What is the reference base for this SNP and what is the polymorhism?

The reference allele at this position is 'G' and the polymorphism is 'A'

Question: What is the read depth supporting this polymorphism?

Over in the INFO field, you will see a string DP=10. Looking up in the metadata at the top of the file, you see that DP is "Raw read depth". So, there are 14 reads supporting this polymorphism. Sounds like a potential winner!

Question: What is the quality of this SNP?

The QUAL field for this SNP is 39. VCF qualities are expressed on a PHRED scale, so this means there is a 1x10^-3.9^ chance that this SNP has been called in error. Pretty good, right?

Inspecting base-level alignments

So, you've identified a variant chr20:1592284, it's got good quality and read support, and you want to verify it for yourself. You can use any number of BAM browsers (IGV, SeqMonk, etc) but you can also do this right from the terminal using a command bundled with samtools called 'tview'.

Enter the following command

samtools tview hs37d5_allseqs_bwa-sorted.bam /corral-repl/utexas/BioITeam/tacc_ngs/human_variation/ref/hs37d5.fa

You should see a window that resembles the following screen shot.


 

Hit the '?' key to pull up help, then hit '?' again to dismiss help. Let's go to the SNP we examined before. Type 'n' to turn on color coding of nucleotides. Hit 'g' and you will be asked to enter a position in the genome, then enter 20:1592284 in this box and hit 'Return' and you will be transported to the location in reference genome. The base 1592284 will be the left-most column in the BAM browser.

  • No labels