Info |
---|
We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel for this mode of computing. Almost everything you see here can easily be bundled up into batch scripts and run in that mode. |
Variant calling, at first glance, is pretty simple: Map sequence reads to an appropriate reference, emitting BAM files; Generate pileups and look for evidence of structural variation; Correct for false-positives due to the sequencing technology. Common file formats used in variant detection are:
- BAM files containing alignments against a reference genome
- Reference FASTA files containing genome sequence
- VCF files to represent SNPs and small indels
- BED files for specifying regions of the genome
Go to the Terminal shell window in which you have launched an idev session:
- Set your BASE variable
export BASE=/scratch/01374/vaughn/tacc_ngs
mkdir $WORK/variants && cd $WORK/variants
- Copy your own BWA alignment file into place:
cp $WORK/bwa-align/hs37d5_allseqs_bwa.bam .
- Load up the modules we will need for this session
module load samtools && export PATH=$PATH:$TACC_SAMTOOLS_DIR/bcftools
Now, we're ready for some variant-hunting!
Before diving into interpretation of bioinformatics results, it's nice to get some summary statistics. SAMtools has a tool 'flagstat' that makes it easy to do this for BAM files. Run samtools flagstat hs37d5_allseqs_bwa.bam
and you should get:
Code Block |
---|
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992174 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
39866 + 0 properly paired (0.88%:nan%)
3636642 + 0 with itself and mate mapped
355532 + 0 singletons (7.82%:nan%)
45710 + 0 with mate mapped to a different chr
16050 + 0 with mate mapped to a different chr (mapQ>=5)
|
Info |
---|
As an aside, the reads whose mates map to alternate chromosomes may be revealing structural rearrangement. Most likely not, but among these reads is where you would look for evidence of such a thing. |
Variant calling is basically a three-step process:
- First,
samtools mpileup
command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype. - Next,
bcftools
with a few options added uses the prior probability distribution and the data to calculate an actual genotype for the variants detected. - Finally,
vcfutils.pl
(or equivalent) is used to filter down the list of candidates according to some set of objective criteria.
Here's a basic set of commands to generate a BCF of genotypes.
First, be sure the BWA file is sorted and indexed
Code Block |
---|
samtools sort hs37d5_allseqs_bwa.bam hs37d5_allseqs_bwa-sorted
samtools index hs37d5_allseqs_bwa-sorted.bam
|
Then, call variants and write to a VCF file
Code Block |
---|
samtools mpileup -uf /corral-repl/utexas/BioITeam/tacc_ngs/human_variation/ref/hs37d5.fa \
hs37d5_allseqs_bwa-sorted.bam \
| bcftools view -vcg - > hs37d5_allseqs_bwa.raw.vcf
|
Open the 'hs37d5_allseqs_bwa.raw.vcf' file in a text editor and take a look around. Notice all the descriptive metadata in the comments (lines that start with the # character) - VCF is a very good format for bioinformatics!
Question: What version of the VCF standard is this file
Expand |
---|
|
The first line of the file is ##fileformat=VCFv4.1 and so this is a VCF 4.1 file |
Find the variant in chromosome 20 at position 1592284 in the VCF file (It is probably the first variant record in the file).
Expand |
---|
|
If you are using 'less' to page through the VCF, you can hit the '?' key and type in 257836 and it will take you to the line where this polymorphism is found. |
Question: What is the reference base for this SNP and what is the polymorhism?
Expand |
---|
|
The reference allele at this position is 'G' and the polymorphism is 'A' |
Question: What is the read depth supporting this polymorphism?
Expand |
---|
|
Over in the INFO field, you will see a string DP=10 . Looking up in the metadata at the top of the file, you see that DP is "Raw read depth". So, there are 10 reads supporting this polymorphism. Sounds like a potential winner! |
Question: What is the quality of this SNP?
Expand |
---|
|
The QUAL field for this SNP is 39. VCF qualities are expressed on a PHRED scale, so this means there is a 1x10^-3.9^ chance that this SNP has been called in error. Pretty good, right? |
So, you've identified a variant chr20:1592284
, it's got good quality and read support, and you want to verify it for yourself. You can use any number of BAM browsers (IGV, SeqMonk, etc) but you can also do this right from the terminal using a command bundled with samtools called 'tview'.
Enter the following command
Code Block |
---|
samtools tview hs37d5_allseqs_bwa-sorted.bam /corral-repl/utexas/BioITeam/tacc_ngs/human_variation/ref/hs37d5.fa
|
You should see a window that resembles the following screen shot.
Expand |
---|
| Browser View 1 |
---|
| Browser View 1 |
---|
|
|
Hit the '?' key to pull up help, then hit '?' again to dismiss help. Let's go to the SNP we examined before. Type 'n' to turn on color coding of nucleotides. Hit 'g' and you will be asked to enter a position in the genome, then enter 20:1592284
in this box and hit 'Return' and you will be transported to the location in reference genome. The base 1592284 will be the left-most column in the BAM browser.
Expand |
---|
| Browser View 2 |
---|
| Browser View 2 |
---|
|
|
The vcfutils.pl
script, bundled with bcftools inside SAMTools can provide useful stats and can perform some filtering of VCFs by specific criteria.
Code Block |
---|
|
$TACC_SAMTOOLS_DIR/bcftools/vcfutils.pl qstats hs37d5_allseqs_bwa.raw.vcf
|
Expand |
---|
|
Code Block |
---|
QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel
91 30 30 7 0 0.3043 0.0000 0.0000 0.3043
74 60 60 15 0 0.3333 0.0000 0.0000 0.3636
62 91 91 21 0 0.3000 0.0000 0.0000 0.2400
54.5 122 122 30 0 0.3261 0.0000 0.0000 0.4091
49.1 153 153 41 0 0.3661 0.0000 0.0000 0.5500
45.3 183 183 45 0 0.3261 0.0000 0.0000 0.1538
40 225 225 60 0 0.3636 0.0000 0.0000 0.5556
37 257 257 70 0 0.3743 0.0000 0.0000 0.4545
34 295 295 78 0 0.3594 0.0000 0.0000 0.2667
31 329 329 90 0 0.3766 0.0000 0.0000 0.5455
28.5 359 359 98 0 0.3755 0.0000 0.0000 0.3636
26 396 396 109 0 0.3798 0.0000 0.0000 0.4231
24 427 427 116 0 0.3730 0.0000 0.0000 0.2917
22 459 459 131 0 0.3994 0.0000 0.0000 0.8824
21.8 555 555 166 0 0.4267 0.0000 0.0000 0.5738
20.6 586 586 179 0 0.4398 0.0000 0.0000 0.7222
19.8 621 621 187 0 0.4309 0.0000 0.0000 0.2963
18.8 655 655 197 0 0.4301 0.0000 0.0000 0.4167
17.8 694 694 213 0 0.4428 0.0000 0.0000 0.6957
16.9 736 736 227 0 0.4460 0.0000 0.0000 0.5000
15.9 766 766 242 0 0.4618 0.0000 0.0000 1.0000
14.9 804 804 252 0 0.4565 0.0000 0.0000 0.3571
13.9 850 850 271 0 0.4680 0.0000 0.0000 0.7037
13 890 890 283 0 0.4662 0.0000 0.0000 0.4286
12 945 945 302 0 0.4697 0.0000 0.0000 0.5278
11.1 996 996 320 0 0.4734 0.0000 0.0000 0.5455
10.2 1048 1048 343 0 0.4865 0.0000 0.0000 0.7931
9.31 1090 1090 358 0 0.4891 0.0000 0.0000 0.5556
8.44 1129 1129 367 0 0.4816 0.0000 0.0000 0.3000
7.59 1176 1176 385 0 0.4867 0.0000 0.0000 0.6207
6.79 1235 1235 409 0 0.4952 0.0000 0.0000 0.6857
6.02 1282 1282 429 0 0.5029 0.0000 0.0000 0.7407
5.29 1328 1328 446 0 0.5057 0.0000 0.0000 0.5862
4.61 1377 1377 464 0 0.5082 0.0000 0.0000 0.5806
3.98 1413 1413 480 0 0.5145 0.0000 0.0000 0.8000
3.41 1455 1455 495 0 0.5156 0.0000 0.0000 0.5556
3.01 1460 1460 495 0 0.5130 0.0000 0.0000 0.0000 |
|
Question: How many INDELs were identifed in your VCF file?
Expand |
---|
|
The 'grep' command may help you figure this out. |
Expand |
---|
|
Code Block |
---|
grep -c "INDEL" hs37d5_allseqs_bwa.raw.vcf
48
|
So there are 48 indels that were identifiable in this data set. |
Why filter?
With any variant caller you use, there will be an inherent trade-off between sensitivity and specificity. Typically, you would carry forward as much data as practical at each processing step, deferring final judgement until later so you have more information. For example, you might not want to exclude low coverage regions in one sample from your raw data because you may be able to infer a genotype based on family information later. This typically leads to NGS pipelines that maximize sensitivity, leading to a substantial number of false positives. Filtering after variant calling can be useful to eliminate false positives based on all the data available after numerous analyses. In the samtools/bcftools world, the vcfutils.pl
script provides a means to filter SNPs on many criteria.
Now, you will explore some filter settings for vcfutils.pl varFilter
to see how many SNPs get filtered out, using the linux tool xargs
to do a parameter sweep.
Code Block |
---|
title | Filtering, counting, and parameter iteration |
---|
|
# First - create a tiny shell script to run vcfutils, accepting a single parameter:
echo "#\!/bin/bash" > tiny.sh
echo "echo \"Sweeping with vcfutils.pl, min read depth of: \$1\" " >> tiny.sh
echo "$TACC_SAMTOOLS_DIR/bcftools/vcfutils.pl varFilter -Q 20 -d \$1 hs37d5_allseqs_bwa.raw.vcf | grep -v '^#' | wc -l " >> tiny.sh
# Now make it executable
chmod +x tiny.sh
# Use xargs to do a sweep of read depths
echo 2 3 4 5 6 7 | xargs -n 1 tiny.sh
|
Expand |
---|
|
Code Block |
---|
Sweeping with vcfutils.pl, min read depth of: 2
1487
Sweeping with vcfutils.pl, min read depth of: 3
996
Sweeping with vcfutils.pl, min read depth of: 4
696
Sweeping with vcfutils.pl, min read depth of: 5
548
Sweeping with vcfutils.pl, min read depth of: 6
388
Sweeping with vcfutils.pl, min read depth of: 7
316
|
This may be obvious, but there are fewer SNPs returned the higher read depth you require to support them. |
Homework: Try to update tiny.sh so that sweeps through Quality rather than read depth
You can use bedtools and some Linux built-in commands to compare your vcf file to one generated by BioITeam staff using some other data and means (you will need module load bedtools
).
Code Block |
---|
Your file: hs37d5_allseqs_bwa.raw.vcf
BioITeam staff VCF file: /corral-repl/utexas/BioITeam/tacc_ngs/human_variation/all.samtools.vcf
|
Expand |
---|
|
- Investigate
grep -c - Try
intersectBed and wc
|
*Question: How many SNPs are in your VCF file and the staff-generated VCF file?
Expand |
---|
|
Code Block |
---|
grep -c -v "^#" hs37d5_allseqs_bwa.raw.vcf
1507
grep -c -v "^#" /corral-repl/utexas/BioITeam/tacc_ngs/human_variation/all.samtools.vcf
80972 |
|
Question: How many SNPs are in common between the two files?
Expand |
---|
|
Code Block |
---|
intersectBed -a hs37d5_allseqs_bwa.raw.vcf -b /corral-repl/utexas/BioITeam/tacc_ngs/human_variation/all.samtools.vcf | wc -l
88
|
|
Question: What are the average and maximum quality values for the SNPs that are in your VCF file?
Expand |
---|
|
Code Block |
---|
grep -v '^#' hs37d5_allseqs_bwa.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
Average qual: 22.85 Max qual: 113
|
There's probably no way you knew how to do this! Please check out the power of awk before you write YAPPS (Yet Another Perl or Python Script) - it's a lifesaver, and is named after a cute bird. |