The utility of samtools mpileup

The samtools mpileup command can form the basis of a basic genotyper directly.

 

The command itself essentially transposes a bam file.  The rows of a sam/bam file are oriented around reads and give you very little context of the reference.  samtools mpileup rows are oriented around genome coordinates with information about all reads (base-pairs really) underlying that position.

 

Start with the basic command and see what it does on one of the BAM files (head gives us just the first 10 lines):

samtools mpileup NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | head

You'll see that the second to last field is what we're after.  The period "." and comma "," marks indicate that a read was present matching the reference - "." if the read is on the forward strand, "," if the read is on the reverse strand.  A/a, C/c, G/g, or T/t will be printed when a read contains that base (upper case = forward strand, lower case = reverse strand).  There are special characters indicating the start and end of reads.  You can also get the base quality scores out if you'd like too (last column).  Here is the full format specification.

 

Now add information about the reference and notice that you now have the reference base:

samtools mpileup -f ref/hs37d5.fa NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | head

 

You see from these first few lines that there aren't many base-pairs at each location.  That's because the read coverage is very low there.  Let's add a filter using awk that requires there to be at least 5 reads over a location.  This is easy because the fourth column contains the read depth at that location:

samtools mpileup -f ref/hs37d5.fa NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | tr /acgt/ /ACGT/ | awk '{if ($4>5) {print $2"\t"$3"\t"$5}}' | head

For good measure, I've also capitalized all the base-pairs (I don't care about strand information for the moment) and I'm only printing out the chromosome 20 location ($2), the reference base ($3), and the pileup of bases at that location ($5).

 

Now, I'd like to only look at lines that have variants.  I'd like to skip spurious sequencing errors, so I'll add one more awk command that counts the number of alternate base-pairs at a location and only prints out the line of the total alternate base-pairs are at least 40% of the read depth at that location (I've also added linux line continuation characters ("\") to separate this one command across many lines so you can read it more clearly:

samtools mpileup -f ref/hs37d5.fa NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | tr /acgt/ /ACGT/ | \
   awk '{if ($4>5) {print $2"\t"$3"\t"$5}}' | \
   awk '{m=0; \
      for (i=1;i<=length($3);i++) {\
         char=substr($3,i,1); \
         if ($2!=char&&(char=="A"||char=="C"||char=="G"||char=="T")) {
            m++\
 }\
 } \
      if (m>0.4*length($3)) {print $0"\t"m} \
   }'  | head

 

Now, redirect that to a file and check to see how concordant it is with your samtools or GATK output!

# Some reformatting so comm will work well:
cat all.samtools.vcf | awk '{print $2"AAAAAA"}' | sort > all.samtools.gt
cat all.GATK.vcf | awk '{print $2"AAAAAA"}' | sort > all.gatk.gt 
cat NA*.gt | awk '{print $1"AAAAAA"}' | sort | uniq > all.awk.gt
 
# Add some awk to tally things up:
comm all.samtools.gt all.gatk.gt | awk 'BEGIN {print "Only in samtools\tOnly in GATK\tIn both"; FS="\t"} {i[NF]++} END {print i[1]"\t"i[2]"\t"i[3]}'
comm all.awk.gt all.gatk.gt | awk 'BEGIN {print "Only in AWK calls\tOnly in GATK\tIn both"; FS="\t"} {i[NF]++} END {print i[1]"\t"i[2]"\t"i[3]}'
comm all.awk.gt all.samtools.gt | awk 'BEGIN {print "Only in AWK calls\tOnly in samtools\tIn both"; FS="\t"} {i[NF]++} END {print i[1]"\t"i[2]"\t"i[3]}'

 

Exercise left to the reader - what's the effect of variant quality score to this concordance question?

 

 

  • No labels