...

This Linux one-liner should give you a snapshot of data sufficient to figure it out:
cat trios_tutorial.all.samtools.vcf | tail -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' | tail -100 | sort | uniq -c | sort -n -r
Explanation of command

Here are the steps going into this command:

  1. cat trios_tutorial.all.samtools.vcf |
    1. Dump the contents of trios_tutorial.all.samtools.vcf and pipe it to the next command
  2. tail -10000 |
    1. Take the last 10,000 lines and pipe them to the next command. As the top of the file has header information, the last lines are all data
  3. awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | 
    1. If the variant quality score (the 6th column, or $6) is greater than 500, print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes), and pipe them to the next command
  4. grep "0/0" |
    1. Keep only lines in which at least one sample has the homozygous-reference genotype (0/0), and pipe them to the next command
    2. Think about genetics and why this is important. If you aren't sure, ask us.
  5. sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' |
    1. Break the genotype call apart from other information about depth: "sed" turns the colons into tabs so that awk can print just the genotype fields, which are then piped to the next command
  6. tail -100 | sort | uniq -c | sort -n -r
    1. Take the last 100 lines. 100 is used to ensure we get some good informative counts, but not so many that noise becomes a significant problem.
    2. Sort them,
    3. then count the unique lines,
    4. then sort them again by count, in reverse numeric order, so the most frequent combinations come first
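The steps above can also be sketched as a small shell function. This is only an illustration, not part of the original solution: the name genotype_triples is hypothetical, it skips header lines with grep -v '^#' and reads the whole file instead of taking the last 10,000 lines, and it assumes the genotype (GT) is the first colon-separated subfield of each sample column, so it does not depend on how many other subfields follow.

```shell
# Hypothetical helper (not from the tutorial): count genotype triples
# among the last 100 high-quality variant lines of a three-sample VCF.
# Usage: genotype_triples FILE [MIN_QUAL]
genotype_triples() {
    # Drop header lines (they start with "#") instead of relying on tail.
    grep -v '^#' "$1" |
    awk -F'\t' -v q="${2:-500}" '
        $6 > q {
            # Assumption: GT is the first subfield of each sample column.
            split($10, a, ":"); split($11, b, ":"); split($12, c, ":")
            print a[1] "\t" b[1] "\t" c[1]
        }' |
    grep "0/0" |                  # keep lines with at least one 0/0 genotype
    tail -n 100 |                 # last 100 qualifying lines
    sort | uniq -c | sort -n -r   # count identical triples, most frequent first
}
```

It would be called as: genotype_triples trios_tutorial.all.samtools.vcf 500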
Example output of sample solution:
     12 0/0	0/1	0/0
      5 0/0	0/1	0/1
      3 0/1	0/0	0/0
      4 0/1	0/0	0/1
      8 0/1	0/0	1/1
     43 0/1	0/1	0/0
     24 0/1	1/1	0/0
      1 1/1	0/1	0/0

     34 0/1	0/1	0/0
     20 0/1	0/0	0/1
     20 0/0	0/1	0/0
     14 0/1	1/1	0/0
      6 0/0	0/1	0/1
      4 1/1	1/1	0/0
      1 1/1	0/1	0/0
      1 0/1	0/0	0/0
Discussion of the output

Here is my interpretation of the data:

1) Because only the last 100 qualifying lines are counted, this method effectively looks at a very narrow genomic region, probably within a single recombination block.

2) The most telling data: where the two parents are homozygous for different alleles (0/0 and 1/1), the child must be heterozygous (0/1).

3) So all this data is consistent with column 1 (NA12878) being the child:

     12 0/0	0/1	0/0
      5 0/0	0/1	0/1
      4 0/1	0/0	0/1
      8 0/1	0/0	1/1
     43 0/1	0/1	0/0
     24 0/1	1/1	0/0

"Outlier" data are:

      3 0/1	0/0	0/0
      1 1/1	0/1	0/0
 

This is, in fact, the correct assessment - NA12878 is the child.
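
The reasoning above can be checked mechanically. The following is a minimal sketch, not part of the original solution: mendel_consistency is a hypothetical helper that reads lines of the form "count genotype1 genotype2 genotype3" (the uniq -c format) and, for each column in turn, adds up the counts of sites where that column's genotype could have been inherited from the other two. It assumes unphased diploid genotypes written as "a/b" and does not treat missing calls specially.

```shell
# Hypothetical helper (not from the tutorial): for each of the three
# genotype columns, count the sites consistent with that column being
# the child of the other two. Reads "count g1 g2 g3" lines on stdin.
mendel_consistency() {
    awk '
        # True if allele x occurs in genotype g (e.g. "0/1").
        function has(g, x,    p) {
            split(g, p, "/")
            return (p[1] == x || p[2] == x)
        }
        # True if child c can take one allele from m and the other from f.
        function ok(c, m, f,    a) {
            split(c, a, "/")
            return (has(m, a[1]) && has(f, a[2])) || (has(f, a[1]) && has(m, a[2]))
        }
        NF == 4 {
            total += $1
            if (ok($2, $3, $4)) fit[1] += $1
            if (ok($3, $2, $4)) fit[2] += $1
            if (ok($4, $2, $3)) fit[3] += $1
        }
        END {
            for (i = 1; i <= 3; i++)
                printf "column %d as child: %d of %d sites consistent\n", i, fit[i], total
        }'
}
```

Piping the counted genotype lines into mendel_consistency prints one line per column; the column with the most consistent sites is the best candidate for the child.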

...