
Overview:

Identification of variants in mixed population sequencing data uses the same principles as their identification in clonal (homogeneous) sources of DNA. The difference is that, rather than just listing which variants are present, the number of reads supporting each variant must be counted and compared to the reads supporting all other variants at the same location to determine the frequency of that variant. Under clonal sequencing conditions, sequencing errors can safely and effectively be ignored. With mixed population sequencing, by contrast, sequencing errors set the lower limit of detection and therefore bound the potential accuracy of the experiment.
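
As a rough illustration of the counting step described above (this is a sketch of the arithmetic only, not how breseq computes frequencies internally), the frequency of a variant is the number of reads supporting it divided by the total number of reads covering that position. The read counts and the helper name variant_frequency below are hypothetical.

```shell
# Print SUPPORTING/TOTAL to four decimal places, or NA if there is no coverage.
# variant_frequency is a hypothetical helper for illustration only.
variant_frequency() {
    awk -v s="$1" -v t="$2" 'BEGIN { if (t > 0) printf "%.4f\n", s / t; else print "NA" }'
}

# Made-up read counts at a single position: reference base plus two variant bases.
ref_reads=180
alt_a_reads=15
alt_b_reads=5
total=$((ref_reads + alt_a_reads + alt_b_reads))

variant_frequency "$alt_a_reads" "$total"   # frequency of the first variant
variant_frequency "$alt_b_reads" "$total"   # frequency of the second variant
```

A variant whose frequency is close to the per-base sequencing error rate cannot be distinguished from error by counting alone, which is why the statistical tests discussed later matter.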

Here we will demonstrate effective use of breseq to identify variants in mixed population data and gain insight into some of the error correction breseq provides.

Learning Objectives:

  1. Identify variants in mixed population sequencing data.
  2. Understand the sources of false positive and false negative variants.
  3. Leverage knowledge of false positive errors to eliminate these types of errors.

 

Tutorial:

The optional tutorial from day 2 (Advanced variant calling tutorial (GVA14)) detailed the use of the breseq pipeline to call variants on clonal samples from an evolving E. coli population. While this tutorial does not require completion of the previous tutorial, many of the finer points of breseq are better covered there. This tutorial will focus on the use of the polymorphism mode of breseq to identify variants from a mixed population.

 

All fastq files necessary for this tutorial can be found inside the $BI/gva_course/mixed_population folder. Copy all fastq files to a new folder named fastq, and REL606.6.gbk to a new folder named reference.

mkdir fastq
mkdir reference
cp $BI/gva_course/mixed_population/*.fastq fastq
cp $BI/gva_course/mixed_population/REL606.6.gbk reference

Adding the -p flag to any breseq command turns on polymorphism mode, in which breseq by default performs several statistical tests to rule out false positives. To highlight what breseq normally does by default, we will run the same fastq files with and without several of these tests. Specifically, base quality scores, polymorphism scores, polymorphism bias, and minimum strand coverage will be ignored. All 4 of these arguments can be found in the breseq -h output, and their values should be set to 0. While running breseq in polymorphism mode is fairly simple, the command becomes complex once all the additional options are turned off, so it is recommended that you copy and paste these commands into a commands file or an idev session.

Be sure to make a new folder named Logs or these commands will fail.
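
A minimal way to create that folder, assuming you run it from the same directory in which you will run the breseq commands (the -p option keeps mkdir from complaining if the folder already exists):

```shell
# Create the folder that will receive the redirected breseq log files.
mkdir -p Logs
```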

breseq -j 6 -p -o breseq_output/with_stats -r reference/REL606.6.gbk fastq/REL964_TACAGCA_L003_R1_002.fastq fastq/REL964_TACAGCA_L003_R2_002.fastq >& Logs/with_stats.log.txt
breseq -j 6 -p -b 0 --polymorphism-score-cutoff 0 --polymorphism-bias-cutoff 0 --polymorphism-minimum-coverage-each-strand 0 -o breseq_output/without_stats -r reference/REL606.6.gbk fastq/REL964_TACAGCA_L003_R1_002.fastq fastq/REL964_TACAGCA_L003_R2_002.fastq >& Logs/without_stats.log.txt

These commands will each take ~25-30 minutes to finish running. If running on an idev node, a single ampersand "&" can be added to the end of the line so the command runs in the background, giving you your prompt back so you can start the other command. If you have already started the command before reading this, a useful trick on Linux systems for moving a running process to the background is the following:

Ctrl + z
jobs
%jobnumber & 
(In your case, you should only have 1 job running, so you would type "%1 &" for the previous line. In bash this is equivalent to "bg %1".)
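
If you would like to try the trailing-ampersand pattern before committing to a 30 minute run, here is a small sketch that uses sleep as a stand-in for breseq. The log file name background_demo.log.txt and the variable names are made up for this example.

```shell
# Make sure the log folder exists so the redirect below cannot fail.
mkdir -p Logs

# Start a short stand-in command in the background; the trailing "&" returns the prompt immediately.
sleep 1 > Logs/background_demo.log.txt 2>&1 &
demo_pid=$!

# List background jobs started from this shell.
jobs

# Block until the background job finishes, then record its exit status.
wait "$demo_pid"
demo_status=$?
echo "background job finished with exit status $demo_status"
```

An exit status of 0 means the command completed without reporting an error. The same wait-and-check step can be used after starting a real breseq run in the background.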

While we wait for these 2 runs to complete, we will go over the source of some of the errors that pop up in this type of analysis, and things that can be done to try to correct for them...
