Introduction

breseq is a tool developed by the Barrick lab for analyzing genome re-sequencing data for bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. One might then want to construct a phylogenetic tree of individual samples from a single population, or determine whether the same gene is mutated in many independent evolution experiments in an environment.

Input data / expectations:

  • Haploid reference genome
  • Relatively small (<20 Mb) reference genome
  • Input FASTQ reads can be from any sequencing technology
  • Average genomic coverage > 30-fold
  • Less than ~1,000 mutations expected
  • Detects SNVs and SVs from single-end reads (does not use paired-end distance information)
  • Produces annotated HTML output

You can learn a great deal more about breseq by reading the Online Documentation.

Here is a rough outline of the workflow in breseq with proposed additions.

This tutorial was reformatted from the most recent version found here. Our thanks to the previous instructors.

Objectives:

  • Use a self-contained, automated pipeline to identify mutations.
  • Explain the types of mutations found in a complete manner before using methods better suited for higher-order organisms.

 

Bacteriophage lambda data set

First, we'll run breseq on a small data set to be sure that it is installed correctly and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in the lab with its E. coli hosts.

Environment

In order to run breseq, we need to make sure breseq was made available to you when we set up your .bashrc file on the first day.

Check that you have access to breseq
tacc:~$ which breseq
/corral-repl/utexas/BioITeam/breseq/bin/breseq
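If which prints nothing, breseq is not on your PATH. The check can also be scripted. The following is a minimal sketch; the function name require_cmd is hypothetical and is not part of breseq or the TACC setup.

```shell
# Hypothetical helper: succeed only if the named program is on the PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1 (check your .bashrc and PATH)" >&2
        return 1
    fi
}

# Demo with a program every system has; on TACC you would pass breseq instead.
require_cmd sh
```

Calling require_cmd breseq at the top of a script lets the script stop early with a readable message instead of failing partway through.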

breseq should now run using the breseq command. Running breseq by itself will show you what arguments the command expects. Not all programs print their usage when you type just the program name. Some require the command name followed by -h, --help, or ?, while others require preceding the command name with "man" (short for manual). If all of that fails, Google is your friend for every program not named "R". For R, Google is still your best bet, but it won't be your friend; adding the word "stat" somewhere in the search will greatly help.

Data

The data files for this tutorial are located in the following directory:

/corral-repl/utexas/BioITeam/ngs_course/lambda_mixed_pop/data/

Copy the contents of this directory to a new directory called BDIB_breseq_lambda_mixed_pop in your scratch directory.

Click here for the solution
mkdir $SCRATCH/BDIB_breseq_lambda_mixed_pop
cp /corral-repl/utexas/BioITeam/ngs_course/lambda_mixed_pop/data/* $SCRATCH/BDIB_breseq_lambda_mixed_pop
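If you expect to repeat this step, the two commands can be wrapped so that rerunning them does not fail when the directory already exists. This is only a sketch: the function name copy_data is hypothetical, and the demo uses throwaway directories with empty placeholder files so it runs anywhere.

```shell
# Hypothetical helper: copy every file in SRC into DEST, creating DEST if needed.
copy_data() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1     # -p: no error if the directory already exists
    cp "$src"/* "$dest"/ || return 1
    ls "$dest"                       # show what arrived
}

# Demo on throwaway directories with empty placeholder files.
demo=$(mktemp -d)
mkdir "$demo/data"
touch "$demo/data/lambda.gbk" "$demo/data/lambda_mixed_population.fastq"
copy_data "$demo/data" "$demo/BDIB_breseq_lambda_mixed_pop"
```

On TACC the equivalent call would pass the data directory above as the first argument and $SCRATCH/BDIB_breseq_lambda_mixed_pop as the second.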

Now use the ls command to see what files were copied:

File Name                        Description                        Sample
lambda_mixed_population.fastq    Single-end Illumina 36-bp reads    Evolved lambda bacteriophage mixed population genome sequencing
lambda.gbk                       Reference Genome                   Bacteriophage lambda

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes, but it is computationally intense enough that it should not be run on the head node. By now this should be somewhat familiar, but in case it's not, expand the following.

Idev command
# if running on Tuesday:
idev  -m 120 -r CCBB_5.23.17PM -A UT-2015-05-18

# if running on Wednesday:
idev  -m 120 -r CCBB_5.24.17PM -A UT-2015-05-18

breseq command
cd $SCRATCH/BDIB_breseq_lambda_mixed_pop
module load intel
module load Rstats
breseq -j 48 -r lambda.gbk lambda_mixed_population.fastq &> log.txt &

A bunch of progress messages stream by during the breseq run; they would clutter the screen if not for the redirection to the log.txt file. The & at the end of the line tells the system to run the preceding command in the background, which lets you keep typing and executing other commands while breseq runs. The output text details several steps in a pipeline that combines mapping (with SSAHA2 in early breseq releases; current releases use Bowtie 2), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail log.txt. While breseq is running, let's look at what the different parts of the command are actually doing:

part                             purpose
-j 48                            Use 48 processors (the max available on lonestar5 nodes)
-r lambda.gbk                    Use the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastq    breseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping
&> log.txt                       Redirect the output and error to the file log.txt
&                                Run the preceding command in the background
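The redirection and background pieces can be tried without breseq. In the sketch below, fake_job is a made-up stand-in for a long-running program, and the portable form "> file 2>&1" is used in place of bash's "&> file" shorthand; both send standard output and standard error to the same file.

```shell
# Stand-in for a long-running program (NOT breseq): writes to stdout and stderr.
fake_job() {
    echo "step 1 (stdout)"
    echo "warning (stderr)" >&2
    sleep 1
    echo "done (stdout)"
}

log=$(mktemp)
# "> file 2>&1" is the portable spelling of bash's "&> file";
# the trailing & puts the job in the background.
fake_job > "$log" 2>&1 &
job_pid=$!

echo "the prompt is free while job $job_pid runs"
wait "$job_pid"      # block until the background job exits
tail -n 3 "$log"     # show the last lines of the captured output
```

The wait call is only needed in scripts. At an interactive prompt you would keep working and use tail on the log whenever you want to check progress.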

This will finish very quickly (likely before you begin reading this) with a final line of "Creating index HTML table...". Check this using the tail command.
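The same check can be scripted. The sketch below assumes, as stated above, that the final log line contains "Creating index HTML table"; the function name run_finished is hypothetical, and the first line of the demo log is a made-up placeholder rather than real breseq output.

```shell
# Hypothetical helper: succeed if the last line of a log mentions the final step.
run_finished() {
    tail -n 1 "$1" | grep -q "Creating index HTML table"
}

# Demo on a made-up log; on TACC you would pass log.txt instead.
demo_log=$(mktemp)
printf 'placeholder earlier line\nCreating index HTML table...\n' > "$demo_log"

if run_finished "$demo_log"; then
    echo "breseq run looks complete"
else
    echo "still running (or it stopped early)"
fi
```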

Looking at breseq predictions

breseq produced a lot of directories with names beginning 01_sequence_conversion, 02_reference_alignment, ... Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner workings. More importantly, breseq also produces two directories called data and output, which contain the files used to create the .html output files and the .html output files themselves, respectively. The most interesting files are the .html files, which can't be viewed directly on lonestar. Therefore we first need to copy the output directory back to your desktop computer. Go back to the first tutorial (BDIB_breseq_tutorial_1) and transfer the contents of the output directory back to your local computer.

 

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Command to type in TACC
cd $SCRATCH/BDIB_breseq_lambda_mixed_pop
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip (gzip), Verbose, File (the archive file name follows)
pwd
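To avoid retyping the path, TACC can print the whole scp command for you. This is a sketch only: print_scp_cmd is a hypothetical helper, your_username is a placeholder, and the host name is the one used in this tutorial.

```shell
# Hypothetical helper: print an scp command for a file in the current directory.
# Arguments: TACC username, file name.
print_scp_cmd() {
    echo "scp $1@ls5.tacc.utexas.edu:$(pwd)/$2 ."
}

# "your_username" is a placeholder; substitute your own TACC login.
print_scp_cmd your_username output.tar.gz
```

Run it on TACC in the directory that holds the archive, then paste the printed line into the terminal on your desktop.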

You can then copy and paste that path (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window
scp -r <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
 
# Enter your password and Token number and wait for the file transfer to complete
 
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 
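The create and extract steps can be rehearsed locally before involving scp. The sketch below builds a throwaway output directory with a placeholder index.html, archives it, lists the archive, and unpacks it elsewhere. The v (verbose) option is left out to keep the output short.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir output
echo "<html>demo</html>" > output/index.html   # placeholder, not real breseq output

tar -czf output.tar.gz output        # c = create, z = gzip, f = archive file name
tar -tzf output.tar.gz               # t = list contents without extracting

mkdir unpacked
tar -xzf output.tar.gz -C unpacked   # x = extract, -C = extract into this directory
ls unpacked/output
```

Listing with -t before transferring is a cheap way to confirm the archive holds what you expect.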

Navigate to the output directory in the Finder and open the file called index.html. This opens the results in a web browser window, where you can click through the different mutations and other information and see the evidence supporting each one. The summary page provides useful information about the percent of reads mapping to the genome as well as the overall coverage of the genome. The Mutation Predictions page is where most of the analysis time is spent, determining which mutations are important (and, more rarely, inaccurate).

Click around through the different mutations and examine their evidence to see what kinds of mutations you can identify. Interact with your instructors, and show us what different types of mutations you are able to identify, or ask us what mutations you don't understand. Additional information on analyzing the output can be found at the following reference:

  • Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseq. Methods Mol. Biol. 1151:165-188.

 

Examining breseq results

Exercise: Can you figure out how to archive all of the output directories and copy only those files (and not all of the very large intermediate files) back to your machine, without deleting any files?

You will want to use the tar command again, but you will need to use a wildcard to specify what goes into the compressed file, and only the output directories within each of the wildcard-matched directories.

click here to check your solution, or get the answer
tar -cvzf output.tgz output_*/output
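To see why the wildcard works, it can be tried on mock directories. In the sketch below the run names output_A and output_B and the file reads.tmp are made up; they only need to match the output_* pattern assumed by the solution.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Made-up run directories, each with a results folder and an intermediate folder.
for run in output_A output_B; do
    mkdir -p "$run/output" "$run/01_sequence_conversion"
    echo "results" > "$run/output/index.html"
    echo "large intermediate" > "$run/01_sequence_conversion/reads.tmp"
done

# The shell expands output_*/output before tar runs, so tar only sees those paths.
tar -czf output.tgz output_*/output
tar -tzf output.tgz                  # list what went into the archive
```

The intermediate directories stay on disk untouched; they are simply never handed to tar.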

Command to type in TACC
tar -czvf output.tar.gz output_*/output  # the czvf options in order mean Create, Zip (gzip), Verbose, File (the archive file name follows)
pwd

You can then copy and paste that path (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window
scp -r <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

 

Click around in the results and see the different types of mutations you can detect.
