Most approaches for predicting structural variants require paired-end or mate-pair reads. They use the distribution of distances separating the two reads in each pair to find outliers, and they also look for pairs with incorrect orientations.
- BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.
- PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.
- SVDetect - good instructions, relatively hefty configuration files.
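The outlier strategy these tools share can be illustrated with a minimal sketch. It is not the algorithm used by any of the programs listed above. It assumes each read pair is given as a hypothetical tuple `(start1, strand1, start2, strand2)`, takes the most common orientation in the data as the expected one, and flags distances more than `n_mads` median absolute deviations from the median (an arbitrary cutoff chosen for illustration).

```python
from statistics import median


def classify_pairs(pairs, n_mads=5.0):
    """Label each pair 'concordant', 'bad_distance', or 'bad_orientation'.

    pairs: iterable of (start1, strand1, start2, strand2), strands '+' or '-'.
    The input layout and the n_mads cutoff are assumptions of this sketch.
    """
    # Put the leftmost read first so distance and orientation are comparable.
    normalized = []
    for start1, strand1, start2, strand2 in pairs:
        if start2 < start1:
            start1, strand1, start2, strand2 = start2, strand2, start1, strand1
        normalized.append((start2 - start1, strand1 + strand2))

    # Treat the most frequent orientation as the expected one for this library.
    counts = {}
    for _, orientation in normalized:
        counts[orientation] = counts.get(orientation, 0) + 1
    expected = max(counts, key=counts.get)

    # Robust estimate of the typical distance and its spread.
    distances = [d for d, o in normalized if o == expected]
    center = median(distances)
    spread = median(abs(d - center) for d in distances) or 1

    labels = []
    for distance, orientation in normalized:
        if orientation != expected:
            labels.append("bad_orientation")
        elif abs(distance - center) > n_mads * spread:
            labels.append("bad_distance")
        else:
            labels.append("concordant")
    return labels
```

Real tools go further and cluster nearby discordant pairs that support the same breakpoint before reporting a variant, which this sketch does not attempt.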
Good discussion of some of the issues of predicting structural variation:
Navigate to the SVDetect project page
Try to download the code yourself onto TACC.
Move the Perl scripts and make them executable
Install required Perl modules
SVDetect requires a few Perl modules to be installed. In the default TACC environment, you can use the cpan shell to install most well-behaved Perl modules (with the exception of some complicated ones that require other libraries to be installed or things to compile). Here's how:
Here's an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype.
This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.
- Paired-end Illumina, first of mate-pair, FASTQ format: re-sequenced E. coli genome
- Paired-end Illumina, second of mate-pair, FASTQ format: re-sequenced E. coli genome
- Reference genome in FASTA format: E. coli B strain REL606
Map data using BWA
You should submit the bwa aln and bwa sampe commands as jobs to the queue, one after the other.
Possibly unfamiliar options:
-n 0 tells bwa sampe to report no alternative hits for properly paired mates.
-N 100 tells bwa sampe to report at most 100 possible matches for mates with abnormal distances or orientations.
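As a rough illustration of how the mapping steps fit together, the sketch below only builds the command lines and prints them; it does not run bwa. All file names are hypothetical placeholders, and the `bwa index` step is an assumed prerequisite that the text above does not mention. The `-f` option is used here to name the output file instead of redirecting stdout.

```python
import shlex


def bwa_mate_pair_commands(reference, fastq1, fastq2, prefix):
    """Return bwa command lines (as argument lists) for one mate-pair library.

    File names are placeholders chosen for this sketch; substitute your own.
    """
    sai1 = f"{prefix}_1.sai"
    sai2 = f"{prefix}_2.sai"
    return [
        # Assumed prerequisite: build the index for the reference once.
        ["bwa", "index", reference],
        # Align each end of the pairs separately.
        ["bwa", "aln", "-f", sai1, reference, fastq1],
        ["bwa", "aln", "-f", sai2, reference, fastq2],
        # Pair the alignments, using the -n 0 and -N 100 options described above.
        ["bwa", "sampe", "-n", "0", "-N", "100", "-f", f"{prefix}.sam",
         reference, sai1, sai2, fastq1, fastq2],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in bwa_mate_pair_commands("ref.fasta", "reads_1.fastq",
                                          "reads_2.fastq", "sample"):
        print(shlex.join(command))
```

Printing the commands rather than executing them makes it easy to paste each one into a job script for the queue.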
If you use bowtie to do your mapping, you won't predict any SVs. Why?
bowtie doesn't map discordant pairs!
The first step is to look at all mapped read pairs and whittle the list down to only those that have an unusual insert size (the distance between the two reads in a pair). You should submit this command to the TACC queue.
What is the normal insert size for this library? (Check stdout from the command.)
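If you want an independent sanity check on the number you find, the following sketch estimates a typical insert size directly from SAM records. It is not how SVDetect computes its statistics. It assumes standard SAM columns (flag in column 2, template length in column 9) and counts only the first read of each properly paired fragment (flag bits 0x2 and 0x40).

```python
from statistics import median

PROPER_PAIR = 0x2   # SAM flag bit: each segment properly aligned
FIRST_IN_PAIR = 0x40  # SAM flag bit: first segment in the template


def median_insert_size(sam_lines):
    """Return the median absolute template length of proper pairs, or None."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        template_length = int(fields[8])
        # Count each pair once, and only if the aligner called it proper.
        if flag & PROPER_PAIR and flag & FIRST_IN_PAIR and template_length != 0:
            sizes.append(abs(template_length))
    return median(sizes) if sizes else None
```

To use it on a real file, pass an open file handle, for example `median_insert_size(open("sample.sam"))`, where `sample.sam` is a placeholder name.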
SVDetect demonstrates a common strategy for programs with complex input: instead of taking a lot of options on the command line, it reads a simple text file that sets all of the required options.
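To make the idea concrete, here is a minimal sketch of how a program might read options from a text file. It assumes a layout of named blocks in angle brackets containing `key=value` lines; the block and key names in the usage example are made-up placeholders, not options from the SVDetect manual.

```python
def parse_block_config(text):
    """Parse '<section> key=value ... </section>' text into nested dicts.

    The file layout handled here is an assumption of this sketch.
    """
    config = {}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            section = None  # closing tag ends the current block
        elif line.startswith("<") and line.endswith(">"):
            section = line[1:-1].strip()
            config.setdefault(section, {})
        elif "=" in line and section is not None:
            key, value = line.split("=", 1)
            config[section][key.strip()] = value.strip()
    return config


if __name__ == "__main__":
    # Placeholder block and key names, for illustration only.
    example = """
    <example_block>
    example_input = my_pairs.sam   # hypothetical path
    example_threshold = 3
    </example_block>
    """
    print(parse_block_config(example))
```

Keeping options in a file like this also gives you a record of exactly how each analysis was run.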
Create a configuration file:
You'll need to substitute your own paths for
You also need to create a tab-delimited file of chromosome lengths.
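One way to produce such a file is to compute the sequence lengths from the reference FASTA. The sketch below does that and formats one tab-delimited row per sequence. The column order used here (numeric ID, name, length) is an assumption; check the SVDetect manual for the exact layout it expects.

```python
def fasta_lengths(lines):
    """Return {sequence_name: length} for FASTA text given as an iterable of lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def format_length_table(lengths):
    """Format rows as 'ID<TAB>name<TAB>length' (assumed column order)."""
    rows = [
        f"{index}\t{name}\t{length}"
        for index, (name, length) in enumerate(lengths.items(), start=1)
    ]
    return "\n".join(rows) + "\n"
```

For a single-chromosome genome such as the E. coli reference used here, the resulting file has just one row.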
You'll want to submit the first two commands to the TACC queue. They take a while.
Consult the manual for a full description of what these commands and options are doing.
Take a look at the resulting file:
We've highlighted a few lines below:
Any idea what sorts of mutations produced these three structural variants?
1. This is a tandem head-to-tail duplication of the region from approximately 600000 to 663000.
2. This is just the origin of the circular chromosome, connecting its end to the beginning!
3. This is a big chromosomal inversion mediated by recombination between repeated IS elements in the genome. It would not have been detected if the insert size of the library were not greater than about 1,500 bp!
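The second case arises because the reference is stored as a linear sequence, so a pair that spans the origin looks as if its reads are almost a whole genome apart. A minimal sketch of the correction, using a toy genome length rather than the real one:

```python
def circular_distance(pos1, pos2, genome_length):
    """Shortest distance between two positions on a circular chromosome."""
    linear = abs(pos2 - pos1)
    # Going around through the origin may be shorter than the linear path.
    return min(linear, genome_length - linear)


if __name__ == "__main__":
    toy_length = 1_000_000  # toy value, not the length of the real genome
    # Reads near opposite ends of the linear reference are close on the circle.
    print(circular_distance(999_000, 1_500, toy_length))
    # Reads in the middle of the reference are unaffected.
    print(circular_distance(100, 2_100, toy_length))
```

A pair whose circular distance matches the normal insert size is not evidence of a rearrangement, even though its linear distance is an extreme outlier.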
Very, Very Advanced Exercise
- SVDetect has a nice option to output a file that can be read by Circos to produce drawings.