Overview

Most approaches for predicting structural variants require you to have paired-end or mate-pair reads. They use the distribution of distances separating these reads to find outliers and also look at pairs with incorrect orientations. As mentioned during several of the presentations, many researchers choose to ignore these types of mutations and combined with the increased difficulty of accurately identifying them, the community is less settled on the "best" way to analyze them. Here we present a tutorial on a somewhat older program SVDetect. SVDetect is a type of program that makes use of configuration files rather than command line options (something you may encounter with other programs in your own work).

Other possible tools:

BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.
PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.

Good discussion of some of the issues of predicting structural variation:

http://www.genome.gov/Pages/Research/DER/1000GenomesProjectTutorials/StructuralVariants-JanKorbel.pdf

Comparison of many different SV tools

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1828-7

Learning objectives:

Identify structural variants in a new data set.
Work with a new type of program that uses configuration files rather than entering all information on a single command at the command line. This is very similar to the queue system TACC uses making this a good introduction.

Calling SV with SVdetect:

Here we'll look an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype involving citrate, something the Barrick lab has studied.

Prepare your directories

cds
cp -r $BI/gva_course/structural_variation/data GVA_sv_tutorial
cd GVA_sv_tutorial

This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.

File Name	Description	Sample
`61FTVAAXX_2_1.fastq`	Paired-end Illumina, First of mate-pair, FASTQ format	Re-sequenced E. coli genome
`61FTVAAXX_2_2.fastq`	Paired-end Illumina, Second of mate-pair, FASTQ format	Re-sequenced E. coli genome
`NC_012967.1.fasta`	Reference Genome in FASTA format	E. coli B strain REL606
NC_012967.1.lengths	Simple tab delimtered file based on the size of the reference needed for SVDetect so you don't have to create it yourself

Map data using bowtie2

First we need to (surprise!) map the data. This will hopefully reinforce the bowtie2 tutorial you just completed.

Use hostname to verify you are still on the idev node.

If not, and you need help getting a new idev node, see this tutorial.

conda activate GVA-bowtie2-mapping
bowtie2-build NC_012967.1.fasta NC_012967.1
bowtie2 -t -p 64 -X 5000 --rf -x NC_012967.1 -1 61FTVAAXX_2_1.fastq -2 61FTVAAXX_2_2.fastq -S 61FTVAAXX.sam

Possibly unfamiliar options:

--rf tells bowtie2 that your read pairs are in the "reverse-forward" orientation of a mate-pair library
-X 5000 tells bowtie2 to not mark read pairs as discordant unless their insert size is greater than 5000 bases.

You may notice that these commands complete pretty quickly. Always remember speed is not necessarily representative of how taxing something is for TACC's head node, and always try to be a good TACC citizen and do as much as you can on idev nodes or as job submissions

Install SVDetect

This will be the most complicated installation yet. In addition to needing to install several different programs in the same conda installation command, we will need to install perl modules through cpan. Unfortunately, the cpan network can not be accessed through the compute nodes, so you must log out of your idev session using the logout command before continuing. If you are unsure if you are in an idev session remember you can use the hostname command to check.

conda installation

Like we saw in our samtools installation, we will need to install several programs at the same time to make sure they are all going to work with each other. In addition, we are going to create a new environment for working with SVDetect as some of the dependencies of SVDetect clash with those of samtools.

conda create --name GVA-SV -c bioconda -c conda-forge -c imperial-college-research-computing _libgcc_mutex perl libgcc-ng svdetect

conda activate GVA-SV

cpan module installations

If you attempt to launch cpan, you likely get a message similar to the following:

/home1/0004/train402/miniconda3/envs/svdetect/bin/perl: symbol lookup error: /home1/apps/bioperl/1.007002/lib/perl5/x86_64-linux-thread-multi/auto/version/vxs/vxs.so: undefined symbol: Perl_xs_apiversion_bootcheck

which -a cpan

~/miniconda3/envs/SVDetect/bin/cpan
/usr/bin/cpan

While just typing cpan the first location was used and we saw it didn't work as we had hoped. We can explicitly launch the 2nd location by using the full path to the executable file on the prompt

/usr/bin/cpan

In the following block note that each elipse will include large blocks of scrolling text as different modules are downloaded and installed. The process will take several minutes in total, just be ready to execute the next command when you get the cpan prompt back.

# choose 'yes' to do as much automatically as possible 
# choose 'local::lib' for the approach you want (as you don't have admin rights on TACC)
# choose 'yes' to automatically choose some CPAN mirror sites for you
...
# choose 'yes' to append the information to your .bashrc file
...
cpan[1]> install Config::General
...
cpan[2]> install Tie::IxHash
...
cpan[3]> install Parallel::ForkManager
...
cpan[4]> quit

Several students were having trouble with the cpan downloads that seem to be related to some kind of interruption in the initial download process. The following commands have solved the issue for at least 1 student. Please try the following if you were unable to get the above cpan downloads to work, and let me know if you continue to experience difficulties.

/usr/bin/cpan

o conf init

This should then get you back to the point where you can:

# choose 'yes' to do as much automatically as possible 
# choose 'local::lib' for the approach you want (as you don't have admin rights on TACC)
# choose 'yes' to automatically choose some CPAN mirror sites for you
...
# choose 'yes' to append the information to your .bashrc file
...
cpan[1]> install Config::General
...
cpan[2]> install Tie::IxHash
...
cpan[3]> install Parallel::ForkManager
...
cpan[4]> quit

The above solution is based on steps 4-7 of this page. Again, this installation is known to be difficult, if you are having problems, please let me know.

Just before getting your first cpan prompt, there is a block of text that looks something like this

PATH="/home1/0004/train402/perl5/bin${PATH:+:${PATH}}"; export PATH;
PERL5LIB="/home1/0004/train402/perl5/lib/perl5${PERL5LIB:+:${PERL5LIB}}"; export PERL5LIB;
PERL_LOCAL_LIB_ROOT="/home1/0004/train402/perl5${PERL_LOCAL_LIB_ROOT:+:${PERL_LOCAL_LIB_ROOT}}"; export PERL_LOCAL_LIB_ROOT;
PERL_MB_OPT="--install_base \"/home1/0004/train402/perl5\""; export PERL_MB_OPT;
PERL_MM_OPT="INSTALL_BASE=/home1/0004/train402/perl5"; export PERL_MM_OPT;

Would you like me to append that to /home1/0004/train402/.bashrc now? [yes]

The answer here is yes. What is being asked is if you would like the computer by default to be able to access these new perl 'lib' (libraries) you have created, and if you want perl binaries in your PATH. Recall that the PATH variable is where the command line searches when you enter a command so that you don't have to specify its location from the root directory. Similar is true of the PERL_LOCAL_LIB_ROOT and other variables except instead of being searched from the command line, the perl program searches them when commands inside the perl script are accessed. While there are ways to specify how to access these libraries when running individual commands, that is going to be much more complicated at a minimum, and may require editing scripts or programs.

Once you quit cpan, you will get a message to restart your shell. Since you are on a remote computer, you can accomplish the same thing by logging out of TACC and sshing back in.

If the above bold letters are not enough of a clue for what you need to do here (and/or where you need to go to find appropriate minitutorials), now is a good time to start thinking about what question you need to be asking or sending in an email. It is ok to be overwhelmed or lost especially with the class being virtual and not being able to get good feedback from me directly on your progress. I am happy too help, but can only do so if I know you are struggling.

Once you have logged back in, be sure to restart a new idev session, and activate your SVDetect conda environment.

Analyze read mapping distribution

The first step is to look at all mapped read pairs and whittle down the list only to those that have an unusual insert sizes (distances between the two reads in a pair).

cd $SCRATCH/GVA_sv_tutorial
conda activate GVA-SV
BAM_preprocessingPairs.pl -p 0 61FTVAAXX.sam &> preprocessing_results.txt

As we discussed in our earlier presentation, SV are often detected by looking for variations in library insert sizes. Look at the preprocessing_results.txt file created by the pearl script will answer the questions:

-- using -1142.566-5588.410 as normal range of insert size
Approximately 20% based on:
-- 994952 mapped pairs
---- 195705 abnormal mapped pairs
Approximately 0.5% based on:
-- Total : 1000000 pairs analysed
-- 5048 pairs whose one or both reads are unmapped

Possible things are:

The first answer can tell you what type of library it is if you did not know ahead of time (remember paired end reads have ~500-700bp inserts on average not 1000s of bp)
This should help underscore that a significant portion of your total reads (and thus variation) may be in structural variants. Unfortunately, this does require generating a mate-pair library to learn.
Low levels of unmapped read pairs suggests that both the reference is accurate, of a high quality, and free of contamination.
1. Note that contamination in this case refers only to other organisms, and adapter sequences, not other samples.

Running SVDetect

SVDetect demonstrates a common strategy in some programs with complex input where instead of including a lot of options on the command line, it reads in a simple text file that sets all of the required options. Lets look at how to create a configuration file:

Notice the next block contains line numbers. On lines 7 and 8 you see ##### and <USERNAME> ... these need to correspond to your scratch directory locations. You can easily check this with the pwd command.

Do not change anything else on the line.

If you are unsure what you should be replacing those place holders with please get my attention on zoom and I'll help you through it.

<general>
input_format=sam
sv_type=all
mates_orientation=RF
read1_length=35
read2_length=35
mates_file=/scratch/#####/<USERNAME>/GVA_sv_tutorial/61FTVAAXX.ab.sam
cmap_file=/scratch/#####/<USERNAME>/GVA_sv_tutorial/NC_012967.1.lengths
num_threads=48
</general>

<detection>
split_mate_file=0
window_size=5000
step_length=2500
</detection>

<filtering>
split_link_file=0
nb_pairs_threshold=7
strand_filtering=1
insert_size_filtering=1
mu_length=unknown1
sigma_length=unknown2
order_filtering=1
nb_pairs_order_threshold=5
</filtering>

<bed>
  <colorcode>
    255,0,0=1,4
    0,255,0=5,10
    0,0,255=11,100000
  </colorcode>
</bed>

Notice that lines 23 and 24 are listed as unknown1 and unknown2 respectively. While several of the other values are related to the actual library, these 2 values are actually critical to enabling and guiding the analysis (pick too small and even normal reads appear to support structural variants, pick too large and even variant reads appear as normal reads). Luckily these values are actually empirically determined as part of the preprocessing_results.txt file that was created with the perl script.

-- mu length = 2223, sigma length = 1122

You can either look up the values and replace unknown1 and unknown2 with 2223 and 1122 respectively

mu_sigma=$(grep "^-- mu length = [0-9]*, sigma length = [0-9]*$" preprocessing_results.txt | \
sed -E 's/-- mu length = ([0-9]+), sigma length = ([0-9]+)/\1,\2/'); \
mu=$(echo $mu_sigma | cut -d "," -f 1); \
sigma=$(echo $mu_sigma | cut -d "," -f 2); \
sed -i -E "s/mu_length=unknown1/mu_length=$mu/" svdetect.conf; \
sed -i -E "s/sigma_length=unknown2/sigma_length=$sigma/" svdetect.conf

Note that the above is considered a '1-liner' even though it is spread across 6 lines for clarity. When \ is encountered at the end of the line the command line waits until it reaches the end of a line without finding a "\" before executing anything. As the following commands will take a few minutes each and must be completed in order, launch them in order, but read ahead as to what values in our svdetect.conf file are critical to analysis, and why we picked the values we did.

SVDetect linking -conf svdetect.conf
SVDetect filtering -conf svdetect.conf
SVDetect links2SV -conf svdetect.conf

Each command will finish faster than the one before it with the longest period with no advancement seeming to be after "# Defining precise link coordinates..." is printed to the screen. While waiting:

option	value	reason
mu_length	2223	this is taken directly from the preprocessing perl script, it enables classification of the different discordant read mappings
sigma_length	1122	Again taken from the preprocessing perl script, it enables classification of the different discordant read mappings
window_size	5000	Must be set to at least 2mu + (2sigma)^0.5to enable balance classification (4540 would be minimum value in this case)
step_length	2500	Documentation suggests this be 25-50% of the window size, smaller sizes enable more precise endpoint determination.
strand; insert size; ordering_filtering	1	Setting each to 1 turns on such filtering, absent being listed, or being listed as 0, these tests would not run

You may also consult the manual for a full description of what the commands and options inside of the svdetect.conf file.

First lets break this one liner down into its 6 parts:

mu_sigma=$(grep "^-- mu length = [0-9]*, sigma length = [0-9]*$" preprocessing_results.txt | \
sed -E 's/-- mu length = ([0-9]+), sigma length = ([0-9]+)/\1,\2/'); \
mu=$(echo $mu_sigma | cut -d "," -f 1); \
sigma=$(echo $mu_sigma | cut -d "," -f 2); \
sed -i -E "s/mu_length=unknown1/mu_length=$mu/" svdetect.conf; \
sed -i -E "s/sigma_length=unknown2/sigma_length=$sigma/" svdetect.conf

Line 1 uses grep on the preprocessing_results.txt file looking for any line that matches "^-- mu length = [0-9]*, sigma length = [0-9]*$" and stores it in a variable mu_sigma.
1. mu_sigma=$(......) stores everything between the () marks in a command line variable named mu_sigma ... you should notice that the closing ) mark is actually on line 2
2. the | at the end of the line is the "pipe" command which passes the output to the next command (line in this case as we have broken the command up into parts)
Line 2 uses the sed command to delete everything that is not a number or , and finishes storing the output in the mu_sigma variable
1. sed commands can be broken down as follows: 's/find/replace/'
  1. in this case, find:
    1. -- mu length = ([0-9]+), sigma length = ([0-9]+)
    2. where :
      1. [0-9] is any number
      2. the + sign means find whatever is to the left 1 or more times
      3. and things between () should be remembered for use in the replace portion of the command
  2. likewise, in this case, replace:
    1. \1,\2/
    2. where
      1. \1 means whatever was between the first set of () marks in the find portion
      2. , is a literal comma
      3. \2 means whatever was between the sescond set of () marks in the find portion
2. at the end of line 2 we now have a new variable named mu_sigma with a value of "2223,1122"
Line 3 creates a new variable named mu and gives it the value of whatever is to the left of the first , it finds.
1. echo $mu_sigma |
  1. pass the value of $mu_sigma to whatever is on the other side of |
2. cut -d "," -f 1
  1. divide whatever the cut command sees at all the "," marks and then print whatever is to the left of the 1st
3. at the end of line 3 we now have a variable named mu with the value "2223"
Line 4 does the same thing as line 3 except for a variable named sigma, and takes whatever is between the 2nd comma and 3rd comma (since we only have 1 comma, its taking whatever comes after the comma)
1. at the end of line 4 we now have a variable named sigma with the value "1122"
Line 5 looks through the entire svdetect.conf file looking for a line that matches mu_length=unknown1 and replaces all that text with mu_length=$mu (except the computer knows $mu is the variable with the value 2223.
1. the -i option tells the sed command to do the replacement in place meaning you are changing the contents of the file
2. the "" marks tell the command line that you want to evaluate whatever is between the ""marks, in this case, the mu variable
3. at the end of line 5, our svdetect.conf file line 23 now reads mu_length=2223
Line 5 looks through the entire svdetect.conf file looking for a line that matches sigma_length=unknown2 and replaces all that text with sigma_length=$sigma (except the computer knows $sigma is the variable with the value 1122.
1. the -i option tells the sed command to do the replacement in place meaning you are changing the contents of the file
2. the "" marks tell the command line that you want to evaluate whatever is between the ""marks, in this case, the sigma variable
3. at the end of line 5, our svdetect.conf file line 24 now reads sigma_length=1122

If you decide that SVdetect is the program you would like to use for calling your structural variants the following 1 line command will become invaluable:

SVDetect linking -conf svdetect.conf && SVDetect filtering -conf svdetect.conf && SVDetect links2SV -conf svdetect.conf && echo "SVDetect finished without Errors"

Note that because it is the svdetect.conf file that contains the sample specific information, this is the only command you will ever need for running SVDetect (you will still need to run the preprocessing script and edit the svdetect.conf file for each sample).

The use of the double ampersand (&&) has a special meaning working with piping commands in bash. Specifically it means only do the next command if the previous command finished without errors. By sticking the echo command at the very end you get an easy indicator telling you all the steps of the program finished correctly.

Take a look at the final output file: 61FTVAAXX.ab.sam.links.filtered.sv.txt. Another downside of command line applications is that while you can print files to the screen, the formatting is not always the nicest. On the plus side in 95% of cases, you can directly copy the output from the terminal window to excel and make better sense of what the columns actually are. Unfortunately, this is not one of those times. On Friday I'll explain how I work around this.

I've copied a few of the lines after pasting into excel below :

chr_type	SV_type	BAL_type	chromosome1	start1-end1	average_dist	chromosome2	start2-end2	nb_pairs	score_strand_filtering	score_order_filtering	score_insert_size_filtering	final_score	breakpoint1_start1-end1	breakpoint2_start2-end2
INTRA	UNDEFINED	UNBAL	chrNC_012967	624987-629995	305	chrNC_012967	624988-629996	3201	82%	100%	98%	0.821	624305-624987	629996-630678
INTRA	UNDEFINED	UNBAL	chrNC_012967	624699-627533	340	chrNC_012967	624988-628769	1889	85%	100%	98%	0.835	621843-624699	628769-630678
INTRA	UNDEFINED	UNBAL	chrNC_012967	625953-629995	321	chrNC_012967	627473-630044	1646	83%	100%	97%	0.817	624305-625953	630044-633163
INTRA	LARGE_DUPLI	UNBAL	chrNC_012967	599566-602498	63831	chrNC_012967	662625-666126	658	100%	100%	100%	1	596808-599566	666126-668315
INTRA	LARGE_DUPLI	UNBAL	chrNC_012967	599966-602869	63804	chrNC_012967	663105-666126	512	100%	100%	100%	1	597179-599966	666126-668795
INTRA	LARGE_DUPLI	UNBAL	chrNC_012967	3-2025	4627075	chrNC_012967	4626530-4629804	436	100%	99%	100%	0.995	3-Jan	4629804-4629812
INTRA	INVERSION	UNBAL	chrNC_012967	17471-20179	2757879	chrNC_012967	2774440-2777242	237	100%	100%	-	1	14489-17471	2771552-2774440

The first 5 lines all refer to the same general region from approximately 600000 to 663000 (as do several others lower in the file). This is in fact a head to tail duplication involving citrate utilization. One of the reasons for the multiple lines referring to the same event is that the amplification itself is nested with slight differences in start/stop locations.
Line 6 is just the origin of the circular chromosome, connecting its end to the beginning! Some programs allow for flags to denote a sequence as circular to avoid reporting things like this, SVdetect is not one of them.
The last line is listed as big chromosomal inversion (possibly mediated by recombination between repeated sequence such as IS elements in the genome). It was only detected because the insert size of the library was > ~1,500 bp!
Alternatively, the last line (or certainly inversions listed lower in the file) may be a it may be a mobile element jumping into a new location in the genome. 2 key reasons for thinking this may be the case, are 1 there are a lot of inversions listed, but no insertions or translocations, and this is an E coli genome known to have highly active mobile elements. 2 we discussed what MAPQ scores mean in terms of repetitive elements, yet have done no filtering for such elements, and while the inversion listed above has a high number of read pairs supporting it, many of the others do not as you would expect for only some of the copies being randomly assigned among any of the appropriate locations.

Return to GVA2022 course page.