...

Notice that lines 23 and 24 are listed as unknown1 and unknown2 respectively. While several of the other values are related to the actual library, these 2 values are critical to enabling and guiding the analysis (pick too small and even normal reads appear to support structural variants; pick too large and even variant reads appear as normal reads). Luckily, these values are determined empirically and reported in the preprocessing_results.txt file that was created with the perl script.

...


  • Critical values and their choice

    option | value | reason
    mu_length | 2223 | Taken directly from the preprocessing perl script; it enables classification of the different discordant read mappings
    sigma_length | 1122 | Again taken from the preprocessing perl script; it enables classification of the different discordant read mappings
    window_size | 5000 | Must be set to at least 2mu + (2sigma)^0.5 to enable balance classification (4540 would be the minimum value in this case)
    step_length | 2500 | Documentation suggests this be 25-50% of the window size; smaller sizes enable more precise endpoint determination
    strand; insert size; ordering_filtering | 1 | Setting each to 1 turns on that filter; if an option is absent, or listed as 0, its test is not run

    You may also consult the manual for a full description of the commands and options inside the svdetect.conf file.
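
    As a rough sanity check on the window_size and step_length rows, the arithmetic can be sketched in the shell. The formula used here, 2*mu + 2*sqrt(2*sigma), is an assumption: it is the reading of the bound that reproduces the 4540 quoted above, so confirm it against the SVDetect manual before relying on it. The variable names are only for illustration.

```shell
# Values from the table (mu_length and sigma_length come from preprocessing_results.txt)
mu=2223
sigma=1122
window_size=5000

# ASSUMPTION: minimum window read as 2*mu + 2*sqrt(2*sigma); %d truncates to a whole number
min_window=$(awk -v mu="$mu" -v sigma="$sigma" 'BEGIN { printf "%d", 2*mu + 2*sqrt(2*sigma) }')

# Suggested step_length range: 25-50% of the window size
step_low=$((window_size / 4))
step_high=$((window_size / 2))

echo "minimum window_size: $min_window"
echo "step_length range: $step_low-$step_high"

# Compare the chosen window against the computed minimum
if [ "$window_size" -ge "$min_window" ]; then
  echo "window_size $window_size is large enough"
else
  echo "window_size $window_size is too small"
fi
```

    With the values shown, the chosen window_size of 5000 clears the computed minimum, and the chosen step_length of 2500 sits at the top of the 1250-2500 range.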



  • A breakdown of what this one-liner is doing

    First, let's break this one-liner down into its 6 parts:

    mu_sigma=$(grep "^-- mu length = [0-9]*, sigma length = [0-9]*$" preprocessing_results.txt | 
    sed -E 's/-- mu length = ([0-9]+), sigma length = ([0-9]+)/\1,\2/');
    mu=$(echo $mu_sigma | cut -d "," -f 1);
    sigma=$(echo $mu_sigma | cut -d "," -f 2);
    sed -i -E "s/mu_length=unknown1/mu_length=$mu/" svdetect.conf;
    sed -i -E "s/sigma_length=unknown2/sigma_length=$sigma/" svdetect.conf
    1. Line 1  uses grep on the preprocessing_results.txt file looking for any line that matches "^-- mu length = [0-9]*, sigma length = [0-9]*$" and stores it in a variable mu_sigma.
      1. mu_sigma=$(......) stores everything between the () marks in a command line variable named mu_sigma ... you should notice that the closing ) mark is actually on line 2
      2. the | at the end of the line is the "pipe" command which passes the output to the next command (line in this case as we have broken the command up into parts)
    2. Line 2 uses the sed command to delete everything that is not a number or , and finishes storing the output in the mu_sigma variable
      1. sed commands can be broken down as follows: 's/find/replace/'
        1. in this case, find:
          1. -- mu length = ([0-9]+), sigma length = ([0-9]+)
          2. where :
            1. [0-9] is any number
            2. the + sign means find whatever is to the left 1 or more times
            3. and things between () should be remembered for use in the replace portion of the command
        2. likewise, in this case, replace:
          1. \1,\2
          2. where
            1. \1 means whatever was between the first set of () marks in the find portion
            2. , is a literal comma
            3. \2 means whatever was between the second set of () marks in the find portion
      2. at the end of line 2 we now have a new variable named mu_sigma with a value of "2223,1122"
    3. Line 3 creates a new variable named mu and gives it the value of whatever is to the left of the first , it finds.
      1. echo $mu_sigma |
        1. pass the value of $mu_sigma to whatever is on the other side of |
      2. cut -d "," -f 1
        1. split whatever the cut command receives at every "," mark and then print the 1st field, i.e. whatever is to the left of the 1st comma
      3. at the end of line 3 we now have a variable named mu with the value "2223"
    4. Line 4 does the same thing as line 3 except for a variable named sigma, and takes the 2nd field, which is whatever sits between the 1st and 2nd commas (since we only have 1 comma, it takes whatever comes after the comma)
      1. at the end of line 4 we now have a variable named sigma with the value "1122"
    5. Line 5 looks through the entire svdetect.conf file for a line that matches mu_length=unknown1 and replaces that text with mu_length=$mu (except the computer knows $mu is the variable with the value 2223).
      1. the -i option tells the sed command to do the replacement in place, meaning you are changing the contents of the file
      2. the "" marks tell the command line to substitute the value of any variable between the "" marks, in this case the mu variable (inside '' marks, $mu would be left as literal text)
      3. at the end of line 5, our svdetect.conf file line 23 now reads mu_length=2223
    6. Line 6 looks through the entire svdetect.conf file for a line that matches sigma_length=unknown2 and replaces that text with sigma_length=$sigma (except the computer knows $sigma is the variable with the value 1122).
      1. the -i option tells the sed command to do the replacement in place, meaning you are changing the contents of the file
      2. the "" marks tell the command line to substitute the value of any variable between the "" marks, in this case the sigma variable
      3. at the end of line 6, our svdetect.conf file line 24 now reads sigma_length=1122
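
    The six steps above can be tried safely on scratch copies before touching the real files. This is a minimal sketch, not the tutorial's own command: the two files created here are hypothetical stand-ins containing only the lines the one-liner cares about, and the result is written to a new file (svdetect.filled.conf, a name chosen for illustration) rather than edited in place with -i.

```shell
# Work in a throwaway directory so the real files are never touched
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# HYPOTHETICAL stand-ins for the real files, reduced to the relevant lines
cat > preprocessing_results.txt <<'EOF'
-- mu length = 2223, sigma length = 1122
EOF
cat > svdetect.conf <<'EOF'
mu_length=unknown1
sigma_length=unknown2
EOF

# Steps 1-2: find the matching line and reduce it to "mu,sigma"
mu_sigma=$(grep "^-- mu length = [0-9]*, sigma length = [0-9]*$" preprocessing_results.txt |
  sed -E 's/-- mu length = ([0-9]+), sigma length = ([0-9]+)/\1,\2/')

# Steps 3-4: split on the comma into two variables
mu=$(echo "$mu_sigma" | cut -d "," -f 1)
sigma=$(echo "$mu_sigma" | cut -d "," -f 2)

# Steps 5-6: substitute both placeholders, writing to a new file instead of using -i
sed -E -e "s/mu_length=unknown1/mu_length=$mu/" \
       -e "s/sigma_length=unknown2/sigma_length=$sigma/" svdetect.conf > svdetect.filled.conf

cat svdetect.filled.conf

# Count leftover placeholders (0 means both were replaced)
leftover=$(grep -c "unknown" svdetect.filled.conf || true)
echo "placeholders left: $leftover"
```

    If the leftover count is not 0, the grep pattern probably did not match the line in preprocessing_results.txt, which leaves $mu and $sigma empty.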


...

I've copied a few of the lines below after pasting them into Excel:

chr_type | SV_type | BAL_type | chromosome1 | start1-end1 | average_dist | chromosome2 | start2-end2 | nb_pairs | score_strand_filtering | score_order_filtering | score_insert_size_filtering | final_score | breakpoint1_start1-end1 | breakpoint2_start2-end2
INTRA | UNDEFINED | UNBAL | chrNC_012967 | 624987-629995 | 305 | chrNC_012967 | 624988-629996 | 3201 | 82% | 100% | 98% | 0.821 | 624305-624987 | 629996-630678
INTRA | UNDEFINED | UNBAL | chrNC_012967 | 624699-627533 | 340 | chrNC_012967 | 624988-628769 | 1889 | 85% | 100% | 98% | 0.835 | 621843-624699 | 628769-630678
INTRA | UNDEFINED | UNBAL | chrNC_012967 | 625953-629995 | 321 | chrNC_012967 | 627473-630044 | 1646 | 83% | 100% | 97% | 0.817 | 624305-625953 | 630044-633163
INTRA | LARGE_DUPLI | UNBAL | chrNC_012967 | 599566-602498 | 63831 | chrNC_012967 | 662625-666126 | 658 | 100% | 100% | 100% | 1 | 596808-599566 | 666126-668315
INTRA | LARGE_DUPLI | UNBAL | chrNC_012967 | 599966-602869 | 63804 | chrNC_012967 | 663105-666126 | 512 | 100% | 100% | 100% | 1 | 597179-599966 | 666126-668795
INTRA | LARGE_DUPLI | UNBAL | chrNC_012967 | 3-2025 | 4627075 | chrNC_012967 | 4626530-4629804 | 436 | 100% | 99% | 100% | 0.995 | 1-3 | 4629804-4629812
INTRA | INVERSION | UNBAL | chrNC_012967 | 17471-20179 | 2757879 | chrNC_012967 | 2774440-2777242 | 237 | 100% | 100% | - | 1 | 14489-17471 | 2771552-2774440


Any idea what sorts of mutations produced these three structural variants?
  • The first 5 lines all refer to the same general region, from approximately 600000 to 663000 (as do several others lower in the file). This is in fact a head-to-tail duplication involving citrate utilization. One of the reasons multiple lines refer to the same event is that the amplification itself is nested, with slight differences in start/stop locations.
  • Line 6 is just the origin of the circular chromosome, connecting its end to the beginning! Some programs allow flags that denote a sequence as circular to avoid reporting things like this; SVDetect is not one of them.
  • The last line is listed as a big chromosomal inversion (possibly mediated by recombination between repeated sequences such as IS elements in the genome). It was only detected because the insert size of the library was > ~1,500 bp!
  • Alternatively, the last line (or certainly the inversions listed lower in the file) may be a mobile element jumping into a new location in the genome. There are 2 key reasons for thinking this may be the case: 1) there are a lot of inversions listed, but no insertions or translocations, and this is an E. coli genome known to have highly active mobile elements; 2) we discussed what MAPQ scores mean in terms of repetitive elements, yet have done no filtering for such elements, and while the inversion listed above has a high number of read pairs supporting it, many of the others do not, as you would expect if only some of the read pairs were randomly assigned to any one of the appropriate repeat locations.
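
The reasoning above leans on how many events of each type are listed and how many read pairs support them. A minimal sketch of that tally follows, assuming the SVDetect output is tab-delimited with SV_type in column 2 and nb_pairs in column 9, as in the header shown earlier. The sample file holds only the first 9 columns of the rows copied above, and the cutoff of 500 pairs is an arbitrary value chosen for illustration.

```shell
# HYPOTHETICAL sample file: first 9 columns of the rows shown above, converted to tab-delimited
sample=$(mktemp)
tr ' ' '\t' > "$sample" <<'EOF'
chr_type SV_type BAL_type chromosome1 start1-end1 average_dist chromosome2 start2-end2 nb_pairs
INTRA UNDEFINED UNBAL chrNC_012967 624987-629995 305 chrNC_012967 624988-629996 3201
INTRA UNDEFINED UNBAL chrNC_012967 624699-627533 340 chrNC_012967 624988-628769 1889
INTRA UNDEFINED UNBAL chrNC_012967 625953-629995 321 chrNC_012967 627473-630044 1646
INTRA LARGE_DUPLI UNBAL chrNC_012967 599566-602498 63831 chrNC_012967 662625-666126 658
INTRA LARGE_DUPLI UNBAL chrNC_012967 599966-602869 63804 chrNC_012967 663105-666126 512
INTRA LARGE_DUPLI UNBAL chrNC_012967 3-2025 4627075 chrNC_012967 4626530-4629804 436
INTRA INVERSION UNBAL chrNC_012967 17471-20179 2757879 chrNC_012967 2774440-2777242 237
EOF

# Tally events per SV_type (column 2), skipping the header line
type_counts=$(awk -F '\t' 'NR > 1 { n[$2]++ } END { for (t in n) print t, n[t] }' "$sample" | sort)
echo "$type_counts"

# Count events supported by at least min_pairs read pairs (column 9)
min_pairs=500   # ASSUMPTION: arbitrary cutoff for illustration
well_supported=$(awk -F '\t' -v min="$min_pairs" 'NR > 1 && $9 >= min { c++ } END { print c + 0 }' "$sample")
echo "events with >= $min_pairs pairs: $well_supported"
```

Run against the full results file instead of the sample, the same two commands show at a glance whether inversions dominate the list and how many of them have only weak read-pair support.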

...