Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Note

The discussion of concepts in this tutorial are identical to the advanced mapping with bowtie2 tutorial and work with the same data. The commands in this tutorial will take ~15 minutes. Discussion of results is different in both tutorials.more thorough in the advanced mapping tutorial. As that tutorial directly assess the difference between multiple mapping types.

Warning
titlePrerequisite required

In order for this tutorial to work, you must have installed your own version of bowtie2 in this tutorial. If you have not the commands below will not run.

To verify these commands will work, use the 'which bowtie2' command and verify that bowtie2 resides in your $WORK directory not /opt


Learning objectives

  1. Understand how mapping against multiple reference genomes simultaneously decreases noise and increases confidence in variant calls.
  2. Call variants in multiple references simultaneously in breseq.

...

Code Block
languagebash
mkdir $SCRATCH/GVA_advanced_mapping_breseq
cp $BI/gva_course/novel_DNA/* $SCRATCH/GVA_advanced_mapping_breseq
cp $BI/gva_course/advanced_mapping/* $SCRATCH/GVA_advanced_mapping_breseq
cd $SCRATCH/GVA_advanced_mapping_breseq
ls

^ expected 2 fastq files and 2 gbk reference files

Set up the run

Hopefully by now, it will not surprise you that we will have a 3 step process: converting references to fasta, indexing the references, and mapping the reads. If it does recall the read mapping tutorial, and the novel DNA identification tutorial for additional information and examples. Here, less description of individual steps will be given.

Code Block
languagebash
titleConvert reference to fasta
module load bioperl
bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta
bp_seqconvert.pl --from genbank --to fasta < GFP_Plasmid_SKO4.gbk > GFP_Plasmid_SKO4.fasta

...

titleRecall that fasta files can have multiple sequence entries

...

Minimal information supplied about breseq here, please see advanced breseq tutorial for more information.


Warning
titleRemember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

Code Block
languagebash
title

...

As a challenge consider changing this command using the information from the advanced breseq tutorial to redirect the output to a log file and put the output in a subfolder that would allow you to make use of the helper script export-breseq and for loop for pulling .gd files to generate a compare table.
module unload bowtie/2.3.4

...

breseq -

...

j 48 -r CP009273.1_Eco_BW25113.

...

gbk -r GFP_Plasmid_SKO4.

...

^ this is the first time we have used the --threads command while making our index, this is just giving the index command access to all the threads/processors available. While increasing the amount of text printed to our terminal, it also speeds up the process.

^ additionally, you may notice that it uses a much shorter name of the index inside the bowtie2 directory. Recall advice about naming and the risk of making multiple 'my-bowtie2-reference' files and being able to easily know what they arefor.

The following command will take ~13 seconds to complete. Before you run the command execute 'bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial.

...

languagebash
titleMap reads

...

gbk SRR4341249_1.fastq 

...

SRR4341249_2.fastq

...


Evaluating the results

Recall the output directory must be transferred back to your local computer with scp to view the html output.

Here we map in with the same conditions as was used in the novel DNA identification tutorial. A description of the options which may be new to you is as follows:

...

titleClick here to explain the new options.

...

--very-sensitive-local

...

-t

...

-p 48

...

-x bowtie2/BW25113_pSKO4

...

-1 SRR4341249_1.fastq 
-2 SRR4341249_2.fastq

...

-S bowtie2/SRR4341249-vsl.sam

...

Expand
titleWhy did this mapping take ~13 seconds despite having 2 references to map against while in the novel DNA identification tutorial it took ~5 minutes?

In the novel DNA identification tutorial, we additionally requested unmapped reads to be printed directly to a fastq file for downstream use.

Evaluating the results

...

Expand
titleWhat percent of reads mapped?

The stdoutput of the program listed:

99.31% overall alignment rate

For reference incase you did not do the novel DNA tutorial, only 65.74% of reads mapped to the chromosome alone. This suggests that more than 1/3 of all the reads came from the plasmid.

...

Code Block
languagebash
titleHandy 1 liner to determine frequencies of reads mapping to each reference. Command broken down below.
echo -e "File\tReference\tReads"; for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done

Expected output:

No Format
File	Reference	Reads
SRR4341249-vsl	*	13730
SRR4341249-vsl	CP009273	1556647
SRR4341249-vsl	GFP_Plasmid_Sequ	882107
SRR4341249-vsl	total	2452484

Note that the alignment of columns is not awesome but can easily be copied to excel for better alignment.

Additionally note that this would work for any number of sam files (aka samples). 

While the data presented above may be referred to as 'tidy format' where each line contains 1 piece of information, making it easy for computers to parse, personally, my human eyes prefer tables where rows are related to samples (in this case sam files), and columns are related to values (in this case chromosomes mapped to). I'm sure that someone better versed in bash/awk/etc could put together a script to do so directly from the raw file. As I'm better versed in python, instead I use the for loop from the above command, and pipe the tidy results to a .tsv file, then use a python script to convert it to a table. How can you get a hold of this awesome tool you ask? BioITeam of course. 

Code Block
languagebash
titleAlternative command to create a mapping frequency table.
for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done > ref_counts.tsv;bt2-mapping-ref-count-table.py -i ref_counts.tsv -o ref_counts_table.tsv

While the above doesn't print anything, 'cat ref_counts_table.tsv' prints the following:

No Format
Sample	CP009273	GFP_Plasmid_Sequ	Unmapped	Total
SRR4341249-vsl	1556647	882107	13730	2452484

Again the format is not great, but can be copy pasted to Excel or similar for column alignment. and in either case this will produce a single row for each sample and single column for each reference, while at the same time, changing the * category (which from reading info about samtools you would see is a read that does not map to any reference) to a more human understood 'unmapped' classification.

You were promised a description of the complicated for loop above. 

No Format
example bash command to generate input file
for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done > ref_counts.tsv


for sam in *.sam;                           # for loop looking at sam files
    do                                      # standard bash for loop formating
    name=$(echo $sam | sed 's/.sam//');     # strip .sam ending from files
    grep -v "^@" $sam |                     # remove header lines from consideration
    awk -F '\t' '{print $3}' |              # Pull the 3rd column from the sam file. 3rd column is name of chromosome read mapped to
    sort |                                  # Sort chromosome names (required for uniq to work correctly)
    uniq -c |                               # take unique chromosome names and return the count of each
    awk                                     # Start of long awk command
        -v sam="$name"                      # define awk variable "sam" using shell variable "$name" (see above) for later printing
        '{sum+=$1}                          # define variable 'sum' as running count of how many total reads were evaluated
        {print sam,$2,$1}                   # for each line (chromosome) print the sam file name, chromosome name, and number of reads mapped to that chromosome
        END{print sam,"total",sum}'         # after all lines (chromosomes) dealt with, print sam file name, the word "total", and the value of variable 'sum' (total number of reads)
        OFS='\t';                           # OutputFieldSeparator set to tab so output will be tsv
    done                                    # standard bash for loop end
> ref_counts.tsv                            # redirect output to ref_counts.tsv file

The above box is a copy paste from the header information found in the bt2-mapping-ref-count-table.py file. 

Code Block
languagebash
titleTo quickly view the entire script consider the following command
less `which bt2-mapping-ref-count-table.py`
Tip

Note 1 last trick here where the text between the ` marks (backtick NOT single quote) is evaluated fist allowing you to open the script for viewing in the less viewer. Frustratingly, once you hit the first ` mark, you lose the ability to tab complete often making it easier to type the which command with tab completion, then go back to add the ` marks.

...

Check the summary.html file, ask for help if you don't understand where to look to find the info on that page.

Next steps

Consider running the same breseq command without the 2nd reference. Does it effect the mutations that are called? Does it effect the percent of reads mapped as was observed in the advanced mapping tutorial?

As noted above, the human trios tutorial deals with calling SNVs on multiple samples at once.  Consider how you could combine the two steps

Since we are looking at a bacteria and it's plasmid, you can run this through breseq. Tutorial can be found here.



Return to GVA2020