Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In previous years it has been common to question what the fastest way of getting a large set of samples analyzed is with respect to threads and Nodes and tasks. Here we hav an opportunity to do just that, and have some surprising results. Since we have been working with idev sessions all along we'll start with the following:

run typeprocessors (-w)time
idev-w 6816 minutes
idev-w 414 minutes

This brings us to our first question: how can using fewer processors speed things up. I am not completely sure but I do have some hypotheses, and expect the answer to be somewhere in the middle:

...

Given that threads only effect the speed of a single sample I also attempted to trim all the read files at once by telling the idev node to run them in the background (this was done by adding a single & as the last character on each line of the commands file)

run typeprocessors (-w)time
idev background-w 68NA error after 54 samples finished 
idev background-w 1NA error after 148 samples finished

While both of these errored after ~1/5 and ~1/2 of the total samples making them a non viable single pass method, they did finish in <10 seconds. Using grep we could generate a list of samples which did finish, store those in a file, then use another grep command with the -v and -f options to remove the finished samples from analysis. This is something that can be very helpful if you have a large number of samples and your run timed out as you can focus on just samples that have not yet finished instead of having to rerun all samples, or manually edit the commands file to remove samples which finished. Since this run failed due to lack of memory, not lack of time its less reasonable to try to tackle it in this manner.

That brings us to our last set of runs: those working with the commands as a submitted job.

run typeprocessors (-w), jobs at once (-n)time
sbatch-w 4  -n 172 minutes
sbatch-w 1 -n 682 minutes
sbatch-w 68 -n 117 minutes

Based on what we have already seen, it is probably not surprising that using 68 (really 16) threads and only evaluating 1 sample at a time took approximately the same amount of time as it did when running on an idev node as those conditions are functionally equivalent. Whaat may be surprising is the lack of improvement despite running 4x more samples at the same time. Again hypothesies:

...

Info

The take away here is that "should I use more threads per command, or launch more simultaneous commands" is not a simple thing to answer. Perhaps as you become more familiar with a particular command you can tease these apart, but more of then not you will likely find yourself balancing the 2 where you luanch jobs with 4-16 threads each and as many commands at once as the 68 available threads and accommodate. 

Evaluating the output

Lets look at how the trimming actually went. The first thing you always want to check when working with a lot of commands simultaneously is if they all finished correctly (note above how running all 272 jobs in the background at basically the same time did not finish correctly). Typically this is where the log files we generate with "&>" come in handy. if you look at the tail of any of the 

Code Block
languagebash
titlesubmit the job to run on the que
ls Trim_Reads | wc -l
grep  -c "TrimmomaticPE: Completed successfully" Queue_job.o*
grep -c "100.00%" Queue_job.o*

The above 3 commands are expected to show 1088 and 272 and 0 respectively, if you see other answers it suggests that something went wrong with the trimming command itself. If so remember I'm on zoom if you need help looking at whats going on. The most common error that I expect will be that you ran out of ram by trying to process too many samples at once (such as if you used a -n 68 option in your .slurm file). 

Beyond the job finishing successfully, the best way to evaluate this data would actually be to move back to the multiqc tutorial and repeat the analysis there that was done on the raw files for the trimmed files here.

Personal experience

The for loop above focuses just on generating the trim commands. In my experience that is only half of the job, the other half is capturing the individual outputs so you can evaluate how many reads were trimmed in what way for each sample. Perhaps you will find this command helpful in your own work: 

Code Block
languagebash
titlecommand taken from my history to trim all files
mkdir trimLogs; mkdir Trim_Reads; for r1 in Raw_Reads/*_R1_*.fastq.gz; do r2=$(echo $r1|sed 's/_R1_/_R2_/'); name=$(echo $r1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "trimmomatic PE $r1 $r2 -baseout Trim_Reads/$name.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:4:30:10 MINLEN:30 >& trimLogs/$name.log"; done > trim_commands

1 of the files you are likely to see something like:

No Format
Duplication rate: 6.5207%

Insert size peak (evaluated by paired-end reads): 80

JSON report: Trim_Logs/E2-1_S189_L001.json
HTML report: rim_Logs/E2-1_S189_L001.html

fastp -i Raw_Reads/E2-1_S189_L001_R1_001.fastq.gz -I Raw_Reads/E2-1_S189_L001_R2_001.fastq.gz -o Trim_Reads/E2-1_S189_L001.trim.R1.fastq.gz -O Trim_Reads/E2-1_S189_L001.trim.R2.fastq.gz -w 68 --detect_adapter_for_pe -j Trim_Logs/E2-1_S189_L001.json -h Trim_Logs/E2-1_S189_L001.html 
fastp v0.23.2, time used: 3 seconds

Typically what you should look for is some kind of anchor that you can pass to grep that is as far down the file as possible. Sometimes you will be lucky and the program will actually print something like "successfully complete". In our case the last line looks promising, "fastp v0.23.2, time used:" seems likely to be printed as the last step in the program.After the job completes, the following command is useful for evaluating its success:

Code Block
languagebash
titlecommand taken from my history to trim all filessubmit the job to run on the que
ls Trim_Logs/*.log.txt|wc -l
echo -e "\nTotal read pairs:";wc -l trim_commands;echo "Successful trimmings:"; fastp.commands
tail -n 1 trimLogsTrim_Logs/*.log.txt|grep -c "TrimmomaticPE: Completed successfully";echo "Potential trimming errors:"; cat trimLogs/*|grep -c "100.00%"

This gives me a quick readout of if all of the individual commands finished correctly.

^fastp v0.23.2, time used:" 

The above 3 commands are all expected to return 272. If so remember I'm on zoom if you need help looking at whats going on. 

Beyond the job finishing successfully, the best way to evaluate this data would actually be to actually use it. That could me running it through multiqc to see how the trimming has effected overall quality, or assembling the reads, or mapping them to a reference (not applicable in this case since they are from a variety of sources and you have not been provided reference filesFurther, if you were to use this for loop rather than the one listed in the tutorial above, you would see each sample generates its own log file in the the trimLogs directory that when investigated with cat/more/less/tail/nano is identical to the output you saw when you trimmed a single sentence allowing you to figure out how different samples are being processed.

Optional next steps:

  1. Consider moving back over to the multiqc tutorial use multiqc to determine well the trimming worked. 
  2. The reads could then further be assembled in the Spades tutorial as they are all plasmid samples.

...