Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Before executing the script you need to have one environment variable set. We'll do it at the command line here, but you could put it in your .bashrc file.

 

Code Block
languagebash
export path_code=/work/01063/abattenh/seq/code

...

Code Block
languagebash
cd $SCRATCH/core_ngs/align2
$path_code/script/align/align_bwa_illumina.sh

Run the pipeline. By piping the output to tee <filename> we can see the script's progress at the terminal, and it also is written to <filename>.

There are lots of bells and whistles in the arguments, but the most important are the first few:

  1. FASTQ file – full or relative path to the FASTQ file (just the R1 fastq if paired end). Can be compressed (.gz)
  2. output prefix – prefix for all the output files produced by the script. Should relate back to what the data is.
  3. assembly – genome assembly to use.
    • there are pre-built indexes for some common eukaryotes (hg19, hg18, mm10, mm9, danRer7, sacCer3) that you can use
    • or provide a full path for a bwa reference index you have built somewhere
  4. paired flag – 0 means single end (the default); 1 means paired end
  5. hard trim length – if you want the FASTQ hard trimmed down to a specific length, supply that number here

Now run the pipeline. By piping the output to tee <filename> we can see the script's progress at the terminal, and it also is written to <filename>.

Code Block
language
Code Block
languagebash
$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz yeast_chip sacCer3 1 2>&1 | tee aln.yeast_chip.log

TACC batch system considerations

The great thing about pipeline scripts like this is that you can perform alignments on many datasets in parallel at TACC.

Anna's alignment pipeline scripts are written to take advantage of having multiple cores on TACC nodes, and are thus designed to run with at most two pipeline commands per TACC node.

Tip
titleAlways specify wayness 2 for these pipeline scripts

 

Assuming you have your alignment commands in a file called aln.cmds, here's how to create and submit a batch job for the commands.

 

Code Block
languagebash
titleSubmit BWA alignment pipeline job
launcher_creator.py -n aln -j aln.cmds -t 12:00:00 -q normal -w 2
sbatch aln.slurm
showq -u

Note the maximum run time specified here is 12 hours (-t 12:00:00). This is a reasonable value for a higher eukaryote with 20-40 M reads, and is way more than a yeast alignment would need (~ 4 hours). For very deeply sequenced eukaryotes (e.g. human genome re-sequencing with hundresd of millions of reads), you may want to specify the maximum job time of 48 hours.

Exercise: What would alignment commands look like if you were putting it in a batch system .cmds file?

 

Expand
titleAnswer

Assuming you have $path_code set properly before submitting the job, the batch command would look like the command above, but you don't need the tee pipe. Instead, just redirect all output to a file. The example below shows how you would run alignments on two yeast samples in a batch file, adjusting the output prefix (yeast1, yeast2) and log file (aln.yeast1.log, aln.yeast2.log) accordingly.

Code Block
languagebash
$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz yeast1 sacCer3 1 2>&1 > aln.yeast1.log
$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_ABCDE_L005_R1.cat.fastq.gz yeast2 sacCer3 1 2>&1 > aln.yeast2.log