Page History

...

Before executing the script you need to have one environment variable set. We'll do it at the command line here, but you could put it in your .bashrc file.

Code Block

language	bash

export path_code=/work/01063/abattenh/seq/code

...

Code Block

language	bash

cd $SCRATCH/core_ngs/align2
$path_code/script/align/align_bwa_illumina.sh

Run the pipeline. By piping the output to tee <filename> we can see the script's progress at the terminal, and it also is written to <filename>.

There are lots of bells and whistles in the arguments, but the most important are the first few:

FASTQ file – full or relative path to the FASTQ file (just the R1 fastq if paired end). Can be compressed (.gz)
output prefix – prefix for all the output files produced by the script. Should relate back to what the data is.
assembly – genome assembly to use.
- there are pre-built indexes for some common eukaryotes (hg19, hg18, mm10, mm9, danRer7, sacCer3) that you can use
- or provide a full path for a bwa reference index you have built somewhere
paired flag – 0 means single end (the default); 1 means paired end
hard trim length – if you want the FASTQ hard trimmed down to a specific length, supply that number here

Now run the pipeline. By piping the output to tee <filename> we can see the script's progress at the terminal, and it also is written to <filename>.

Code Block

language

Code Block

language	bash

$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz yeast_chip sacCer3 1 2>&1 | tee aln.yeast_chip.log

TACC batch system considerations

The great thing about pipeline scripts like this is that you can perform alignments on many datasets in parallel at TACC.

Anna's alignment pipeline scripts are written to take advantage of having multiple cores on TACC nodes, and are thus designed to run with at most two pipeline commands per TACC node.

Tip

title	Always specify wayness 2 for these pipeline scripts

Assuming you have your alignment commands in a file called aln.cmds, here's how to create and submit a batch job for the commands.

Code Block

language	bash
title	Submit BWA alignment pipeline job

launcher_creator.py -n aln -j aln.cmds -t 12:00:00 -q normal -w 2
sbatch aln.slurm
showq -u

Note the maximum run time specified here is 12 hours (-t 12:00:00). This is a reasonable value for a higher eukaryote with 20-40 M reads, and is way more than a yeast alignment would need (~ 4 hours). For very deeply sequenced eukaryotes (e.g. human genome re-sequencing with hundresd of millions of reads), you may want to specify the maximum job time of 48 hours.

Exercise: What would alignment commands look like if you were putting it in a batch system .cmds file?

Expand

title	Answer

Assuming you have $path_code set properly before submitting the job, the batch command would look like the command above, but you don't need the tee pipe. Instead, just redirect all output to a file. The example below shows how you would run alignments on two yeast samples in a batch file, adjusting the output prefix (yeast1, yeast2) and log file (aln.yeast1.log, aln.yeast2.log) accordingly.

Code Block

language	bash

$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz yeast1 sacCer3 1 2>&1 > aln.yeast1.log
$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_ABCDE_L005_R1.cat.fastq.gz yeast2 sacCer3 1 2>&1 > aln.yeast2.log

Page tree

Versions Compared

Old Version 111

New Version 112

Key

TACC batch system considerations