Before executing the script you need to have one environment variable set. We'll do it at the command line here, but you could put it in your .bashrc file.
cd $SCRATCH/core_ngs/align2 $path_code/script/align/align_bwa_illumina.sh
Run the pipeline. By piping the output to tee <filename> we can see the script's progress at the terminal, and it also is written to <filename>.
There are lots of bells and whistles in the arguments, but the most important are the first few:
- FASTQ file – full or relative path to the FASTQ file (just the R1 fastq if paired end). Can be compressed (.gz)
- output prefix – prefix for all the output files produced by the script. Should relate back to what the data is.
- assembly – genome assembly to use.
- there are pre-built indexes for some common eukaryotes (hg19, hg18, mm10, mm9, danRer7, sacCer3) that you can use
- or provide a full path for a bwa reference index you have built somewhere
- paired flag – 0 means single end (the default); 1 means paired end
- hard trim length – if you want the FASTQ hard trimmed down to a specific length, supply that number here
Now run the pipeline. By piping the output to tee <filename> we can see the script's progress at the terminal, and it also is written to <filename>.
$path_code/script/align/align_bwa_illumina.sh ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz yeast_chip sacCer3 1 2>&1 | tee aln.yeast_chip.log
TACC batch system considerations
The great thing about pipeline scripts like this is that you can perform alignments on many datasets in parallel at TACC.
Anna's alignment pipeline scripts are written to take advantage of having multiple cores on TACC nodes, and are thus designed to run with at most two pipeline commands per TACC node.
Assuming you have your alignment commands in a file called aln.cmds, here's how to create and submit a batch job for the commands.
launcher_creator.py -n aln -j aln.cmds -t 12:00:00 -q normal -w 2 sbatch aln.slurm showq -u
Note the maximum run time specified here is 12 hours (-t 12:00:00). This is a reasonable value for a higher eukaryote with 20-40 M reads, and is way more than a yeast alignment would need (~ 4 hours). For very deeply sequenced eukaryotes (e.g. human genome re-sequencing with hundresd of millions of reads), you may want to specify the maximum job time of 48 hours.
Exercise: What would alignment commands look like if you were putting it in a batch system .cmds file?
Assuming you have $path_code set properly before submitting the job, the batch command would look like the command above, but you don't need the tee pipe. Instead, just redirect all output to a file. The example below shows how you would run alignments on two yeast samples in a batch file, adjusting the output prefix (yeast1, yeast2) and log file (aln.yeast1.log, aln.yeast2.log) accordingly.