Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
align_bwa_illumina.sh 20172020_0906_0714
Align Illumina SE or PE data with bwa. Produces a sorted, indexed,
duplicate-marked BAM file and various statistics files. Usage:

align_bwa_illumina.sh <aln_mode> <in_file> <out_pfx> <assembly> [ paired trim_sz trim_sz2 seq_fmt qual_fmt ]

Required arguments:
  aln_mode  Alignment mode, either global (bwa aln) or local (bwa mem).
  in_file   For single-end alignments, path to input sequence file.
            For paired-end alignments using fastq, path to the the R1
            fastq file which must contain the string 'R1' in its name.
            The corresponding 'R2' must have the same path except for 'R1'.
  out_pfx   Desired prefix of output files in the current directory.
  assembly  One of hg38, hg19, hg38, mm10, mm9, sacCer3, sacCer1, ce11, ce10,
            danRer7, hs_mirbase, mm_mirbase, or reference index prefix.
Optional arguments:
  paired    0 = single end alignment (default); 1 = paired end.
  trim_sz   Size to trim reads to. Default 0 (no trimming)
  trim_sz2  Size to trim R2 reads to for paired end alignments.
            Defaults to trim_sz
  seq_fmt   Format of sequence file (fastq, bam or scarf). Default is
            fastq if the input file has a '.fastq' extension; scarf
            if it has a '.sequence.txt' extension.
  qual_type Type of read quality scores (sanger, illumina or solexa).
            Default is sanger for fastq, illumina for scarf.
Environment variables:
  show_only  1 = only show what would be done (default not set)
  aln_args   other bowtie2 options (e.g. '-T 20' for mem, '-l 20' for aln)
  no_markdup 1 = don't mark duplicates (default 0, mark duplicates)
  run_fastqc 1 = run fastqc (default 0, don't run). Note that output
             will be in the directory containing the fastq files.
  keep       1 = keep unsorted BAM (default 0, don't keep)
  bwa_bin    BWA binary to use. Default bwa 0.7.x. Note that bwa 0.6.2
             or earlier should be used for scarf and other short reads.
  also: NUM_THREADS, BAM_SORT_MEM, SORT_THREADS, JAVA_MEM_ARG

Examples:
  align_bwa_illumina.sh local  ABC_L001_R1.fastq.gz my_abc hg38 1
  align_bwa_illumina.sh global ABC_L001_R1.fastq.gz my_abc hg38 1 50
  align_bwa_illumina.sh global sequence.txt old sacCer3 0 '' '' scarf solexa

...

  1. aln_mode – whether to perform a global or local alignment (the 1st argument must be one of those words)
    • global mode uses the bwa aln workflow as we did above
    • local mode uses the bwa mem command
  2. FASTQ in_file – full or relative path to the FASTQ file (just the R1 fastq if paired end). Can be compressed (.gz)
  3. output prefix out_pfxprefix for all the output files produced by the script. Should relate back to what the data is.
  4. assembly – genome assembly to use.
    • there are pre-built indexes for some common eukaryotes (hg38, hg19, hg18, mm10, mm9, danRer7, sacCer3) that you can use
    • or provide a full path for a bwa reference index you have built somewhere
  5. paired flag – flag 0 means single end (the default); 1 means paired end
  6. hard trim length_sz – if you want the FASTQ hard trimmed down to a specific length before alignment, supply that number here

We're going to run this script and a similar Bowtie2 alignment script, on the yeast data using the TACC batch system. In a new directory, copy over the commands and submit the batch job. We ask for 2 hours 1 hour (-t 201:00:00) with 4 tasks/node (-w 4); since we have 4 commands, this will run on 1 compute node.

Expand
titleCatch up (if needed)
Code Block
languagebash
# Copy over the Yeast data if needed
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/Sample_Yeast*.gz $SCRATCH/core_ngs/alignment/fastq/

# Copy a pre-built sacCer3 reference if you didn't build one already
mkdir -p $SCRATCH/core_ngs/references
rsync -avrP $CORENGS/references/ $SCRATCH/core_ngs/references/

...

Code Block
languagebash
titleRun multiple alignments using the TACC batch system
# Make sure you're not in an idev session by looking at the hostname
hostname
# If the hostname starts with "nid", exit the idev session

# Make a new alignment directory for running these scripts
mkdir -p $SCRATCH/core_ngs/alignment/bwa_script
cd $SCRATCH/core_ngs/alignment/bwa_script
ln -s -f ../fastq

# Copy the alignment commands file and submit the batch job
cp $CORENGS/tacc/aln_script.cmds .
launcher_makercreator.py -nj aln_script.cmds -n aln_script -t 201:00:00 -w 4 -a UT-2015-05-18 -vq normal
sbatch --reservation=intro_NGS aln_script.slurm --reservation=CCBB
showq -u

While we're waiting for the job to complete, lets look at the aln_script.cmds file.

Code Block
languagebash
titleCommands to run multiple alignment scripts
/work/projects/BioITeam/common/script/align_bwa_illumina.sh     global ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bwa_global sacCer3 1 50
/work/projects/BioITeam/common/script/align_bwa_illumina.sh     local  ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bwa_local  sacCer3 1
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh global ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bt2_global sacCer3 1 50
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh local  ./fastq/Sample_Yeast_L005_R1.cat.fastq.gz bt2_local  sacCer3 1

Notes:

  • The 1st command performs a paired-end BWA global alignment (similar to what we did beforeabove), but asks that the 100 bp reads be trimmed to 50 first.
    • we refer to the pre-built index for yeast by name: sacCer3
      • this index is located in the /work/projects/BioITeam/ref_genome/bwa/bwtsw/sacCer3/ directory
    • we provide the name of the R1 FASTQ file
      • because we request a PE alignment (the 1 argument) the script will look for a similarly-named R2 file.
    • all output files associated with this command will be named with the prefix bwa_global.
  • The 2nd command performs a paired-end BWA local alignment.
    • all output files associated with this command will be named with the prefix bwa_local.
    • no trimming is requested because the local alignment should ignore 5' and 3' bases that don't match the reference genome
  • The 3rd command performs a paired-end Bowtie2 global alignment.
    • the Bowtie2 alignment script has the same first arguments as the BWA alignment script.
    • all output files associated with this command will be named with the prefix bt2_global.
    • again, we specify that reads should first be trimmed to 50 bp.
  • The 4th command performs a paired-end Bowtie2 local alignment.
    • all output files associated with this command will be named with the prefix bt2_local.
    • again, no trimming is requested for the local alignment.

...

Code Block
..Done alignmentUtils.pl bamstats - 20182020-0506-1914 1423:3419:4338
.. samstats file 'bwa_global.samstats.txt' exists 2018-05-19 14:34:43
..samstats file 'bwa_global.samstats.txt' size ok 2018-05-19 14:34:43
and is not empty - 2020-06-14 23:19:38
===============================================================================
## Cleaning up files (keep 0) - 20182020-0506-1914 1423:3419:43
38
===============================================================================
ckRes 0 cleanup
===============================================================================
## All bwa alignment tasks completed successfully! - 20182020-0506-1914 1423:3419:4338
===============================================================================

Notice that success message:  "All bwa alignment tasks completed successfully!". It should only appear once in any successful alignment log.

...