Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • The -P option tells grep to use Perl-style regular expression patterns.
    • This makes including special characters like Tab ( \t ), Carriage Return ( \r ) or Linefeed ( \n ) much easier that the default Posix paterns.
    • While it is not really required here, it generally never hurts doesn't hurt to include this option.
  • '^>' is the regular expression describing the pattern we're looking for (described more below)

  • sacCer3.fa is the file to search. Lines with text that match our pattern will be written to standard output; non matching lines will be omitted.
  • We pipe to more just in case there are a lot of contig names.

...

First, the single quotes around the pattern – they are only a signal for the bash shell. As part of its friendly command line parsing and evaluation, the shell will often look for special characters on the command line that mean something to it (for example, the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those single quotes tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". So the The shell will obey, will strip the single quotes off the string, and will pass the actual pattern, ^>, to the grep program. (Aside: We've see that the shell does look inside double quotes ( " ) for certain special signals, such as looking for environment variable names to evaluate.)

...

We might be able to get away with just using this literal alone as our regex, specifying '>' as the command line argument. But for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special carat ( ^ ) character represents "beginning of line". So ^> means "beginning of a line followed by a > character (, followed by anything). (Aside: the dollar sign ( $ ) character represents "end of line" in a regex. There are many other special characters, including period ( . ), question mark ( ? ), pipe ( | ), parentheses ( ( ) ), and brackets ( [ ] ), to name the most common.)

...

Now, we're ready to execute the actual alignment, with the goal of initially producing a SAM /BAM file from the input FASTQ files and reference.  We will first generate SAI files from each of the FASTQ files with the reference individually using the 'aln' command, then combine them (with the reference) into one SAM/BAM output file using the 'sampe' command.   We need a directory to put the alignments when they are finished, as well as any intermediate files, so create a directory called 'alignments'.  The command flow, all together, is as follows. Notice how each file is in its proper directory, which requires us to specify the whole file path in the alignment commands.First go to the align directory, and link to the sacCer3 reference directory (this will make our commands more readable).

Code Block
languagebash
titlePrepare to align yeast data
cd $SCRATCH/core_ngs/align
ln -s $WORK/archive/references/bwa/sacCer3
ls sacCer3

As our workflow indicated, we first use bwa aln on the R1 and R2 FASTQs, producing a BWA-specific .sai intermediate binary files. Since these alignments are completely independent, we can execute them in parallel in a batch job.

What does bwa aln needs in the way of arguments?

Expand
titleHint
bwa aln

There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things. 

OptionEffect
-kControls the number of mismatches allowable in the seed of each alignment (default = 2)
-lControls the length of the seed (default = 32)
-nControls the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04)
-tControls the number of threads

The rest of the options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on.  Much more information can be accessed at the BWA manual page.

For a simple alignment like this, we can just go with the default alignment parameters, with one exception. At TACC, we want to optimize our alignment speed by allocating more than one thread (-t) to the alignment. We want to run 2 tasks, and will use a minimum of one 16-core node. So we can assign 8 cores to each alignment by specifying -t 8.

Also note that bwa writes its (binary) output to standard output by default, so we need to redirect that to a .sai file. And of course we need to redirect standard error to a log file, one per file.

Create an aln.cmds file (using nano) with the following lines:

Code Block
languagebash
titlebwa aln commands for yeast R1 and R2
bwa aln -t 8 sacCer3/sacCer3.fa fq/Sample_Yeast_L005_R1.cat.fastq.gz > yeast_R1.sai 2> aln.yeast_R1.log
bwa aln -t 8 sacCer3/sacCer3.fa fq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_R2.sai 2> aln.yeast_R2.log

Create the batch submission script specifying a wayness of 8 (8 tasks per node) on the normal queue and a time of 1 hour, then submit the job and monitor the queue:

Code Block
languagebash
titleSubmit BWA alignment job
launcher_creator.py -n aln -j aln.cmds -t 01:00:00 -q normal -w 8
sbatch aln.slurm
showq -u

Since you have directed standard error to log files, you can use a neat trick to monitor the progress of the alignment: tail -f. The -f means "follow" the tail, so new lines at the end of the file are displayed as they are added to the file.

Code Block
languagebash
titleMonitor alignment progress with tail -f
# Use Ctrl-c to stop the output any time
tail -f aln.yeast_R1.log

When it's done you should see two .sai files. Next we use the bwa sampe command to pair the reads and output SAM format data. For this command you provide the same reference prefix as for bwa aln, along with the two .sai files and the two original FASTQ files.

Again bwa writes its output to standard output, so redirect that to a .sam file. (Note that bwa sampe is "single threaded" – it does not have an option to use more than one processor for its work.) We'll just execute this at the command line – not in a batch job.

Code Block
languagebash
titleBWA global alignment of R1 reads
bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai fq
Code Block
cds
mkdir alignments
bwa aln references/yeast/sacCer3.fa fastq_align/Sample_Yeast_L005_R1.cat.fastq.gz > alignments/yeast_R1.sai
bwa aln references/yeast/sacCer3.fa fastq_align/Sample_Yeast_L005_R2.cat.fastq.gz > alignments/yeast_R2.sai
bwa sampe references/yeast/sacCer3.fa alignments/yeast_R1.sai alignments/yeast_R2.sai fastq_align/Sample_Yeast_L005_R1.cat.fastq.gz fastq_alignfq/Sample_Yeast_L005_R2.cat.fastq.gz > alignments/yeast_pairedend.sam

You did it!  In the alignments directory, there should exist the two intermediate files (the SAI files), along with the   You should now have a SAM file that contains the alignments.  It's just a text file, so take a look with head, more, less, tail, or whatever you feel like.  In the next section, with samtools, you'll learn some additional ways to analyze the data. BWA, however, is a lot more complex than the above commands let on.  If you look at the help pages for 'bwa aln' in particluar, there are numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things.  There are lots of options, but here is a summary of the most important ones.

OptionEffect
-kControls the number of mismatches allowable in the seed of each alignment (default = 2)
-nControls the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04)
-lControls the length of the seed (default = 32)
-tControls the number of threads

you feel like.  In the next section, with samtools, you'll learn some additional ways to analyze the data once you create a BAM file.

Exercise: What kind of information is in the first lines of the SAM file?

Expand
titleAnswer
The SAM or BAM has a number of header lines, which all start with an at sign ( @ ). The @SQ lines describe each contig and its length. There is also a @PG  line that describes the way the bwa sampe was performed.

Exercise: How many alignment records (not header records) are in the SAM file?

Expand
titleHint

This looks for the pattern  '^HWI' which is the start of every read name (which starts every alignment record).
Remember -c says just count the records, don't display them.

grep -P -c '^@' yeast_pairedend.sam

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern, here the header lines starting with @.

grep -P -v -c '^HWI' yeast_pairedend.sam


Expand
titleAnswer
There are 1184360 alignment records.


 

 The rest of the options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on.  Much more information can be accessed at the BWA manual page.

Exercise #2: Bowtie2 and Local Alignment - Human microRNA-seq

...