...
- The -P option tells grep to use Perl-style regular expression patterns.
- This makes including special characters like Tab ( \t ), Carriage Return ( \r ) or Linefeed ( \n ) much easier that the default Posix paterns.
- While it is not really required here, it generally never hurts doesn't hurt to include this option.
'^>' is the regular expression describing the pattern we're looking for (described more below)
- sacCer3.fa is the file to search. Lines with text that match our pattern will be written to standard output; non matching lines will be omitted.
- We pipe to more just in case there are a lot of contig names.
...
First, the single quotes around the pattern – they are only a signal for the bash shell. As part of its friendly command line parsing and evaluation, the shell will often look for special characters on the command line that mean something to it (for example, the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those single quotes tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". So the The shell will obey, will strip the single quotes off the string, and will pass the actual pattern, ^>, to the grep program. (Aside: We've see that the shell does look inside double quotes ( " ) for certain special signals, such as looking for environment variable names to evaluate.)
...
We might be able to get away with just using this literal alone as our regex, specifying '>' as the command line argument. But for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special carat ( ^ ) character represents "beginning of line". So ^> means "beginning of a line followed by a > character (, followed by anything). (Aside: the dollar sign ( $ ) character represents "end of line" in a regex. There are many other special characters, including period ( . ), question mark ( ? ), pipe ( | ), parentheses ( ( ) ), and brackets ( [ ] ), to name the most common.)
...
Now, we're ready to execute the actual alignment, with the goal of initially producing a SAM /BAM file from the input FASTQ files and reference. We will first generate SAI files from each of the FASTQ files with the reference individually using the 'aln' command, then combine them (with the reference) into one SAM/BAM output file using the 'sampe' command. We need a directory to put the alignments when they are finished, as well as any intermediate files, so create a directory called 'alignments'. The command flow, all together, is as follows. Notice how each file is in its proper directory, which requires us to specify the whole file path in the alignment commands.First go to the align directory, and link to the sacCer3 reference directory (this will make our commands more readable).
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/align
ln -s $WORK/archive/references/bwa/sacCer3
ls sacCer3 |
As our workflow indicated, we first use bwa aln on the R1 and R2 FASTQs, producing a BWA-specific .sai intermediate binary files. Since these alignments are completely independent, we can execute them in parallel in a batch job.
What does bwa aln needs in the way of arguments?
Expand | ||
---|---|---|
| ||
bwa aln |
There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things.
Option | Effect |
---|---|
-k | Controls the number of mismatches allowable in the seed of each alignment (default = 2) |
-l | Controls the length of the seed (default = 32) |
-n | Controls the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04) |
-t | Controls the number of threads |
The rest of the options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on. Much more information can be accessed at the BWA manual page.
For a simple alignment like this, we can just go with the default alignment parameters, with one exception. At TACC, we want to optimize our alignment speed by allocating more than one thread (-t) to the alignment. We want to run 2 tasks, and will use a minimum of one 16-core node. So we can assign 8 cores to each alignment by specifying -t 8.
Also note that bwa writes its (binary) output to standard output by default, so we need to redirect that to a .sai file. And of course we need to redirect standard error to a log file, one per file.
Create an aln.cmds file (using nano) with the following lines:
Code Block | ||||
---|---|---|---|---|
| ||||
bwa aln -t 8 sacCer3/sacCer3.fa fq/Sample_Yeast_L005_R1.cat.fastq.gz > yeast_R1.sai 2> aln.yeast_R1.log
bwa aln -t 8 sacCer3/sacCer3.fa fq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_R2.sai 2> aln.yeast_R2.log |
Create the batch submission script specifying a wayness of 8 (8 tasks per node) on the normal queue and a time of 1 hour, then submit the job and monitor the queue:
Code Block | ||||
---|---|---|---|---|
| ||||
launcher_creator.py -n aln -j aln.cmds -t 01:00:00 -q normal -w 8
sbatch aln.slurm
showq -u |
Since you have directed standard error to log files, you can use a neat trick to monitor the progress of the alignment: tail -f. The -f means "follow" the tail, so new lines at the end of the file are displayed as they are added to the file.
Code Block | ||||
---|---|---|---|---|
| ||||
# Use Ctrl-c to stop the output any time
tail -f aln.yeast_R1.log |
When it's done you should see two .sai files. Next we use the bwa sampe command to pair the reads and output SAM format data. For this command you provide the same reference prefix as for bwa aln, along with the two .sai files and the two original FASTQ files.
Again bwa writes its output to standard output, so redirect that to a .sam file. (Note that bwa sampe is "single threaded" – it does not have an option to use more than one processor for its work.) We'll just execute this at the command line – not in a batch job.
Code Block | ||||
---|---|---|---|---|
| ||||
bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai fq | ||||
Code Block | ||||
cds mkdir alignments bwa aln references/yeast/sacCer3.fa fastq_align/Sample_Yeast_L005_R1.cat.fastq.gz > alignments/yeast_R1.sai bwa aln references/yeast/sacCer3.fa fastq_align/Sample_Yeast_L005_R2.cat.fastq.gz > alignments/yeast_R2.sai bwa sampe references/yeast/sacCer3.fa alignments/yeast_R1.sai alignments/yeast_R2.sai fastq_align/Sample_Yeast_L005_R1.cat.fastq.gz fastq_alignfq/Sample_Yeast_L005_R2.cat.fastq.gz > alignments/yeast_pairedend.sam |
You did it! In the alignments directory, there should exist the two intermediate files (the SAI files), along with the You should now have a SAM file that contains the alignments. It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. In the next section, with samtools, you'll learn some additional ways to analyze the data. BWA, however, is a lot more complex than the above commands let on. If you look at the help pages for 'bwa aln' in particluar, there are numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things. There are lots of options, but here is a summary of the most important ones.
Option | Effect |
---|---|
-k | Controls the number of mismatches allowable in the seed of each alignment (default = 2) |
-n | Controls the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04) |
-l | Controls the length of the seed (default = 32) |
-t | Controls the number of threads |
you feel like. In the next section, with samtools, you'll learn some additional ways to analyze the data once you create a BAM file.
Exercise: What kind of information is in the first lines of the SAM file?
Expand | ||
---|---|---|
| ||
The SAM or BAM has a number of header lines, which all start with an at sign ( @ ). The @SQ lines describe each contig and its length. There is also a @PG line that describes the way the bwa sampe was performed. |
Exercise: How many alignment records (not header records) are in the SAM file?
Expand | ||
---|---|---|
| ||
This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record).
Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern, here the header lines starting with @.
|
Expand | ||
---|---|---|
| ||
There are 1184360 alignment records. |
The rest of the options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on. Much more information can be accessed at the BWA manual page.
Exercise #2: Bowtie2 and Local Alignment - Human microRNA-seq
...