Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Based on the usage description, we only need to specify two things:

  • the The name of the FASTA file  
  • whether Whether to use the bwtsw or is algorithm for indexing

...

Expand
titleSetup (if needed)
Code Block
languagebash
titleGet the alignment exercises files
mkdir -p $SCRATCH/core_ngs/alignment/fastq
mkdir -p $SCRATCH/core_ngs/references/fasta
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
cp $CORENGS/references/*.*fa $SCRATCH/core_ngs/references/fasta/

...

First, the single quotes around the pattern – this tells the bash shell to pass the exact string contents to grep.

As part of its friendly command line parsing and evaluation, the shell will often look for special characters on the command line that mean something to it (for example, the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those single quotes tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". The shell will obey, will strip the single quotes off the string, and will pass the actual pattern, ^>, to the grep program. (Note that the shell does look inside double quotes ( " ) for certain special signals, such as looking for environment variable names to evaluate. Read more about Quoting in the shell.)

...

Now, we're ready to execute the actual alignment, with the goal of initially producing a SAM file from the input FASTQ files and reference. First prepare a directory for this exercise and link the sacCer3 reference directories there (this will make our commands more readable).

Expand
titleSetup (if needed)
Code Block
languagebash
# Copy the pre-built references
mkdir -p $SCRATCH/core_ngs/references
rsynccp -avrP $CORENGS/references/*.fa $SCRATCH/core_ngs/references/fasta/

# Get the FASTQ to align
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/

...

For a basic alignment like this, we can just go with the default alignment parameters.

Also note Note that bwa writes its (binary) output to standard output by default, so we need to redirect that to a .sai file.

...

Code Block
languagebash
titleStart an idev session
idev -p normal -m 180 -N 1 -n 24 -A UT-2015-05-18 --reservation=CCBBintro_NGS
Tip

You can tell you're in a idev session because the hostname command will return a compute node name (e.g. nid00438) instead of a login node name (e.g. login5).

For simplicity, we will just execute these commands directly, one at a time. Each command should only take few minutes and you will see bwa's progress messages in your terminal.

...

Code Block
languagebash
titlePairing of BWA R1 and R2 aligned reads
bwa sampe sacCer3/sacCer3.fa yeast_R1.sai yeast_R2.sai \
  fastq/Sample_Yeast_L005_R1.cat.fastq.gz \
  fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pairedend.sam

...

Expand
titleHint

This looks for the pattern  '^HWI' which is the start of every read name (which starts every alignment record).
Remember -c says just count the records, don't display them.

Code Block
languagebash
grep -P -c '^HWI' yeast_pairedend.sam

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all the header lines that , which start with @.

Code Block
languagebash
grep -P -v -c '^@' yeast_pairedend.sam


...

  • Do both R1 and R2 reads have separate alignment records?
  • Does the SAM file contain both mapped and un-mapped reads?
  • What is the order of the alignment records in this SAM file?

Expand
titleAnswers

Both R1 and R2 reads must have separate alignment records, because there were 1,184,360 R1+R2 reads and the same number of alignment records.

The SAM file must contain both mapped and un-mapped reads, because there were 1,184,360 R1+R2 reads and the same number of alignment records.

Alignment records occur in the same read-name order as they did in the FASTQ, except that they come in pairs. The R1 read comes 1st, then the corresponding R2. This is called read name ordering.

...

You may have noticed that some alignment records contain contig names (e.g. chrV) in field 3 while others contain an asterisk ( * ). Usually the The * means the record didn't map. (This isn't always true – later you'll see how to properly distinguish between mapped and unmapped reads using samtools.) We're going to use this heuristic along with cut to see about how many records represent aligned sequences. (Note this is not the strictly correct method of finding unmapped reads because not all unmapped reads have an asterisk in field 3. Later you'll see how to properly distinguish between mapped and unmapped reads using samtools.)

First we need to make sure that we don't look at fields in the SAM header lines. We're going to end up with a series of pipe operations, and the best way to make sure you're on track is to enter them one at a time piping to head:

Code Block
languagebash
titleGrep pattern that doesn't match header
# the ^HWI^@ pattern matches lines starting with HWI@ (the start of all read names in column 1)only header lines), 
# and -v says output lines that don't match
grep -v -P '^HWI^@' yeast_pairedend.sam | head

...

Code Block
languagebash
titleGet contig name info with cut
grep -v -P '^HWI^@' yeast_pairedend.sam | cut -f 3 | head

...

Code Block
languagebash
titleFilter contig name of * (unaligned)
grep -v -P '^HWI^@' yeast_pairedend.sam | cut -f 3 | grep -v '*' | head

...

Code Block
languagebash
titleCount aligned SAM records
grep -v -P '^HWI^@' yeast_pairedend.sam | cut -f 3 | grep -v '*' | wc -l

...