Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Like other tools you've worked with so far, you first need to load bwa. Do that now, and then enter bwa with no arguments to view the top-level help page (many NGS tools will provide some help when called with no arguments). bwa is available as a BioContainers. module.

...

Make sure you're in

...

an idev session

Code Block
languagebash
titleStart an idev session
idev -m 
120
180 -N 1 -A OTH21164 -r
CoreNGSday4
 CoreNGS-Thu
# or
idev -m 90 -N 1 -A OTH21164 -p development


Code Block
languagebash
module load biocontainers  # takes a while
module load bwa
bwa

...

Code Block
languagebash
titlePrepare BWA reference directory for sacCer3
mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3
cd $SCRATCH/core_ngs/references/bwa/sacCer3
ln -ssf ../../fasta/sacCer3.fa
ls -l

...

Code Block
languagebash
titlePrepare to align yeast data
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
ln -ssf -f ../fastq
ln -ssf -f ../../references/bwa/sacCer3

...

Expand
titleHint


Code Block
languagebash
bwa aln


Required arguments are a <prefix> of the bwa index files, and the input FASTQ file. There are lots of options, but here is a summary of the most important ones.

...

You should now have a SAM file (yeast_pe.sam) that contains the alignments. It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.

Exercise: How many lines does the SAM file have? How does this compare to the number of input sequences (R1+R2)?

Expand
titleAnswer

wc -l yeast_pe.sam  reports 1,184,378 lines

The alignment SAM file will contain records for both R1 and R2 reads, so we need to count sequences in both files.

zcat ./fastq/Sample_Yeast_L005_R[12]*gz | wc -l | awk '{print $1/4}' reports 1,184,360 reads that were aligned

So the SAM file has 18 more lines than the R1+R2 total. These are the header records that appear before any alignment records.

It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.

Exercise: Exercise: What kind of information is in the first lines of the SAM file?

...

Expand
titleHint

This looks for the pattern  '^HWI' which is the start of every read name (which starts every alignment record).
Remember -c says just count the records, don't display them.

Code Block
languagebash
grep -P -c '^HWI' yeast_pe.sam

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all header lines, which start with @.

Code Block
languagebash
grep -P -v -c '^@' yeast_pe.sam
Expand
titleAnswer
There are 1,184,360 alignment records.

Exercise: How many sequences were in the R1 and R2 FASTQ files combined?

.

Code Block
languagebash
grep -P -v -c '^@' yeast_pe.sam
Expand
titleHint
zcat fastq/Sample_Yeast_L005_R[12].cat.fastq.gz | wc -l | awk '{print $1/4}'



Expand
titleAnswer
There were a total of are 1,184,360 original sequences (R1s + R2s)alignment records.

Exercises:

...

  • Does the SAM file contain both mapped and un-mapped reads?
  • What is the order of the alignment records in this SAM file?

Expand
titleAnswers

Both R1 and R2 reads must have separate alignment records, because there were 1,184,360 R1+R2 reads and the same number of alignment records.

The SAM file must contain both mapped and unmapped un-mapped reads, because there were 1,184,360 R1+R2 reads and the same number of alignment records.

Alignment records occur in the same read-name order as they did in the FASTQ, except that they come in pairs. The R1 read comes 1st, then the corresponding R2. This is called read name ordering.

...