Page History

...

Like other tools you've worked with so far, you first need to load bwa. Do that now, and then enter bwa with no arguments to view the top-level help page (many NGS tools will provide some help when called with no arguments). bwa is available as a BioContainers. module.

...

Make sure you're in

...

an idev session

Code Block

language	bash
title	Start an idev session

idev -m

120

180 -N 1 -A OTH21164 -r

CoreNGSday4

 CoreNGS-Thu
# or
idev -m 90 -N 1 -A OTH21164 -p development

Code Block

language	bash

module load biocontainers  # takes a while
module load bwa
bwa

...

Code Block

language	bash
title	Prepare BWA reference directory for sacCer3

mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3
cd $SCRATCH/core_ngs/references/bwa/sacCer3
ln -ssf ../../fasta/sacCer3.fa
ls -l

...

Code Block

language	bash
title	Prepare to align yeast data

mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
ln -ssf -f ../fastq
ln -ssf -f ../../references/bwa/sacCer3

...

Expand

title	Hint

Code Block

language	bash

bwa aln

Required arguments are a <prefix> of the bwa index files, and the input FASTQ file. There are lots of options, but here is a summary of the most important ones.

...

You should now have a SAM file (yeast_pe.sam) that contains the alignments. It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.

Exercise: How many lines does the SAM file have? How does this compare to the number of input sequences (R1+R2)?

Expand

title	Answer

wc -l yeast_pe.sam reports 1,184,378 lines

The alignment SAM file will contain records for both R1 and R2 reads, so we need to count sequences in both files.

zcat ./fastq/Sample_Yeast_L005_R[12]*gz | wc -l | awk '{print $1/4}' reports 1,184,360 reads that were aligned

So the SAM file has 18 more lines than the R1+R2 total. These are the header records that appear before any alignment records.

It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.

Exercise: Exercise: What kind of information is in the first lines of the SAM file?

...

Expand

title	Hint

This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record).
Remember -c says just count the records, don't display them.

Code Block

language	bash

grep -P -c '^HWI' yeast_pe.sam

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all header lines, which start with @.

Code Block

language	bash

grep -P -v -c '^@' yeast_pe.sam

Expand

title	Answer

There are 1,184,360 alignment records.

Exercise: How many sequences were in the R1 and R2 FASTQ files combined?

.

Code Block

language	bash

grep -P -v -c '^@' yeast_pe.sam

Expand

title	Hint

zcat fastq/Sample_Yeast_L005_R[12].cat.fastq.gz | wc -l | awk '{print $1/4}'

Expand

title	Answer

There were a total of are 1,184,360 original sequences (R1s + R2s)alignment records.

Exercises:

...

Does the SAM file contain both mapped and un-mapped reads?
What is the order of the alignment records in this SAM file?

Expand

title	Answers

Both R1 and R2 reads must have separate alignment records, because there were 1,184,360 R1+R2 reads and the same number of alignment records.

The SAM file must contain both mapped and unmapped un-mapped reads, because there were 1,184,360 R1+R2 reads and the same number of alignment records.

Alignment records occur in the same read-name order as they did in the FASTQ, except that they come in pairs. The R1 read comes 1st, then the corresponding R2. This is called read name ordering.

...

Page tree

Versions Compared

Old Version 194

New Version 195

Key