
...

With that, we're ready to get started on the first exercise.

Exercise

...

#1: BWA – Yeast ChIP-seq

Overview ChIP-seq alignment workflow with BWA

...

Expand

title	Hint

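This looks for the pattern '^HWI', which matches the start of every read name. The read name is the first field of every alignment record, so the pattern matches alignment records but not header lines.
Remember that -c tells grep to just count the matching records instead of displaying them.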

Code Block

language	bash

grep -P -c '^HWI' yeast_pairedend.sam

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern, here the header lines starting with @.

Code Block

language	bash

grep -P -v -c '^@' yeast_pairedend.sam
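The two counting approaches can be compared on a tiny stand-in file. This is only a sketch: the six records below are made-up data, not taken from yeast_pairedend.sam, and the -P option is left out on the assumption that these simple patterns do not need Perl-style regular expressions.

```bash
# Build a tiny stand-in SAM file (hypothetical data, not the course file).
toy_sam=$(mktemp)
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf '@SQ\tSN:chrV\tLN:1000\n'
  printf 'HWI-r1\t99\tchrV\t100\t37\t4M\t=\t200\t104\tACGT\tIIII\n'
  printf 'HWI-r1\t147\tchrV\t200\t37\t4M\t=\t100\t-104\tACGT\tIIII\n'
  printf 'HWI-r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
  printf 'HWI-r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} > "$toy_sam"

# Count lines that start with HWI (the alignment records).
count_by_name=$(grep -c '^HWI' "$toy_sam")

# Count lines that are NOT header lines (header lines start with @).
count_by_invert=$(grep -v -c '^@' "$toy_sam")

echo "by read name: $count_by_name, by inverted header match: $count_by_invert"
rm -f "$toy_sam"
```

Both counts should agree, since every line in a SAM file is either a header line or an alignment record.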

Expand

title	Answer

There are 1,184,360 alignment records.

...

Expand

title	Answers

Do both R1 and R2 reads have separate alignment records?
- yes, they must, because there were 1,184,360 R1+R2 reads and an equal number of alignment records
Does the SAM file contain both aligned and un-aligned reads?
- yes, it must, because there were 1,184,360 R1+R2 reads and an equal number of alignment records
What is the order of the alignment records in this SAM file?
- the names occur in the exact same order as they did in the FASTQ, except that they come in pairs
  - the R1 read comes first, then its corresponding R2
- this ordering is called read name ordering

Using cut to isolate fields

Suppose you wanted to look only at field 3 (contig name) values in the SAM file. You can do this with the handy cut command. Below is a simple example where you're asking cut to display the 3rd field of the last 10 alignment records.

Code Block

language	bash
title	Cut syntax for a single field

tail yeast_pairedend.sam | cut -f 3

By default cut assumes the field delimiter is Tab, which is the delimiter used in the majority of NGS file formats. You can, of course, specify a different delimiter with the -d option.
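As a minimal sketch of the -d option, here is cut applied to a comma-separated line. The line itself is a made-up example, not data from the course files.

```bash
# A hypothetical comma-separated record (illustration only).
line='read1,chrV,100,37'

# -d ',' switches the delimiter from Tab to comma; -f 2 selects the 2nd field.
contig=$(printf '%s\n' "$line" | cut -d ',' -f 2)
echo "$contig"

# Field lists work the same way with a custom delimiter: field 1, then 3 onward.
subset=$(printf '%s\n' "$line" | cut -d ',' -f 1,3-)
echo "$subset"
```

Note that cut keeps the chosen delimiter between the fields it prints.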

You can also specify a range of fields, and mix adjacent and non-adjacent fields. This displays fields 2 through 6, field 9, and all fields starting with 12 (SAM tag fields).

Code Block

language	bash
title	Cut syntax for multiple fields

tail yeast_pairedend.sam | cut -f 2-6,9,12-

You may have noticed that some alignment records contain contig names (e.g. chrV) in field 3 while others contain an asterisk ( * ). Usually the * means the record didn't align. (This isn't always true – later you'll see how to properly distinguish between mapped and unmapped reads using samtools.) We're going to use this heuristic along with cut to see about how many records represent aligned sequences.

First we need to make sure that we don't look at fields in the SAM header lines. We're going to end up with a series of pipe operations, and the best way to make sure you're on track is to enter them one at a time piping to head:

Code Block

language	bash
title	Grep pattern that doesn't match header

# the ^HWI pattern matches lines starting with HWI (the start of all read names in column 1)
grep -P '^HWI' yeast_pairedend.sam | head

Ok, it looks like we're seeing only alignment records. Now let's pull out only field 3 using cut:

Code Block

language	bash
title	Get contig name info with cut

grep -P -v '^@' yeast_pairedend.sam | cut -f 3 | head

Cool, we're only seeing the contig name info now. Next we use grep again, this time piping our contig info to it and using the -v (invert) switch to print only the lines that don't match the pattern:

Code Block

language	bash
title	Filter contig name of * (unaligned)

grep -P -v '^@' yeast_pairedend.sam | cut -f 3 | grep -v '*' | head

Perfect! We're only seeing real contig names that (usually) represent aligned reads. Let's count them by piping to wc -l (and omitting head of course – we want to count everything).

Code Block

language	bash
title	Count aligned SAM records

grep -P -v '^@' yeast_pairedend.sam | cut -f 3 | grep -v '*' | wc -l
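The whole pipeline can be tried end to end on a small stand-in file before running it on the real data. The sketch below uses made-up records (two aligned pairs and one unaligned pair) and assumes contig names sit in field 3, as described above. It also uses grep -F, which is a choice made here rather than part of the original commands, so that the asterisk is treated as a literal character.

```bash
# Build a tiny stand-in SAM file (hypothetical data, not the course file).
toy_sam=$(mktemp)
{
  printf '@SQ\tSN:chrV\tLN:1000\n'
  printf '@SQ\tSN:chrII\tLN:1000\n'
  printf 'HWI-r1\t99\tchrV\t100\t37\t4M\t=\t200\t104\tACGT\tIIII\n'
  printf 'HWI-r1\t147\tchrV\t200\t37\t4M\t=\t100\t-104\tACGT\tIIII\n'
  printf 'HWI-r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
  printf 'HWI-r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
  printf 'HWI-r3\t99\tchrII\t300\t37\t4M\t=\t400\t104\tACGT\tIIII\n'
  printf 'HWI-r3\t147\tchrII\t400\t37\t4M\t=\t300\t-104\tACGT\tIIII\n'
} > "$toy_sam"

# Drop header lines, keep field 3 (contig name), drop '*' entries, count the rest.
# tr strips the padding that some wc implementations put before the number.
aligned=$(grep -v '^@' "$toy_sam" | cut -f 3 | grep -v -F '*' | wc -l | tr -d ' ')

# Total alignment records, for comparison.
total=$(grep -v -c '^@' "$toy_sam")

echo "$aligned of $total records have a contig name"
rm -f "$toy_sam"
```

Building the pipeline on a file small enough to check by eye makes it easier to catch a wrong field number before trusting the count from the full file.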

Exercise: About how many records represent aligned sequences? What alignment rate does this represent?

Expand

title	Answer

The expression above returns 612,968. There were 1,184,360 records total, so the percentage is:

Code Block

language	bash
title	Calculate alignment rate

echo $((612968 * 100 / 1184360))

or about 52% (the shell's integer arithmetic truncates 51.76 to 51). Not great.
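Bash arithmetic works on integers only, so the fractional part of the percentage is dropped. One way to keep a decimal place, sketched here with awk as an assumed alternative rather than the course's own method, is:

```bash
aligned=612968    # records with a contig name, from the count above
total=1184360     # all alignment records

# Integer arithmetic in bash truncates the result.
int_rate=$(( aligned * 100 / total ))

# awk does floating-point math; %.1f rounds to one decimal place.
dec_rate=$(awk -v a="$aligned" -v t="$total" 'BEGIN { printf "%.1f", a * 100 / t }')

echo "integer: ${int_rate}%  decimal: ${dec_rate}%"
```

This prints 51 for the integer form and 51.8 for the awk form.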

Exercise: What might we try in order to improve the alignment rate?

Expand

title	Answer

Recall that these are 100 bp reads and we did not remove adapter contamination. There will be a distribution of fragment sizes – some will be short – and those short fragments may not align without adapter removal (fastx_trimmer or cutadapt).

Exercise #2: Bowtie2 and Local Alignment - Human microRNA-seq

...
