Page History

...

Code Block

language	bash
title	Get contig name info with cut

grep -P -v'^HWI '^@' yeast_pairedend.sam | cut -f 3 | head

...

Code Block

language	bash
title	Filter contig name of * (unaligned)

grep -P -v '^@^HWI' yeast_pairedend.sam | cut -f 53 | grep -v '*' | head

Perfect! We're only seeing real contig names that (usually) represent aligned reads. Let's count them by piping to wc -l (and omitting omit head of course – we want to count everything).

Code Block

language	bash
title	Count unaligned aligned SAM records

grep -P -v '^@^HWI' yeast_pairedend.sam | cut -f 53 | grep -v '*' | wc -l

Exercise: About how many records represent aligned sequences? What alignment rate does this represent?

...

Expand

title	What's going on?

Parameters are:

--local – local alignment mode
-L 16 – seed length 16
-N 1 – allow 1 mismatch in the seed
-x mb20/hairpin_cDNA_hsa.fa – prefix path of index files
-Ufq/human_mirnaseq.fastq.gz – FASTQ file for single-end (Unpaired) alignment
-Shuman_mirnaseq.sam – tells bowtie2 to report alignments in SAM format to the specified file

...

Create a commands file called bt2.cmds with this task definition then generate and submit a batch job for it (time 1 hour, normal queue).

Expand

title	What's going on?

Use nano to create the bt2.cmds file. Then:

Code Block

language	bash
title	Local bowti2 alignment of miRNA data

launcher_creator.py -n bt2 -j bt2.cmds -t 01:00:00 -q normal
sbatch bt2.slurm; showq -u

...

Such is the nature of bowtie2 – it it can be a powerful tool to sift out the alignments you want from a messy dataset with limited information, but doing so requires careful tuning of the parameters, which can take quite a few trials to figure out.

Exercise: About how many records in human_mirnaseq.sam represent aligned reads?

Expand

title	Solution

We can use our cut / grep trick from Exercise #1, but on the human_mirnaseq.sam file. Since all read names in this file start with TUPAC, we'll use that pattern to select non-header lines.

Code Block

language	bash
title	Count aligned SAM records

grep -P '^TUPAC' human_mirnaseq.sam | cut -f 3 | grep -v '*' | wc -l

This expressions returns 221086.

Use sort and uniq to create a histogram of mapping qualities

The mapping quality score is in field 5 of the human_mirnaseq.sam file. We can do this to pull out only that field:

Code Block

language	bash
title	Cut mapping quality SAM field

grep -P '^TUPAC' human_mirnaseq.sam | cut -f 5 | head

We will use the uniq create a histogram of these values. The first part of the --help for uniq says:

Code Block

title	What uniq does

Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.
Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences

To create a histogram, we want to organize all equal mapping quality score lines into an adjacent block, then use uniq -c option to count them. The sort -n command does the sorting into blocks (-n means numerical sort). So putting it all together, and piping the output to a pager just in case, we get:

Code Block

language	bash
title	Cut mapping quality SAM field

grep -P '^TUPAC' human_mirnaseq.sam | cut -f 5 | sort -n | uniq -c | more

Exercise: What is the flaw in this "program"?

Expand

title	Answer

We are looking at mapping quality values for both aligned and un-aligned records, but mapping quality only makes sense for aligned reads. This expression does not distinguish between mapping quality = 0 because the read mapped to multiple locations, and mapping quality = 0 because the sequence did not align.

The proper solution will await the use of samtools to filter out unmapped reads.

Exercise #3: BWA-MEM - Human mRNA-seq

...

Page tree

Versions Compared

Old Version 58

New Version 59

Key

Use sort and uniq to create a histogram of mapping qualities

Exercise #3: BWA-MEM - Human mRNA-seq