Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titleGet contig name info with cut
grep -P -v'^HWI '^@' yeast_pairedend.sam | cut -f 3 | head

...

Code Block
languagebash
titleFilter contig name of * (unaligned)
grep -P -v '^@^HWI' yeast_pairedend.sam | cut -f 53 | grep -v '*' | head

Perfect! We're only seeing real contig names that (usually) represent aligned reads. Let's count them by piping to wc -l (and omitting omit head of course – we want to count everything).

Code Block
languagebash
titleCount unaligned aligned SAM records
grep -P -v '^@^HWI' yeast_pairedend.sam | cut -f 53 | grep -v '*' | wc -l

Exercise: About how many records represent aligned sequences? What alignment rate does this represent?

...

Expand
titleWhat's going on?

Parameters are:

  • --local – local alignment mode
  • -L 16 – seed length 16
  • -N 1 – allow 1 mismatch in the seed
  • -x mb20/hairpin_cDNA_hsa.fa – prefix path of index files
  • -Ufq/human_mirnaseq.fastq.gz – FASTQ file for single-end (Unpaired) alignment
  • -Shuman_mirnaseq.sam – tells bowtie2 to report alignments in SAM format to the specified file

...

Create a commands file called bt2.cmds with this task definition then generate and submit a batch job for it (time 1 hour, normal queue).

Expand
titleWhat's going on?

Use nano to create the bt2.cmds file. Then:

Code Block
languagebash
titleLocal bowti2 alignment of miRNA data
launcher_creator.py -n bt2 -j bt2.cmds -t 01:00:00 -q normal
sbatch bt2.slurm; showq -u

...

Such is the nature of bowtie2 – it it can be a powerful tool to sift out the alignments you want from a messy dataset with limited information, but doing so requires careful tuning of the parameters, which can take quite a few trials to figure out.

Exercise: About how many records in human_mirnaseq.sam represent aligned reads?

Expand
titleSolution

We can use our cut / grep trick from Exercise #1, but on the human_mirnaseq.sam file. Since all read names in this file start with TUPAC, we'll use that pattern to select non-header lines.

Code Block
languagebash
titleCount aligned SAM records
grep -P '^TUPAC' human_mirnaseq.sam | cut -f 3 | grep -v '*' | wc -l

This expressions returns 221086.

Use sort and uniq to create a histogram of mapping qualities

The mapping quality score is in field 5 of the human_mirnaseq.sam file. We can do this to pull out only that field:

Code Block
languagebash
titleCut mapping quality SAM field
grep -P '^TUPAC' human_mirnaseq.sam | cut -f 5 | head

We will use the uniq create a histogram of these values. The first part of the --help for uniq says:

Code Block
titleWhat uniq does
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.
Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences

To create a histogram, we want to organize all equal mapping quality score lines into an adjacent block, then use uniq -c option to count them. The sort -n command does the sorting into blocks (-n means numerical sort). So putting it all together, and piping the output to a pager just in case, we get:

Code Block
languagebash
titleCut mapping quality SAM field
grep -P '^TUPAC' human_mirnaseq.sam | cut -f 5 | sort -n | uniq -c | more

Exercise: What is the flaw in this "program"?

Expand
titleAnswer

We are looking at mapping quality values for both aligned and un-aligned records, but mapping quality only makes sense for aligned reads. This expression does not distinguish between mapping quality = 0 because the read mapped to multiple locations, and mapping quality = 0 because the sequence did not align.

The proper solution will await the use of samtools to filter out unmapped reads.

Exercise #3: BWA-MEM - Human mRNA-seq

...