...
Code Block |
---|
language | bash |
---|
title | Get contig name info with cut |
---|
|
grep -P -v'^HWI '^@' yeast_pairedend.sam | cut -f 3 | head |
...
Code Block |
---|
language | bash |
---|
title | Filter contig name of * (unaligned) |
---|
|
grep -P -v '^@^HWI' yeast_pairedend.sam | cut -f 53 | grep -v '*' | head |
Perfect! We're only seeing real contig names that (usually) represent aligned reads. Let's count them by piping to wc -l (and omitting omit head of course – we want to count everything).
Code Block |
---|
language | bash |
---|
title | Count unaligned aligned SAM records |
---|
|
grep -P -v '^@^HWI' yeast_pairedend.sam | cut -f 53 | grep -v '*' | wc -l |
Exercise: About how many records represent aligned sequences? What alignment rate does this represent?
...
Expand |
---|
|
Parameters are: - --local – local alignment mode
- -L 16 – seed length 16
- -N 1 – allow 1 mismatch in the seed
- -x mb20/hairpin_cDNA_hsa.fa – prefix path of index files
- -Ufq/human_mirnaseq.fastq.gz – FASTQ file for single-end (Unpaired) alignment
- -Shuman_mirnaseq.sam – tells bowtie2 to report alignments in SAM format to the specified file
|
...
Create a commands file called bt2.cmds with this task definition then generate and submit a batch job for it (time 1 hour, normal queue).
Expand |
---|
|
Use nano to create the bt2.cmds file. Then: Code Block |
---|
language | bash |
---|
title | Local bowti2 alignment of miRNA data |
---|
| launcher_creator.py -n bt2 -j bt2.cmds -t 01:00:00 -q normal
sbatch bt2.slurm; showq -u |
|
...
Such is the nature of bowtie2 – it it can be a powerful tool to sift out the alignments you want from a messy dataset with limited information, but doing so requires careful tuning of the parameters, which can take quite a few trials to figure out.
Exercise: About how many records in human_mirnaseq.sam represent aligned reads?
Expand |
---|
|
We can use our cut / grep trick from Exercise #1, but on the human_mirnaseq.sam file. Since all read names in this file start with TUPAC, we'll use that pattern to select non-header lines. Code Block |
---|
language | bash |
---|
title | Count aligned SAM records |
---|
| grep -P '^TUPAC' human_mirnaseq.sam | cut -f 3 | grep -v '*' | wc -l |
This expressions returns 221086. |
Use sort and uniq to create a histogram of mapping qualities
The mapping quality score is in field 5 of the human_mirnaseq.sam file. We can do this to pull out only that field:
Code Block |
---|
language | bash |
---|
title | Cut mapping quality SAM field |
---|
|
grep -P '^TUPAC' human_mirnaseq.sam | cut -f 5 | head |
We will use the uniq create a histogram of these values. The first part of the --help for uniq says:
Code Block |
---|
|
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
Mandatory arguments to long options are mandatory for short options too.
-c, --count prefix lines by the number of occurrences |
To create a histogram, we want to organize all equal mapping quality score lines into an adjacent block, then use uniq -c option to count them. The sort -n command does the sorting into blocks (-n means numerical sort). So putting it all together, and piping the output to a pager just in case, we get:
Code Block |
---|
language | bash |
---|
title | Cut mapping quality SAM field |
---|
|
grep -P '^TUPAC' human_mirnaseq.sam | cut -f 5 | sort -n | uniq -c | more |
Exercise: What is the flaw in this "program"?
Expand |
---|
|
We are looking at mapping quality values for both aligned and un-aligned records, but mapping quality only makes sense for aligned reads. This expression does not distinguish between mapping quality = 0 because the read mapped to multiple locations, and mapping quality = 0 because the sequence did not align. The proper solution will await the use of samtools to filter out unmapped reads. |
Exercise #3: BWA-MEM - Human mRNA-seq
...