Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
cut -f 1 alignments/human_rnaseq_mem.sam | sort | uniq -c | less	#This gives you a view where each read is listed next to the number of entries it has in the SAM file
cut -f 1 human_rnaseq.sam | sort | uniq -c | awk '{print $1}' | sort | uniq -c | less	#This gives essentially a histogram of the number of times each read aligned - a plurality of reads aligned twice, which seems reasonable since these are all reads crossing a junction, but plenty aligned more or less
cut -f 1 human_rnaseq.sam | sort | uniq | wc -l	#This gives a better idea of the alignment rate, which is how many reads aligned at least once to the genome.  Divided by the number of reads in the original file, the real alignment rate is around 64.19 %.
 
# PLEASE NOTE: some of these one-liners are only reasonably fast if the files are relatively small (around a million reads or less).  For bigger files, there are better ways to get this information, mostly using samtools, which is a utility explicitly designed for manipulating SAM/BAM files.  We'll cover samtools in the next section.

If you want to be able to generate a file with only the best (for BWA MEM, that means the longest) alignment for each read, so that every read can only be represented by one line, the option -M will add a flag to the end of each alignment entry that designates any alignments that are not the longest for their originating read as secondary.  Then, you can filter all reads that do not have that flag.

This alignment rate is pretty good, but it could get better by playing around with the finer details of BWA MEM.  It is also a bit higher if you use an aligner that is more explicitly concerned with sensitivity to splice sites, namely a program like Tophat2.  All Tophat programs use Bowtie or Bowtie2 as the actual aligner, but split up each read into smaller pieces to align individually, then tries to reconstruct reasonable overall alignments from the pieces.  In fact, these reads can all be aligned by using Tophat2 with the right parameters.