Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents
excludeTable of Contents

Overview

Image RemovedImage Added

After raw sequence files are generated (in FASTQ format), quality-checked, and pre-processed in some way, the next step in most NGS pipelines is mapping to a reference genome. For individual sequences, it is common to use a tool like BLAST to identify genes or species of origin. However, a normal NGS dataset will have tens to hundreds of millions of sequences, which BLAST and similar tools are not designed to handle.

...

Expand
titleAnswer
There are 1,184,360 alignment records.

Exercise: How many sequences were in the R1 and R2 FASTQ files combined?

Expand
titleHint

gunzip -c fastq/Sample_Yeast_L005_R1.cat.fastq.gz | echo $((`wc -l` / 2))

Expand
titleAnswer
There were a total of 1,184,360 original sequences

Exercises:

  • Do both R1 and R2 reads have separate alignment records?
  • Does the SAM file contain both aligned and un-aligned reads?
  • What is the order of the alignment records in this SAM file?
Expand
titleAnswers
  • Do both R1 and R2 reads have separate alignment records?
    • yes, they must, because there were 1,184,360 R1+R2 reads and an equal number of alignment records
  • Does the SAM file contain both aligned and un-aligned reads?
    • yes, it must, because there were 1,184,360 R1+R2 reads and an equal number of alignment records
  • What is the order of the alignment records in this SAM file?
    • the names occur in the exact same order as they did in the FASTQ, except that they come in pairs
      • the R1 read comes first, then its corresponding R2
    • this ordering is called read name ordering

...