...

At the end of the Tophat process, you have a BAM file describing the alignment of the input data to genomic coordinates.

FASTQ preparation

Although we won't cover them here, there are some issues you should consider before embarking on the Tuxedo pipeline:

  1. Should my FASTQ sequences be trimmed to remove low-quality 3' bases?

    Possibly, if FastQC or other base-quality reports show that the data quality is very poor. In general, though, the fact that Tophat splits long reads into smaller fragments mitigates the need to do this.

  2. Should I remove adapter sequences before running Tophat?

    This is usually a good idea, because non-templated adapter bases reduce mappability more drastically than low-quality 3' bases do.

  3. Should I attempt to remove sequences that map to undesired RNAs (rRNA, for example) before running Tophat?

    This is also usually a good idea, because rRNA sequences can make up a substantial proportion of your data (depending on the library prep method), which can skew cuffdiff's fragment counting statistics.

  4. How would, for example, rRNA sequence removal be done?

    One possible approach:

    • Align your sequences to a reference "genome" consisting only of rRNA gene sequences.
    • Extract only the sequences that do not align to the rRNA reference into a new FASTQ file and use that as Tophat input.
  5. What other pre-processing steps might I consider?

    There are many, and which ones apply will depend on your data and what you want to get out of it.

    If you have paired-end data, Tophat asks you to provide the mean fragment (insert) size and the standard deviation of insert sizes in your library. One common way to obtain these values is to do a quick paired-end alignment of a subset of your reads (about 1 million sequences, for example) to a reference genome. You can then calculate the mean and standard deviation of insert sizes for properly paired reads from the resulting BAM file records, and pass these values to Tophat.
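
The insert-size calculation described in item 5 could be sketched roughly as follows. This is only an illustration, not part of the course materials: it assumes the alignment records have been dumped as SAM text (for example with samtools view), and the function name insert_size_stats is hypothetical. It uses the TLEN column (field 9) of properly paired records and counts each pair once by keeping only the mate with a positive TLEN.

```python
import statistics

# SAM flag bits (from the SAM specification)
PROPER_PAIR = 0x2       # each segment properly aligned according to the aligner
SECONDARY = 0x100       # secondary alignment
SUPPLEMENTARY = 0x800   # supplementary alignment


def insert_size_stats(sam_lines):
    """Return (mean, standard deviation) of template lengths.

    sam_lines: iterable of SAM-format text lines (header lines allowed).
    Only primary, properly paired records with TLEN > 0 are used, so each
    pair contributes exactly one value (its leftmost mate).
    """
    sizes = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        tlen = int(fields[8])   # column 9: observed template length
        if not flag & PROPER_PAIR:
            continue
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if tlen > 0:
            sizes.append(tlen)
    if len(sizes) < 2:
        raise ValueError("need at least two properly paired fragments")
    return statistics.mean(sizes), statistics.stdev(sizes)
```

A script built around this function might read lines from standard input, so that the output of samtools view on the quick test alignment could be piped into it. Note that TLEN measures the whole fragment, from the leftmost to the rightmost aligned base. If the aligner option you are setting expects the inner distance between mates instead (as Tophat's -r/--mate-inner-dist does), subtract the two read lengths from the mean first.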

Some Logistics...

Six raw data files were provided as the starting point:

...