...

The FASTX Toolkit provides a set of command line tools for manipulating both FASTA and FASTQ files. The available modules are described on the project's website. They include a fast fastx_trimmer utility for trimming FASTQ sequences (and quality score strings) before alignment.

...

Data staging

Set up to process the yeast data if you haven't already.

...

Note that the FASTX Toolkit also has programs that work on FASTA files. To list them, type fasta_ and then press Tab twice (shell completion).

Adapter trimming with cutadapt

Data from RNA-seq or other library prep methods that result in short fragments can cause problems with moderately long (50-100 bp) reads, since the 3' end of the read can extend past the insert and into the 3' adapter at a variable position. This 3' adapter contamination can prevent the "real" insert sequence from aligning, because the adapter bases at the 3' end of the read do not match the reference genome sequence.
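
To make the idea of 3' adapter contamination concrete, here is a minimal Python sketch of exact-match 3' adapter removal. It is only an illustration and not how cutadapt is implemented (cutadapt also tolerates mismatches). The function name trim_3prime_adapter, the min_overlap parameter, and the toy sequences are all assumptions made for this example.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact-match 3' adapter (full or partial) from a read."""
    # Case 1: the whole adapter occurs in the read. Everything from its
    # start onward is adapter (or sequence past it), so cut there.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so only an
    # adapter prefix is present at the 3' end. Try the longest prefix first.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


# Toy data (hypothetical, for illustration only).
insert = "ACGTACGTACGTACGTAC"
adapter = "TGGAATTCTCGG"

full = trim_3prime_adapter(insert + adapter + "AAAA", adapter)
partial = trim_3prime_adapter(insert + adapter[:5], adapter)
clean = trim_3prime_adapter(insert, adapter)
print(full == insert, partial == insert, clean == insert)
```

Because the insert length varies from fragment to fragment, the adapter starts at a different position in each read, which is why a fixed-position trimmer is not enough here.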

...

Answer:
Providing --error-rate=0.05 (or -e 0.05) as an option, for example, would specify a 5% error rate, or no more than 1 mismatching base in 20.
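
The arithmetic behind that answer can be sketched in a few lines of Python. This assumes the allowed number of mismatches is the matched length multiplied by the error rate, rounded down; check the cutadapt documentation for the exact rule it applies. The function name max_allowed_errors is made up for this example.

```python
import math


def max_allowed_errors(matched_length: int, error_rate: float) -> int:
    """Mismatches tolerated over a match of the given length (assumed rule)."""
    # Round down; the tiny epsilon guards against floating point products
    # that land just below a whole number.
    return math.floor(matched_length * error_rate + 1e-9)


for length in (10, 19, 20, 40):
    print(length, max_allowed_errors(length, 0.05))
```

Under this assumption, a 5% error rate allows no mismatches until at least 20 adapter bases are matched, so short partial adapter matches at the very end of a read must be exact.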

cutadapt example

Let's run cutadapt on some real human miRNA data.

...

  • The Total reads processed line tells you how many sequences were in the original FASTQ file.
  • Reads with adapters tells you how many of the reads you gave it had at least part of an adapter sequence that was trimmed.
    • Here an adapter was found in nearly all (98.4%) of the reads. This makes sense given this is a short (15-25 bp) RNA library.
  • The Reads that were too short line tells you how many sequences were filtered out because they were shorter than our minimum length (20) after adapter removal (these may have been primer dimers).
    • Here ~13% of the original sequences were removed, which is reasonable.
  • Reads written (passing filters) tells you the total number of reads that were written out by cutadapt.
    • These are reads that were at least 20 bases long after adapters were removed.
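
The relationship between these report lines can be sketched with a small Python tally. This is a toy model under stated assumptions: it only looks for an exact, full-length adapter, and the function name summarize_trimming, the dictionary keys, and the example reads are hypothetical rather than cutadapt's own.

```python
def summarize_trimming(reads, adapter, min_length=20):
    """Tally counts analogous to the lines of the trimming report."""
    with_adapter = 0
    too_short = 0
    written = 0
    for read in reads:
        pos = read.find(adapter)
        if pos != -1:
            # Adapter found: count the read and keep only the insert.
            with_adapter += 1
            read = read[:pos]
        if len(read) < min_length:
            # Filtered out after trimming (e.g. an adapter-only read).
            too_short += 1
        else:
            written += 1
    return {
        "total": len(reads),
        "with_adapter": with_adapter,
        "too_short": too_short,
        "written": written,
    }


adapter = "TGGAATTCTCGG"          # hypothetical adapter sequence
reads = [
    "ACGT" * 6 + adapter,         # 24-base insert plus adapter: kept
    "ACGT" * 3 + adapter,         # 12-base insert plus adapter: too short
    adapter,                      # adapter only, no insert: too short
    "ACGT" * 8,                   # no adapter, 32 bases: kept
]
print(summarize_trimming(reads, adapter))
```

In this model, reads written always equals total reads minus reads that were too short, which is a quick sanity check to apply to a real report when minimum length is the only filter in use.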

Paired-end data considerations

Special care must be taken when removing adapters for paired-end FASTQ files.

  • For paired-end alignment, aligners want the R1 and R2 FASTQ files to list reads in the same name order and to contain the same number of reads.
  • Adapter trimming can remove FASTQ sequences if the trimmed sequence is too short
    • but the reads discarded from R1 may not be the same ones discarded from R2
    • this leads to mismatched R1 and R2 FASTQ files, which can cause problems with aligners like bwa
  • cutadapt has a protocol for re-syncing the R1 and R2 files when the R2 reads are trimmed.
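
The sketch below illustrates one simple way to keep R1 and R2 in sync: filter reads as pairs, dropping a pair whenever either mate is too short. It is an assumption-laden illustration of the idea, not cutadapt's actual code, and the function name filter_pairs and the (name, sequence) tuple layout are invented for this example.

```python
def filter_pairs(r1_reads, r2_reads, min_length=20):
    """Keep only pairs in which both trimmed mates pass the length filter.

    Each input is a list of (read_name, sequence) tuples in the same order.
    """
    if len(r1_reads) != len(r2_reads):
        raise ValueError("R1 and R2 must contain the same number of reads")
    kept_r1, kept_r2 = [], []
    for (name1, seq1), (name2, seq2) in zip(r1_reads, r2_reads):
        if name1 != name2:
            raise ValueError(f"read names out of sync: {name1} vs {name2}")
        # Drop the whole pair if either mate is too short, so the two
        # outputs stay the same length and in the same name order.
        if len(seq1) >= min_length and len(seq2) >= min_length:
            kept_r1.append((name1, seq1))
            kept_r2.append((name2, seq2))
    return kept_r1, kept_r2


# Toy trimmed reads (hypothetical): pair "b" fails on R1, pair "c" on R2.
r1 = [("a", "A" * 30), ("b", "A" * 10), ("c", "A" * 25)]
r2 = [("a", "C" * 30), ("b", "C" * 30), ("c", "C" * 5)]
out1, out2 = filter_pairs(r1, r2)
print([name for name, _ in out1], [name for name, _ in out2])
```

Filtering R1 and R2 independently would instead leave "c" only in R1 and "b" only in R2, which is exactly the mismatch that causes trouble for paired-end aligners.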

Running cutadapt in a batch job

Now we're going to run cutadapt on the larger FASTQ files, and also perform paired-end adapter trimming on some yeast paired-end RNA-seq data.

...