There are a number of open source tools that can trim off 3' bases and produce a FASTQ file of the trimmed reads to use as input to the alignment program.
The FASTX-Toolkit provides a set of command line tools for manipulating fasta and fastq files. The available modules are described on their website. They include a fast fastx_trimmer utility for trimming fastq sequences (and quality score strings) before alignment.
FASTX-Toolkit is available via the TACC module system.
#remember to load load biocontainers module if you haven't already- fastx_toolkit is part of that module spider fastx module load fastx_toolkit |
Let's run fastx_trimmer to trim all input sequences down to 90 bases:
fastx_trimmer -i data/Sample1_R1.fastq -l 90 -Q 33 -o Sample1_R1.trimmed.fastq |
fastq_quality_filter -q <N> -p <N> -i <inputfile> -o <outputfile> -q N: Minimum Base quality score -p N: Minimum percent of bases that must have [-q] quality |
Let's try it on our data- trim it to only include reads with atleast 80% of the read having a quality score of 30 or above.
fastq_quality_filter -q 20 -p 80 -i data/Sample1_R1.fastq -Q 33 -o Sample1_R1.filtered.fastq |
grep '^@HWI' Sample1_R1.trimmed.fastq |wc -l grep '^@HWI' Sample1_R1.filtered.fastq |wc -l |
Unlike general fixed-length trimming (e.g. trimming 100 bp sequences to 40 or 50 bp), adapter trimming removes differing numbers of 3' bases depending on where the adapter sequence is found.
The GSAF website describes the flavaors of Illumina adapter and barcode sequence in more detail https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavors
For old ligation-based RNA-libraries, the R1 adaptor is different from the R2 adaptor. For current stranded RNA libraries (such as dUTP, used by GSAF), the R1 adaptor would be the same as the R2 adaptor. Fastqc reports will give you an idea about which adaptor is present in your data. Further, it's always a good idea to grep -c <partofdaptorseq> <fastqfile> to make sure you have the right adaptor sequence before trimming. |
One of the programs available as part of the fastx toolkit does a crude job of clipping adaptors out of sequences.
fastx_clipper will clip a certain nucl. sequence (eq: adapter) from your reads.
fastx_clipper -a <adapter> -i <inputfile> -o <outputfile> -l <discardSeqsShorterThanN> |
fastx_clipper -a GATCGGAAGAGCACACGTCTGAACTCCA -i data/Sample1_R1.fastq -o Sample1_R1.cutadapt.fastq -l 20 |
More sophisticated tools like Trimgalore and cutadapt may be suitable, particularly with dealing with paired end data.
module load cutadapt cutadapt -a GATCGGAAGAGCACACGTCTGAACTCCA -o <R1_outputfile> -p <R2_outputfile> <inputR1file> <innputR2file> |
Please refer to https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavors for Illumina library adapter layout. Below information holds true from old ligation-based RNA-libraries. In this case, the R1 adaptor is different from the R2 adaptor. For current stranded RNA libraries (such as dUTP, used by GSAF), the R1 adaptor would be the same as the R2 adaptor. Fastqc reports will give you an idea about which adaptor is present in your data. Further, it's always a good idea to grep -c <partofdaptorseq> <fastqfile> to make sure you have the right adaptor sequence before trimming. |
Let's look at how we can aggregate multiple FastQC reports into one report with MultiQC
BACK TO THE COURSE OUTLINE