Overview

As discussed throughout the class, long reads have recently become cheaper and more accurate, making them more available and more useful for NGS applications. This tutorial is designed to build on the same principles as the short read tutorials (though using different tools in some cases) to familiarize yourself with, and improve, a set of long reads.

Learning Objectives

This tutorial introduces you to long read fastq files generated from Oxford Nanopore data, and compares such data to short read Illumina data in light of what you have learned throughout the course. After completing this tutorial you should:

  1. Be able to make high level comparisons of long read and short read data
  2. Be able to interrogate long read data
  3. Be able to trim data (as appropriate)
  4. Be able to filter data (as appropriate)

Conda

Depending on which optional tutorials you have worked on, you may have created several different environments which contain several of the programs used later in this tutorial. As environments are set up around programs, not around data, it should make sense that you can use the same environments that you created for short reads with long reads. This does not mean that the "best" tool for short reads is the "best" tool for long reads, or even "operational" with long reads, and vice versa. This tutorial is written assuming you create the following new environment; you can of course skip this and swap in your existing environments as needed for whatever parts of the tutorial you are interested in trying for yourself.

conda create --name GVA-ReadQC-long -c conda-forge -c bioconda fastqc seqfu cutadapt porechop filtlong
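Once the environment finishes building, remember to activate it before running any of the commands later in this tutorial:

conda activate GVA-ReadQC-long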


Get some data

Possible errors on idev nodes

As mentioned throughout the course, you cannot copy from the BioITeam (because it is on corral-repl) while on an idev node. Log out of your idev session before copying the files.

Recursively copy entire directory
mkdir $SCRATCH/GVA_nanopore
cp -r $BI/gva_course/nanopore/raw_reads/ $SCRATCH/GVA_nanopore/raw_reads
Use wildcards to copy entire directory contents
mkdir -p $SCRATCH/GVA_nanopore/raw_reads
cp $BI/gva_course/nanopore/raw_reads/* $SCRATCH/GVA_nanopore/raw_reads
 

The choice between the two boxes is a question of style and preference; as you continue on you may find yourself favoring one over the other.

The files downloaded are from a single sample and are the raw files produced after on-instrument basecalling on a Mk1C instrument. In the next section we will begin to interrogate them.

Basic data handling


How many files make up the raw data? 34

You can tell this by:

  1. looking at the largest numbered file name (FAL60196_pass_barcode01_6e61df78_46d0b948_33.fastq.gz) and noticing that numbering starts with "0"
  2. using a command like: ls | wc -l
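For example, a sketch that counts the files without first changing into the directory:

# count the fastq.gz files that were copied
ls $SCRATCH/GVA_nanopore/raw_reads | wc -l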


How many reads are in each file? 4,000 OR 952

A command might look like: zgrep -c "^+$" *.gz

The highest numbered file has a read count of 952 while the rest have exactly 4,000 reads. This may do the following:

  1. make you suspicious, when working with data from another source, that you are missing reads if the read counts are an exact multiple of 4,000
  2. be useful as a way of filtering reads (discussed later in the tutorial)

This may be the only place in the course where file compression options are explicitly discussed. There are multiple ways that files can be compressed, though throughout the course we only work with gzip. This allows us to take advantage of zgrep (as above) but also the ability to quickly combine gzipped files by simple concatenation. You may be able to think of uses for this with paired end reads if you have sequenced the same sample on multiple runs. Be careful when doing so, as most programs that use paired end information only do so by comparing line by line between the paired end files, not actually checking the header information for pairs. Other compression programs (zip, bzip2, xz) do not offer this functionality.
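As a sketch with hypothetical file names, combining two runs of the same paired end sample would look like the following; the critical detail is concatenating the runs in the same order for both files so the mates stay line-for-line in sync:

# hypothetical file names; keep the run order identical for R1 and R2
cat run1_R1.fastq.gz run2_R1.fastq.gz > combined_R1.fastq.gz
cat run1_R2.fastq.gz run2_R2.fastq.gz > combined_R2.fastq.gz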

In the case of long read sequencing we are working with single end reads, and while there may be reasons to keep these partial files, quality assessment is more logical if done on the entire data set. Because the reads are single ended, the order in which the files appear in the combined file does not actually matter.

A quick reminder on interrupting the console

Recall that "control + c" will stop whatever command you are currently running. This is mentioned here to highlight the importance of the ">" mark in the next code block: without the redirect, cat would print the entire combined file to your terminal.



cd $SCRATCH/GVA_nanopore/
cat raw_reads/*.gz > barcode01.combined.fastq.gz

How many reads are in the combined file? 132,952

A command might look like: zgrep -c "^+$" barcode01.combined.fastq.gz

Quality Assessment 

Just like when you were first introduced to short read fastq files, it is very common to want to quickly get a first impression of the data you are working with. Again we will use fastqc, but also seqfu, which gives additional metrics for our file.

fastQC

Notice that even though there are only ~130k reads, fastqc takes much longer to run than it did on our bacterial samples, as the total number of bases in the file is higher
fastqc barcode01.combined.fastq.gz
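To view the report you will need to get the html file onto your local computer first. A minimal sketch, run from your local machine rather than TACC, assuming an ls6 login (substitute your own username, host, and scratch path as needed):

# run from your LOCAL terminal; the single quotes let the remote side expand $SCRATCH
# (if your scp version does not expand it, substitute the full scratch path)
scp username@ls6.tacc.utexas.edu:'$SCRATCH/GVA_nanopore/barcode01.combined_fastqc.html' .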

After transferring the barcode01.combined_fastqc.html file back to your computer, and being at least a little familiar with short read data, you may have questions such as:

  1. Why are the quality scores so much lower?
    1. the quality scores associated with nanopore are simply different than those of Illumina, so the "good, ok, bad" notation doesn't really help here.
    2. https://pubmed.ncbi.nlm.nih.gov/34597327/ goes into much more detail on this.
  2. Why is the per base sequence content so blocky at the start before smoothing out and getting blocky near the end again?
    1. Nanopore barcodes are the first bases of each read, so if anything it is surprising that the first bases are not at even higher frequency.
  3. Why does the per sequence GC content have a left shoulder?
    1. I expect this is caused by the GC content of the adapter sequence, present on all reads, being higher than that of the organism overall, with 100% of the reads having the adapter sequence as their first bases. Additionally, the nature of a transposon generated library means a large number of reads also have the same barcode sequence on their 3' terminal end, for those reads which are completely sequenced by the pore.
  4. Why does the length distribution look backwards?
    1. Illumina's sequencing by synthesis design means that there are only enough reagents available to sequence a given number of cycles (bases), so there is a maximum read length.
    2. Nanopore is all about the fragment length and the fragment traversing the pore, so fewer reads will be as long. In the next section you will determine the minimum read length required to be considered 'passing filters' and thus included in the output.
  5. How can there be no overrepresented sequences if 100% of the reads are supposed to start with the same barcode sequence, and why did it not identify the adapter sequence?
    1. In both cases (like with the quality score determination in 1.) this is related to fastqc having expectations centered around Illumina sequencing. Interestingly, you may notice a slight tick up in the polyA line; this suggests that failed pores may automatically read as "A".

Seqfu

If you installed seqfu in a new conda environment at the start of this tutorial, you can also use it to look at some overall sequencing stats. In my own work, I find this particularly useful for comparing across samples and trying to correlate assembly or variant calling outcomes with these metrics. Luckily the command is very simple, and with a single sample it does not take too long.

This same command can be used to check the distributions of each of the 4,000 read files, or of the combined file, depending on which directory you are in.
seqfu stats -n -v *.gz --csv
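For example, to run it on the combined file from the top level directory:

cd $SCRATCH/GVA_nanopore
seqfu stats -n -v barcode01.combined.fastq.gz --csv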

For the combined reads from this we can see:

  1. The minimum length to pass filter is probably 100bp.
    1. We only actually detect reads at 104 bp and above, but 104 as a cutoff makes no sense.
  2. The largest read is ~103kb, about 30x larger than the average read.
  3. The information regarding the % of reads which comprise a given fraction of the total length, as discussed in the presentation.
    1. Again, auN is likely something you aren't familiar with, but https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity is a good resource for beginning to understand it (see the formula below).
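For reference, auN is the area under the Nx curve described in that post. A minimal derivation, assuming a set of read lengths $L_i$: each read contributes a segment of the curve whose width is its fraction of the total bases and whose height is its length, so

$$\mathrm{auN} = \sum_i L_i \cdot \frac{L_i}{\sum_j L_j} = \frac{\sum_i L_i^2}{\sum_i L_i}$$

In other words, it is a length-weighted mean read length.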

Quality Control

Given the information you have about the data set, there are a few things that can be done to improve it. Take a moment to think about what we saw above in the fastqc and seqfu results, and how you might act on it, before continuing.

Adapter Trimming

Ultimately, this is something that is not routinely done and does not appear to interfere with data analysis (citations needed). Unfortunately, our favorite trimming tool fastp does not work on nanopore data, nor do other programs like Trimmomatic.

porechop

https://github.com/rrwick/Porechop is a tool specifically developed for removing adapter sequences from nanopore libraries. Unfortunately it can be quite slow, and it has been abandoned by its creator due to the difficulty of keeping it up and running, as you can see on the github page. That being said, the tool is functional and it does work. It also provides an option to remove reads which have a different barcode sequence detected in the middle of the read, which seems like a safer bet. As mentioned, this command can take quite a bit of time to complete, so consider modifying the following command to work on just one of the raw sequencing files with only 4,000 reads in it (see the sketch after the suggested command) if you are just looking for an overview of the program and don't plan to run the assembly afterwards.

Suggested command
porechop -t 48 -i barcode01.combined.fastq.gz -o barcode01.combined.adapter-trimmed.fastq.gz --discard_middle
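If you only want the overview, here is a sketch of the same command on a single raw file; the file name assumes the first chunk ends in _0.fastq.gz, matching the numbering discussed at the start of the tutorial:

# quicker alternative: trim one 4,000 read chunk instead of the combined file
porechop -t 48 -i raw_reads/FAL60196_pass_barcode01_6e61df78_46d0b948_0.fastq.gz -o barcode01.chunk0.adapter-trimmed.fastq.gz --discard_middle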

From the output you can notice the following about which adapters are present:

Trimming adapters from read ends
  Rapid_adapter: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA
       BC01_rev: CACAAAGACACCGACAACTTTCTT
           BC01: AAGAAAGTTGTCGGTGTCTTTGTG
       BC02_rev: ACAGACGACTACAAACGGAATCGA
           BC02: TCGATTCCGTTTGTAGTCGTCTGT
           BC01: AAGAAAGTTGTCGGTGTCTTTGTG
       BC01_rev: CACAAAGACACCGACAACTTTCTT
           BC02: TCGATTCCGTTTGTAGTCGTCTGT
       BC02_rev: ACAGACGACTACAAACGGAATCGA

Obviously there is some concern about the presence of both barcode 1 and barcode 2 sequences in a single sample. I don't have an explanation for how this can happen, either in relation to the library prep or to how the sequencer splits the reads based on the presence of this signal, but I suspect it may be related to the middle adapter situation. Reads with an adapter in the middle of the read are quite rare:

2,459 / 132,952 reads were discarded based on middle adapters

In total more than 14 Mbp were removed:

132,715 / 132,952 reads had adapters trimmed from their start (13,594,659 bp removed)
 19,374 / 132,952 reads had adapters trimmed from their end (708,769 bp removed)
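You can confirm the effect of --discard_middle by counting the reads that survive; 132,952 minus the 2,459 discarded should leave 130,493:

# count reads remaining after adapter trimming
zgrep -c "^+$" barcode01.combined.adapter-trimmed.fastq.gz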



Cutadapt

As mentioned in the fastqc results, the first few bases are directly related to the barcode sequence present, and as mentioned in many other tutorials, when you can get rid of these artifacts it's usually a good idea. In the initial read QC tutorial we used cutadapt to remove low quality bases from the 3' end of the read, but it can just as easily be used to remove bases from the 5' end.

Can you figure out what option you need to do this using cutadapt --help? (don't forget you also want the output to go somewhere other than your terminal)
cutadapt -o barcode01.combined.trimmed.fastq.gz -u 9 barcode01.combined.fastq.gz
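A quick sanity check is to compare the first read before and after trimming; the trimmed sequence line should be missing its first 9 bases:

# print the header and sequence line of the first read in each file
zcat barcode01.combined.fastq.gz | head -2
zcat barcode01.combined.trimmed.fastq.gz | head -2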


How many bases were removed? ~1.2 Mbp (9 bp trimmed from each of 132,952 reads = 1,196,568 bp), based on:

Total basepairs processed:   438,208,826 bp
Total written (filtered):    437,012,258 bp (99.7%)


Filtering

As mentioned in several of the presentations, you don't need super high coverage for most applications (and when you do need high coverage, you often have to contend with error rates). Therefore it is often beneficial to use only the best of the best reads. filtlong (https://github.com/rrwick/Filtlong) is a program that allows you to set various thresholds to make these divisions. Those who are extra observant will notice this is the same developer as the one who created porechop.

What filtering needs to be done will largely depend on the downstream applications, and a single dataset may even be filtered in different ways for different phases of the analysis (i.e. only the longest reads used to assemble, while the highest quality reads are used to polish the assembly, removing as many SNVs as possible). One of the benefits of this program, beyond its capabilities, is that the github page is extremely descriptive, allowing easy understanding of the commands and what you need to modify. For now, focus on the example command for having no reference available: https://github.com/rrwick/Filtlong#example-commands-detailed. Since our data set is under 500 Mbp, the --target_bases option in that example would have no effect and can be ignored.

What command did you come up with?
filtlong --min_length 1000 --keep_percent 90 barcode01.combined.fastq.gz | gzip > barcode01.combined.min1k.top90.fastq.gz
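To see what the filtering actually did, you can count the surviving reads and recheck the length distribution:

# count surviving reads, then recheck the stats
zgrep -c "^+$" barcode01.combined.min1k.top90.fastq.gz
seqfu stats -n barcode01.combined.min1k.top90.fastq.gz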

How many reads remain? 83,189. seqfu also shows an increase of ~50% in average read length.


Next Steps

Consider trimming in different ways, or making more or less stringent filtered sets, and then move on to the next tutorials.


Return to the GVA 2023 course page
