Unlocking the full potential of nanopore sequencing: tips, tricks, and advanced data analysis techniques
Daria Meyer, Winfried Goettsch, Jannes Spangenberg, Bettina Stieber, Sebastian Krautwurst, Martin Hölzer, Christian Brandt, Jörg Linde, Christian Höner zu Siederdissen, Akash Srivastava, Milena Zarkovic, Damian Wollny, Manja Marz

TL;DR
This paper explores how to improve nanopore sequencing by offering tips and advanced data analysis techniques to maximize its potential.
Contribution
The paper provides comprehensive guidelines and statistically supported insights to address challenges in nanopore sequencing.
Findings
Nanopore sequencing allows real-time analysis of long nucleic acid strands in portable devices.
The paper identifies persistent challenges in protocols and technology that hinder optimal results.
It offers interdisciplinary approaches to enhance information gain from nanopore sequencing.
Abstract
Nucleic acid sequencing is the process of identifying the sequence of DNA or RNA, with DNA used for genomes and RNA for transcriptomes. Deciphering this information has the potential to greatly advance our understanding of genomic features and cellular functions. In comparison to other available sequencing methods, nanopore sequencing stands out due to its unique advantages of processing long nucleic acid strands in real time, within a small portable device, enabling the rapid analysis of samples in diverse settings. Evolving over the past decade, nanopore sequencing remains in a state of ongoing development and refinement, resulting in persistent challenges in protocols and technology. This article employs an interdisciplinary approach, evaluating experimental and computational methods to address critical gaps in our understanding in order to maximize the information gain from this…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11| Estimate | Std. Error |
| Pr( | Sig. | |
|---|---|---|---|---|---|
| (Intercept) |
| 9.49 |
| 0.590140 | |
| Starting pores | 3.04 | 8.59 | 3.541 | 0.000736 | *** |
| Starting channels |
| 3.28 |
| 0.837636 | |
| Pore half time | 1.14 | 1.74 | 6.532 | 1.09 | *** |
| Library loaded |
| 4.11 |
| 0.716424 | |
| Sample - human |
| 5.52 |
| 0.402119 | |
| Sample - insect |
| 7.30 |
| 0.396922 | |
| Sample - metagenomic |
| 4.39 |
| 0.008470 | ** |
| Sample - mouse |
| 5.09 |
| 0.431852 | |
| Sample - protist |
| 6.71 |
| 0.257067 | |
| Sample - virus |
| 5.91 |
| 0.932298 | |
| RNA/DNA - RNA |
| 6.25 |
| 1.38 | *** |
| Buffer - SFB |
| 4.25 |
| 0.692883 | |
| Buffer - RNA | NA | NA | NA | NA | |
| Fc version - R10 |
| 5.52 |
| 0.004208 | ** |
| Read length | 1.04 | 4.15 | 0.250 | 0.803204 |
- —Deutsche Forschungsgemeinschaft10.13039/501100001659
- —German DFG Collaborative Research Centre AquaDiva
- —German state of Thuringia via the Thüringer Aufbaubank
- —Ministry for Economics, Sciences, and Digital Society of Thuringia
- —Landesprogramm ProDigital
- —Friedrich-Schiller University Jena
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNanopore and Nanochannel Transport Studies · Genomics and Phylogenetic Studies · Single-cell and spatial transcriptomics
Introduction
Nanopore sequencing is a transformative technology in genomics, offering the unique ability to sequence DNA or RNA molecules in their native form. Nanopore sequencing can generate long sequencing reads by measuring disturbances in the ion current as biological molecules such as DNA and RNA pass through a nanopore [1, 2]. During or after sequencing, the raw DNA/RNA signal can be transformed into nucleotide sequences. This capability provides invaluable insights into genetic variation and molecular modifications. Since the debut of the MinION device by Oxford Nanopore Technologies (ONT) in 2014, nanopore sequencing has experienced a surge in popularity and has become a valuable technique in genomic research. For DNA sequencing [2], it is increasingly applied in diagnostics [3–5], whole genome assembly [6], and for metagenomic assemblies [7–10], enabling the study of microbial communities and their functions. In direct RNA sequencing, it supports diverse tasks such as de novo transcriptome assembly, isoform expression quantification [11], and the direct detection of RNA modifications [12]. These applications highlight the technology’s versatility and its ability to address a wide range of biological questions.
One of the key advantages of nanopore sequencing is its ability to generate data rapidly, facilitated by quick library preparation and real-time data acquisition during sequencing [13]. Additionally, nanopore sequencing can directly sequence native RNA molecules without the need for reverse transcription or amplification [11, 14]. Its capacity for long-read sequencing, where fragments up to two megabases in length can be read in a single pass [15], further enhances its utility by providing comprehensive genomic and transcriptomic insights. Furthermore, due to their portability and affordability, devices such as the MinION [16] are particularly valuable for applications such as monitoring virus outbreaks, where fast and on-site sequencing is essential.
Nanopore sequencing offers many advantages but comes with technical challenges. When first introduced, the technology was limited by lower accuracy ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \sim\end{document} 60%) and throughput, though these shortcomings have improved significantly over time [17]. With the introduction of ONT’s MinION device in 2014, accuracy improved to 89.5% [18], with current rates reaching 99% [19]. While in 2021 nanopore sequencing continued to exhibit a higher error rate compared to Illumina sequencing [2], in 2024 sequencing of nearly complete bacterial genomes without short-read or reference polishing became possible [20, 21]. Nonetheless, sequencing accuracy remains a key consideration for certain high-precision applications, particularly in direct RNA sequencing [22].
Another challenge lies in the customization of nanopore wet lab workflows. While numerous reviews provide detailed overviews of tools for standard nanopore sequencing applications [13, 23–25], the field’s relative novelty has given rise to a broad range of non-standard approaches aimed at maximizing sample utility. The versatility of nanopore sequencing lends itself to customization to meet specific requirements, including the use of various flow cells, library preparation kits, sequencing buffers, and an extensive set of computational analysis tools. The sheer number of options can be daunting for newcomers. Optimizing protocols, designing libraries, and selecting parameters require careful consideration. Compounding this issue is the prevalence of unverified claims in the field, as many methods lack experimental validation, making them challenging to integrate into standardized workflows. This is particularly problematic when information is available solely through the Nanopore Community (https://community.nanoporetech.com) and lacks proper citation.
This perspective work integrates insights gained from over 300 nanopore sequencing runs, which span a variety of species, sequencing devices, preparation methods, and sequencing protocols. These data-driven findings provide valuable guidance for addressing common challenges in nanopore sequencing. For instance, we show that sequencing performance is influenced more by the number of starting pores and sample type than by read length, flow cell age, or the amount of library loaded onto the flow cell. Techniques such as flow cell washing and adaptive sampling [26] have been demonstrated to enhance output and improve sequencing yield. Additionally, we offer practical advice for optimizing library preparation, particularly for obtaining long reads and adjusting loading amounts when working with small sample sizes. We also address the complexities of ONT data analysis, providing guidelines for constructing customized bioinformatics pipelines that align with specific experimental goals. Effective methods for calling DNA and RNA modifications, as well as strategies for normalizing raw signal data, are also discussed. By bridging the gap between experimental data and best practices, this review intends to equip researchers with actionable strategies for optimizing nanopore sequencing experiments.
Background on nanopore hardware
At the heart of nanopore sequencing is a protein channel (nanopore) through which DNA or RNA molecules can pass (Fig. 1). Electrical sensors are sensitive to electrical changes as molecules move through the nanopore [27]. Flow cells are filled with an electrolyte solution that establishes a relatively constant electrical current through the nanopore in the absence of DNA or RNA. The sensor continuously records this current as a raw electrical signal in picoamperes (pA) [28–30]. During library preparation, a motor protein is attached to the nucleotide strand, which guides the DNA or RNA molecule though the pore and unwinds it while controlling the pace at which the strand passes through, ensuring accurate sequencing [2]. As each DNA or RNA molecule passes through a nanopore, it disrupts the electrical current in a unique way based on the sequence moving through the pore. In R10 and RNA flowcells, 9 nt in the nanopore generate the signal, while in older R9 flowcells, 5 nt contributed [31, 32]. The recorded electrical signal from the sensor, frequently referred to as the “squiggle,” contains the data used in downstream processing [13].
Nanopore sequencing on a MinION. A MinION flow cell contains 512 channels with 4 nanopores in each channel, resulting in 2048 nanopores available for sequencing DNA or RNA. Each sensor well (90 μm deep) holds a single nanopore and a platinum electrode (at the base of the well). The electrode is controlled via the printed circuit board and measures the ion flow as ionic current. During sequencing, a motor protein guides the DNA/RNA through the nanopore following the electric gradient. The DNA/RNA disrupts the current by partially blocking the ion flow. This can be measured in a resulting squiggle.
Each flow cell contains hundreds to thousands of nanopores, which are grouped into channels of four pores. Flow cells are the standard, full-sized sequencing units, which generate tens to hundreds of gigabases (Gb) of data [33].
Different nanopore sequencing devices exist, including the MinION, GridION, P2, and the PromethION. Combined with their respective flow cell(s), ranging from low throughput and yield Flongles to high throughput and yield PromethION flow cells, they enable sequencing for a wide variety of use cases. A standard MinION flow cell has 2048 pores organized in 512 channels [2].
Flow cells differ in the design of the nanopores, which impacts their performance mainly in terms of sequencing accuracy [34]. In this work, we used R9 and R10 flow cells. R9 flowcells use single-constriction pores, which were designed to be robust, reliable, and capable of handling a wide range of sequencing tasks.
The R10 flow cells feature double-constriction pores, where the nucleotide strand passes through two constriction points instead of one. The double-constriction design enhances the resolution of basecalling and can achieve higher raw sequencing accuracy [35]. At the start of each sequencing run, an initial flow cell check is performed, providing the starting number of available pores [36, 37]. Each pore can only process a limited number of molecules before it becomes inactive. Therefore, the number of available pores decreases over time [35, 36]. Usually, even a new flow cell never has the maximum number of pores available. A warranty is provided by ONT that allows flow cells to be returned if they do not have enough available pores within 3 months of purchase (https://nanoporetech.com/document/flow-cell-check).
The pores in the flow cell are controlled by multiplexing (mux), meaning that only a subset of nanopores is actively sequencing at any given time. Each channel is linked to four wells (or mux groups), but only one well per channel is active at any given moment. A mux scan can identify the most suitable nanopores for continued sequencing by testing multiple mux groups (i.e. groups of nanopores) in different channels [38, 39]. During a mux scan, each well is briefly activated to assess the signal quality and classify wells based on pore availability and activity. The scan can detect issues such as blocked pores or unavailable wells and selects the best-performing well for each channel. If needed, MinKNOW will switch to a different well to reduce the inactive time and maximize data output. Note that a mux scan interrupts sequencing of the current reads, so the interval between scans should be adjusted when aiming for very long reads.
The default mux scan is performed every 1.5 h, after which the sequencing continues within the channel, and the current pore can also be directly re-selected. The interval for the mux scan can be set in the MinKNOW software at the start of each sequencing run. This process helps to optimize the sequencing run by ensuring that only high-quality pores are actively sequencing and can increase the overall lifetime of the flow cell.
In addition, the flow cell can be washed during sequencing to unblock and reactivate pores that were unavailable before. For washing the flow cell, the sequencing run needs to be paused or stopped. Then, the flow cell is incubated with nuclease to digest the loaded library. Afterwards, the flow cell can be loaded again with the same or another sample, and the sequencing can continue.
For more details, we refer the reader to the nanopore documentation (www.nanoporetech.com/document/hardware).
Data characterization and description
This perspective draws its conclusions from an extensive and diverse dataset encompassing over 300 nanopore sequencing runs. Among these, a specific focus is placed on 241 sequencing runs conducted using R9 and R10 flowcells on the MinION and GridION sequencing devices. This dataset offers robust insights into sequencing performance across various conditions. The study also acknowledges instances of sequencing failures and runs conducted on earlier flow cell types (e.g. flongles or R8 flow cells), although these are not included in the accompanying statistical analyses.
The 241 sequencing runs we focus on (see Supplementary Table S1) include 207 DNA sequencing runs and 34 direct RNA sequencing runs. These runs span a wide range of organisms, including 33 viruses, 45 bacteria, 5 protists, 2 plants, 47 insects, 17 mice, 38 humans, 51 metagenomic, and 3 synthetic samples. The amount of input material loaded onto the flow cells ranged from 42 ng to 1440 ng of DNA/RNA. All the RNA runs described in the current study are from direct sequencing of RNA; complementary DNA (cDNA) was not examined.
The output of sequencing runs was assessed based on several metrics: (i) The sequencing yield, defined as the overall amount of sequenced bases, depends on the runtime of the flow cell. We decided to compare the total number of bases produced after 12 h. We chose 12 h because some runs included in this study were stopped after that time to wash the flow cell. Referred to as EB12 in this study and estimated by MinKNOW, EB12 ranged from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 7 \cdot 10^6\end{document} to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 28 \cdot 10^{9}\end{document} nucleotides. (ii) The median read length, calculated after basecalling, varied significantly, ranging from 419 to 23 746 bases. We chose to calculate the median instead of the mean, as it is more robust against outliers. (iii) Another key metric was the pore half time of the flow cells, defined as the duration of sequencing until only half of the initially active pores remained active. This value ranged from 1 to 59 h, highlighting variations in flow cell durability and performance across experiments. Due to artifacts observed in some flow cells, we manually re-evaluated the pore half time of 22 flow cells, as shown in Supplementary Table S3. (iv) Other flow cell parameters evaluated in this study include the number of active channels, active pores, and the age of the flow cells. The number of active channels, as reported in ONT run reports, ranged from 93 to 512 out of a possible 512 channels. (v) Similarly, the number of active pores varied between 749 and 1984 out of a total of 2048 available pores per run. As the number of active pores can increase shortly after starting sequencing, we used the maximum number of active pores during the first 10 min of each run. (vi) The flow cell age was calculated from the time of their arrival at the laboratory, as the production dates were not accessible. We analyzed flow cells that were used between 2 days and 249 days after their arrival, with ONT’s recommended maximum shelf life being 90 days when stored at 2–8°C. (vii) Last, we investigated how the amount of loaded library (in ng) affected the sequencing yield.
The following analyses provide deeper insights into flow cell performance and durability and have been conducted using custom scripts. Only unwashed sequencing runs were analyzed for the main figures. Runs that included wash steps were analyzed separately. We also tried to address the question, if short reads are sequenced first, which is topic of interest to the community. However, our results were inconsistent in this respect, making it difficult to draw a clear conclusion.
Challenges in library preparation for ONT sequencing
Library preparation for ONT sequencing involves multiple steps. For instance, DNA library preparation using the Ligation Sequencing DNA V14 kit (SQK-LSK114) entails DNA repair and end-preparation, followed by adapter ligation and clean-up, before priming and loading the prepared library onto the flow cell. An alternative, if short preparation times are a main focus, is the Rapid Sequencing kit V14 (SQK-RAD114), which reduces the preparation time to as little as 10 min at the cost of sequencing accuracy. For DNA, multiplexing is possible [40] using the Native Barcoding Kit 96 V14 (SQK-NBD114.96).
Similarly, library preparation for direct RNA sequencing (SQK-RNA004) begins with adapter ligation of the reverse transcriptase adapter, followed by a reverse transcription reaction to stabilize the single RNA strand, and concludes with the ligation of the RNA ligation adapter, priming, and flow cell loading. Each stage of these protocols requires multiple handling and washing steps, such as pipetting, mixing, and centrifugation, which introduce challenges in maintaining nucleic acid length and achieving consistent yields.
Obtaining long reads
To obtain the longest possible DNA or RNA reads, it is crucial to minimize any shearing of the nucleic acids. To avoid shearing during library preparation, it is important to reduce the suction force when pipetting. This can be minimized by using cut tips and pipetting slowly, as recommended by Prall et al. [41]. Additionally, vortexing is considered too harsh for nucleic acids, as it often results in shorter fragments, so it is advised to use gentle tapping for mixing instead [42].
Many DNA/RNA isolation kits, particularly those using bead-beating- or column-based methods, are known to shear and fragment DNA nucleic acids [43, 44]. We recommend the classic phenol-chloroform extraction method, as it preserves longer fragments. Recently, there has been a growing interest in high molecular weight DNA extraction techniques. Innovations in this area include the surface topography of silica lamellae and new solid-phase methods introduced by different brands, which aim to improve yield and ease of handling. These kits are often more user-friendly but typically come at a higher reagent cost [45, 46].
In our analysis of DNA and RNA sequencing reads, we mainly used slow and gentle phenol-chloroform extraction to avoid nucleic acid shearing, to maximize read length, and for RNA to include small non-coding RNAs ( \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} <\end{document} 200 nucleotides), which are lost in many column-based methods [47]. We achieved notably long reads in both DNA and RNA samples. The longest DNA read obtained was 1.4 Mb, which came from a metagenomic sample (sample ID: 117.1) and was classified as Homo sapiens using kraken2 [48] and had a GC content of 41.4%. While nanopore sequencing is generally more robust in handling GC-rich regions than polymerase chain reaction (PCR)-dependent platforms, high GC content can still affect read quality, base-calling accuracy, and sequencing coverage [49, 50]. For RNA, the longest read was 175 kb from sample ID 103.1, which was classified as bacterial with a GC content of 68.6%.
Adjustment of loading amount
Loading dependent on pore availability
While it may seem intuitive to increase the amount of DNA/RNA loaded to enhance data yield, there is a threshold beyond which additional loading may not yield proportional benefits (https://nanoporetech.com/document/chemistry-technical-document). This is particularly true due to saturation effects caused by a limited number of pores on the flow cell. The quantity of data generated is directly proportional to the initial number of active pores (Fig. 2A) and channels (Fig. 2B). Linear regression showed a positive correlation between the initial number of active pores and EB12 (R^2^ = 0.38, P \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} <\end{document} .001) and between the initial number of active channels and EB12 for DNA (R^2^ = 0.36, P \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} <\end{document} .001). For RNA, we observed a weak positive correlation between the initial number of active pores and EB12 (R^2^ = 0.17, P = .0271), but no significant correlation between the initial number of active channels and EB12 (R^2^ = 0.05, P = .257).
(A) The number of active pores (maximum number of active pores during the first 10 min) at the beginning of the sequencing run seems to correlate with EB12. Note the generally lower number of active pores for RNA runs. (B) The number of active channels at the beginning of the sequencing run has a weaker association with EB12. Although a high yield is only achievable when many channels are active, starting with many active channels does not automatically guarantee a high EB12. For DNA, a correlation can be seen between the number of estimated bases after 12 h (EB12) and the number of pores and channels at the sequencing start, respectively. For RNA, a relatively weak trend can be seen only for the number of active pores. (C) For DNA, the number of estimated bases after 12 h is independent of the amount of starting material loaded onto the flow cell (R2 < 0.01, P = .842). For RNA, a relatively weak trend shows increased amount of estimated bases after 12 h with increased amount of starting material loaded onto the flow cell (R2 = 0.19, P = .0165). NB, the regression line and R2 values are shown from linear regression, which resulted in the higher AIC values for DNA compared to second degree polynomial and exponential regression, see Supplementary Table S4.
Our experiments demonstrated that for flow cells with at least 1500 active pores, a DNA load of 350 ng produced reasonable sequencing output. Based on our experience, we found that loading 300 ng of DNA worked well for flow cells with ~1200 active pores, while ~250 ng was suitable for those with ~800 active pores, as detailed in Supplementary Table S1.
In most ONT protocols, the loading amount is recommended in ng or μg; however, especially for genomic DNA (gDNA), an amount in fmol is also given. For first-time use of new preparation methods or sample types, we would recommend to check the length of prepared nucleic acids and calculate the amount in fmol to ensure it falls in the range of recommendations of ONT—and always check the actual protocols as these recommendations change with new chemistry, flow cells or updated protocols. Please note: 1 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mu\end{document} g DNA of 8 kb fragments are 200 fmol; 5 kb are ~300 fmol; 50 kb are 30 fmol.
In our experience, when working with limited amounts of input DNA or RNA, using flow cells with a reduced number of active pores at the start of the sequencing run is sufficient for downstream analyses. However, as expected, lower input material consistently resulted in a reduced number of sequenced reads.
As expected, the library amount [ng] did not influence EB12 for the DNA samples (R^2^ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} <\end{document} 0.01, P = .842), since early in the sequencing run, throughput is limited by the number of active channels, not library availability (see Fig. 2C and Supplementary Table S5). This might change during sequencing, when library is consumed and active pores are still available. A relatively weak trend towards higher EB12 was observed when loading more starting material for RNA though (R^2^ = 0.19, P = .0165). Note that R10 flow cells exhibit similar behavior in this regard as the discontinued R9 flow cells.
Instead, having extra input material available is advantageous when employing repeated washing of the flow cell, as it helps to extend overall sequencing yield. Specifically, when planning to wash a flow cell, we recommend splitting the library for the number of planned sequencing runs in advance (see details in Section Washing and reusing flow cells).
Minimal DNA loads for flow cells
For optimal results in DNA sequencing, ONT recommends to start the library preparation with 1000 ng (equivalent to 100–200 fmol; see Section Loading dependent on pore availability) of total DNA (SQK-LSK114 DNA library preparation kit). Following library preparation, ~50% of the DNA is typically available for loading into the flow cell.
For samples with limited amount of DNA that cannot be amplified, the following methods have yielded successful results: Although ONT’s instructions recommend a minimum of 100 ng, one study showed that without special treatment, 1 ng of DNA (prior to library preparation) yielded 6118 reads with an N50 of 3907 bases [51]. This result indicates good quality; however, reduced input correlates with decreased output. The carrier sequencing method allows for the sequencing of very small amounts of target DNA by combining it with a larger quantity of non-target “carrier” DNA. This technique successfully detected as little as 0.2 ng of target Bacillus subtilis DNA combined with 1000 ng of carrier Lambda phage DNA, producing high-quality data without amplification [52]. The 10× Genomics Chromium Controller is an advanced microfluidics-based platform that enables high-throughput single-cell and spatial analysis of DNA, RNA, and proteins. Cells processed by the Chromium Controller will result in barcoded GEMs (Gel Beads in Emulsion), which can be used in downstream applications, like nanopore sequencing, to enhance the analysis of low-input samples [53–55]. This technology allows for the sequencing of gDNA in quantities as low as 50 pg.
In our study, when loading as low as 30 ng DNA (see Section Loading dependent on pore availability) onto a flow cell (435 ng before library preparation, sample ID 67.6) we successfully obtained 648 Mb sequencing data. In another run we loaded 50 ng onto the flow cell (sample ID 129.2), which resulted in 1.5 Gb data output. Simon et al. reported the successful use of 50 ng in metagenomic samples with eight species [56].
Minimal RNA load for flow cells
RNA sequencing requires poly-adenylated 3′ ends (poly-A tails), which serve as consistent binding sites for ligating adapter molecules to create a continuous RNA–adapter complex that can pass through the nanopore (see Section Adapter ligation). Ribosomal RNA (rRNA), which lacks poly-A tails, makes up the majority of cellular RNA; for instance, Escherichia coli cells are estimated to contain around 85 % rRNA [57]. rRNA is naturally excluded when using a poly-A-based sequencing approach. For optimal RNA sequencing, we advise the user to always check the latest protocol version on the ONT website. Sequencing in this study was performed with RNA002, where input recommendations were 500 ng of total RNA or 50 ng of poly(A)-tailed RNA (which is in agreement with the ratio of rRNAs in the cell). However, this changed to 300 ng of poly(A) tailed RNA or 1 ug of total RNA with the current RNA004 protocol and chemistry (version DRS 9195 v4 revF 11Dec2024).
Many RNA molecules, including certain messenger RNAs (e.g. histone-coding) and many transcripts (like ncRNAs and non-canonical microRNAs), lack poly-A tails [58, 59]. To sequence all RNAs except rRNAs, it is recommended to add poly-A tails to the sample and perform an rRNA depletion step following the poly-A addition by polyA polymerase [60]. A method for total RNA sequencing with only 10 ng input material has been reported [61]. In our study, 42 ng of RNA (sample ID 152.1) produced 157 Mb of data.
Using external control sequences
The use of external control sequences (spike-ins) is a critical aspect of controlling library preparation and benchmarking the sequencing process in nanopore workflows. External controls consist of nucleic acid molecules with known sequences, which are introduced into a sample in precise amounts. These molecules undergo all the same procedural steps as the sample’s endogenous nucleic acids and are only distinguished during the final stage of sequencing data analysis. Both the DNA ligation sequencing kit (SQK-LSK114) and the direct RNA sequencing kit (SQK-RNA004) contain positive control strands. The DNA Control Strand (DCS, Supplementary Fig. S1) is a 3.6 kb standard amplicon that maps to the 3′ end of the Lambda virus genome. It is provided at a concentration of 10 ng/μl (according to ONT Live support asking “What is DCS?,” https://nanoporetech.com, accessed 12.12.2024, see Supplementary Fig. S2).
The RNA Calibrate Strand (RCS) is derived from the Saccharomyces genome, specifically the Enolase II gene (ENO2, YHR174W). It is supplied at a concentration of 15 ng/μl (according to ONT Live support asking “What is RCS?,” https://nanoporetech.com, accessed 12.12.2024, see Supplementary Fig. S3). Both DCS and RCS are utilized as positive controls for the library preparation and sequencing processes; and therefore are not used for normalization of sequencing results. In our experiments, we detected ~1 % of DCS/RCS sequences in our samples.
In addition to the ONT kit-supplied control sequences, the emergence of various RNA-seq platforms and protocols has highlighted the need for versatile RNA spike-in controls. These controls can be incorporated and processed alongside actual samples, allowing researchers to monitor and compare essential performance metrics such as sensitivity, input-output correlation, and the detection and quantification of transcript variants. Spike-In RNA Variants (SIRVs) are specifically designed synthetic RNA molecules that replicate key features of transcriptome complexity. Added in very small quantities before library preparation, SIRVs can be processed identically to endogenous RNA, ensuring accurate assessment of sequencing performance.
Adapter ligation
In adapter ligation, the final library preparation step for nanopore sequencing, adapters are attached to sample molecules. These adapters contain motor proteins that guide DNA or RNA into the nanopore and regulate sequencing speed, making ligation efficiency crucial for optimal output. The exact concentration of the adapters in the mixes is proprietary; however, it is standardized for reactions involving ~100–200 fmol i.e. approx. 1 μg of DNA (see Section Loading dependent on pore availability), 50 ng for poly(A)-tailed RNA, and 500 ng for total RNA.
New England Biolabs (NEB) generally recommends maintaining a 10:1 adapter-to-sample ratio to ensure efficient ligation without excessive free adapters. According to ONT (Chapter 43, “Troubleshooting your run from the pore activity plots”), high adapter-to-sample proportions generally do not hinder sequencing as long as the majority of pores are actively sequencing.
Amplicon sequencing
Amplicon sequencing is a common strategy to enrich low concentrations of nucleic acid target material from complex samples for subsequent sequencing [62–64], including vector sequence confirmation, targeted sequencing of full-length cDNAs, confirmation of disease-causing variants, and virus sequencing. Here, we focus on viral sequencing as one of many examples for amplicon sequencing [65–67]. Amplicon sequencing enriches specific DNA regions by amplifying them via PCR with specific primers. The resulting PCR products, or ‘amplicons,’ are then sequenced to provide high-resolution information on the targeted regions. A key example of amplicon sequencing is its application in viruses, such as SARS-CoV-2 during COVID-19, for cost-effective sequencing and genomic surveillance [68]. ONT has specific advantages in sequencing long amplicons without fragmentation, e.g. to retain information of co-occurring mutations on single amplicons supporting virus variant deconvolution and lineage assignment [69].
In nanopore sequencing, amplicons should be mixed in equal nanomolar concentrations [70–72]. When working with amplicons of the same size, mass concentrations (nanograms per microliter) can be used instead of molar concentration. Due to the specific target enrichment, amplicon sequencing can ensure sufficient sequencing depth to reliably detect genetic mutations, supporting detailed genomic analysis in fields requiring accurate variant detection.
Amplicon sequencing can be an effective approach to explore entire virus genomes with low viral titers. While a lot of work was done on SARS-CoV-2, amplicon designs are increasingly being explored for other viruses, such as plum pox virus (PPV). High-quality assemblies require overlapping amplicons, as accurate strain identification is crucial for diverse pathogens. In some protocols, however, such as the widely used ARTIC protocol (https://github.com/artic-network/artic-ncov2019) for SARS-CoV-2 sequencing, smaller amplicons of 392 bp are employed, with an overlap of 90 bp. In our study of SARS-CoV-2, we achieved high-quality results using longer amplicons of ~1200 bp, with an overlap of 117 bp between consecutive fragments [73]. We demonstrate an application of amplicon sequencing for the PPV genome, which exhibits substantial genetic diversity across strains (Fig. 3A). We gained a uniform distribution of coverage across the entire genome (see Fig. 3B), demonstrating the effectiveness of well-designed amplicons for comprehensive genomic representation. Monitoring the amplicon coverage is essential, as the aforementioned diversity may result in poor amplicon performance due to insufficient primer binding in mutated regions.
(A) Genetic diversity shown for each position in the PPV genome across 109 PPV isolates of all known PPV strains indicates reduced genetic diversity in the P3 gene region (purple) and enhanced genetic diversity in the genetic region encoding for the capsid protein (CP, orange). (B) Coverage comparison of three sequencing methods: direct RNA sequencing using nanopore (green), Illumina shotgun RNA-seq (blue), and nanopore-based PCR amplicon cDNA sequencing (pink) at the example of PPV. A characteristic feature of amplicon sequencing is the clear visibility of higher coverage regions due to the overlapping amplicon design and specific enrichment.
Direct RNA sequencing
The direct RNA sequencing protocol provides an approach for sequencing RNA molecules directly without the need for converting them into cDNA [14]. This method allows for sequencing RNA in its native form, thereby preserving RNA modifications, without amplification, and potentially capturing full-length RNA transcripts. An optional library preparation step involves synthesizing a cDNA strand from the RNA template through reverse transcription. Formation of RNA:cDNA hybrids can increase throughput and improve read quality by resolving complex RNA secondary structures, and can result in more consistent sequencing data [74].
However, this approach has two downsides: (i) prolonged incubation of RNA at high temperatures with divalent cations during cDNA synthesis can cause RNA degradation [75, 76]; (ii) common reverse transcriptases may cleave RNA strands. This can reduce the availability of full-length RNA and impact the efficiency and accuracy of direct RNA sequencing [77, 78].
A distinctive feature of ONT’s direct RNA sequencing is its ability to detect RNA modifications. However, the bioinformatics analysis of these modifications is still underdeveloped. In this work, we focus intensively on this aspect, see Section Hidden treasures in raw data.
Technical challenges in nanopore sequencing
Maximizing sequencing yield
Sequencing yield, also referred to as throughput, represents the total amount of data generated during a sequencing run. According to Wang et al. [2], the expected sequencing yield of a flow cell primarily depends on three factors: the number of active nanopores, the DNA/RNA translocation speed through the nanopore, and the running time. If the translocation speed is too fast, the system may struggle to differentiate between nucleotides, resulting in inaccurate sequencing. Conversely, if the translocation speed is too slow, throughput is reduced. The translocation speed is largely dependent on the motor protein used. For DNA, it typically operates at around 400 nt per second [2], while for RNA, the speed is slower, at ~130 nt per second, as stated by ONT. This speed can be monitored in real time if live basecalling is enabled, although it can only be controlled indirectly. For example, increasing the temperature may accelerate the translocation speed, whereas overloading the flow cell might lead to a reduction in speed.
To analyze sequencing yield, we explored over 300 ONT sequencing experiments across a variety of species, see Section Data characterization and description, Supplementary Table S1. We observed significant variation in sequencing yield and investigated potential reasons for this yield disparity (Figs 2 and 4). We first investigated which of the measured variables has an influence on EB12. A multiple linear regression analysis revealed that EB12 was significantly predicted by the number of active pores at the start of the sequencing run, pore half time, sequencing material (DNA, RNA), sample type, and flow cell version (R9, R10), see Table 1.
Maximizing sequencing yield. Sequencing yield from R9 (dots) and R10 (triangle) for DNA (orange) and RNA (blue) samples from a variety of species. Differences between DNA and RNA are evident in all plots, with a lower sequencing yield for RNA in general. Note that the lower RNA yield is dependent on the slower sequencing speed [2] and the generally low number of active starting pores during RNA runs. (A) Quadratic regression shows a significant correlation of pore half time on estimated number of bases after 12 h of sequencing (EB12) of DNA for virus, metagenomic, insect, mouse, and human samples (R2 > 0.7 and P-value < .01), but not for bacteria and protist (P-value > .05). The bacteria samples show high variability, and protist sample size is very low. For RNA, a significant correlation was found in the virus samples (R2 = 0.8 and P-value < .01). Regarding R9 and R10 data separately, the correlation for bacteria slightly increases (R9: R2 = 0.39, R10: R2 = 0.11), but still remains weaker than for the other sample types. For the other sample types, the correlation does not change when separating R9 and R10 flow cell data. (B) The EB12 yield is consistent across all sample types and flow cell type. (C) In contrast to EB12, the distribution of pore half time differs strongly between the sample types. Insect samples have a lower pore half time than other samples. Metagenomic and bacteria samples show a high variance. Data shown in panels (B) and (C) indicate differences between R9 and R10 data for metagenomic samples, see Supplementary Fig. S8. (D) The median read length does not influence EB12 (nor the amount of estimated bases after 24 and 36 h, see Supplementary Fig. S9) for DNA. For RNA, a weak positive correlation is visible, which is slightly stronger after 24 and 36 h, see Supplementary Fig. S9. But note that there are only a few data points for RNA after 24 and 36 h, and the amount of starting pores for RNA was generally lower. (E) The age of the flow cell (number of days from arrival to usage) does not affect EB12, although our data for R10 flow cells are limited.
Next, we analyzed the associations of the single variables in more detail, starting with the investigation of EB12 and the pore half time, which refers to the duration after which half of the nanopores in a sequencing run become inactive. Quadratic regression analysis revealed a significant correlation between pore half time and EB12 for DNA within all sample type groups except bacteria and protists (Fig. 4A) consistent for both R9 and R10 flow cells. This correlation remains significant even when normalizing EB12 for the number of starting pores, as shown in Supplementary Fig. S5. As expected, a slower degradation rate of the flow cell leads to a higher sequencing yield.
Next, we analyzed the impact of sample type on EB12. The DNA sequencing yields in our data are generally higher than those for RNA, as shown in Fig. 4B, likely due to increased translocation speed and higher number of starting pores. Although DNA mouse samples had a slightly higher EB12 than other sample types with P-values of .04 or below (except P = .067 for viruses), see Supplementary Fig. S6, it appears that after 12 h of sequencing, the sequencing yield was roughly similar, regardless of the sample type group. However, multiple linear regression analysis revealed that sample type, especially metagenomic samples, significantly influences EB12, indicating subtle but systematic differences between sample categories.
In contrast, the pore half time seemed to vary significantly depending on the type of sample being sequenced (Fig. 4C, with a median pairwise P-value of .011, see Supplementary Fig. S7). For DNA samples, for instance, we found that insect and protist samples tended to have a much shorter pore half time than mouse, metagenomic, or bacterial samples. Furthermore, flow cells used for bacterial, metagenomic, and human samples exhibited high variability in pore half time. The flow cell pore half time appeared to be influenced not only by DNA/RNA characteristics but also by the sample type. We assume these differences might originate from factors such as the quality and purity of DNA/RNA, the extraction method, the sequencing kit used, or the cell type composition. Further research is needed to investigate those factors.
Despite variations in pore half time, the distribution of EB12 appears more consistent across sample type groups, even though the influence of (metagenomic) sample type on EB12 was demonstrated in multiple regression analysis in Table 1. Notably, insect and protist samples, which exhibit shorter pore half time, did not stand out as outliers regarding EB12. To conclude, the sequencing yield produced within the first 12 h (EB12) varied less between the sample types in contrast to pore half time. When comparing data from R9 and R10 flow cells, as shown in Supplementary Fig. S8, a significant reduction in EB12 was observed when using R10 flow cells for bacteria, metagenomic, and viral samples (i.e. all sample type groups for which data from both R9 and R10 flow cells were available). These findings align well with the influence of flow cell type on EB12 identified in the multiple regression analysis in Table 1. Additionally, we analyzed the association of the buffer type (LFB/SFB) with EB12. Though a t-test comparing EB12 for SFB and LFB on R9 flow cells shows significant results (see Supplementary Fig. S4), multiple regression analysis revealed confounding effects by the other variables (see Table 1), as the amount of active pores at the start of the sequencing run. Taken together, after 12 h, the sequencing output depended on the number of active pores at the start of the sequencing run (Fig. 2) but remained relatively consistent regardless of sample type and flow cell type. The sample type seemed to be related to the degradation rate of the flow cell.
Most runs had a median read length below 3 kb, with EB12 values ranging from 0 to nearly 8 Gb and showing a fairly even distribution. The median read length had no impact on EB12 (see Fig. 4D), nor the amount of estimated bases after 24 and 36 h (see Supplementary Fig. S9). Here again, we see a clear distinction between DNA and RNA EB12 values. The R10 median read lengths aligned with those of R9 data.
We then turned our attention to the properties of the flow cells themselves. The age of the flow cell (the time between delivery and use) did not significantly affect EB12. Surprisingly, flow cells over 200 days old (Fig. 4E) performed similarly to younger ones, although exceeding the ONT warranty period of three months by a factor of two. The oldest flow cell we tested was 249 days old, performing normally (1080 pores, 3.1 Gb output), see Supplementary Table S1, ID 172.1.
In summary, a high EB12 is only achievable when many pores/channels are active at the beginning of the sequencing run (Fig. 2A and B). However, starting with many active channels does not automatically guarantee a high EB12. On the other hand, as expected, the number of active pores at the beginning of a sequencing run influence EB12, particularly when sequencing DNA (Fig. 2A).
Washing and reusing flow cells
Flow cells naturally degrade with use as active pores decrease, often due to blocking during sequencing. Blocked pores shift from “single pore” to “unavailable” in MinKNOW. While ONT acknowledges pore blocking, the reasons remain unclear. Pores can be unblocked using voltage reversal during mux scans or by washing the flow cell.
Washing is performed with ONT’s “Flow Cell Wash Kit” (EXP-WSH003, EXP-WSH004) to digest and remove loaded genetic material, which free blocked pores. The washing process involves flushing the flow cell with the wash buffer provided in the Wash Kit to remove residual sequencing reagents. This is done by slowly injecting the washing solution through the priming port and then carefully removing it after 1 h incubation to maintain pore viability. It is crucial to avoid introducing air bubbles. The cost of washing is minimal, ~€16 per washing step, as the Wash Kit EXP-WSH004 costs €95 for six washing steps (23 October 2024). The sequencing run can be paused or stopped before washing: pausing allows continuation in the same file, while stopping generates a new file for subsequent sequencing. After washing, the flow cell can be reused for sequencing a new sample, but the previous run must be stopped rather than paused. Washed flow cells can also be stored for future use. However, they will not perform as well as new ones; the number of active pores will be lower than initially but higher than before washing (see Fig. 5).
Influence of washing the flow cell on the amount of active pores. (A) A flowcell which has been paused (blue bars) two times for washing. (B) A flowcell which has been stopped (blue bars) five times for washing.
Washing can also be employed during a single sequencing run to increase sequencing yield by reactivating blocked pores. We identified three key factors to consider before sequencing: (i) potential washing steps, (ii) optimal washing timing, and (iii) available sample material. We investigated the impact of washing flow cells on pore recovery by analyzing 34 flow cells that had undergone 1–6 washes, totaling 58 washing steps, see Supplementary Table S1. Exemplarily, Fig. 5 depicts two such washing steps. Statistics of these two flow cells are displayed in Supplementary Table S6. In Fig. 5A, a human DNA sample (ID 111.1) was washed twice during sequencing (at 10 and 30 h), pausing the run (blue bars). Active pores decreased over time (red arrows) but increased after each wash (blue arrows). In Fig. 5B, a cricket DNA sample (ID 124.1–124.6) was washed five times during sequencing. A mean loss of 84.2% of active pores could be compensated by recovering on average 85.6% of those pores. ONT recommends washing and reusing a flow cell up to four times. Our findings show that flow cells can be washed at least five times and still produce data.
A key factor for high sequencing yield is determining the appropriate time point for washing, as yield depends on the number of active pores, see Fig. 2A. It appears that washing does not cause pores to block more quickly. Therefore, we recommend monitoring the cumulative output in MinKNOW: when the curve flattens (Fig. 5A, green line), we advise washing.
To ensure sufficient material for reloading, we recommend splitting the sample before starting the sequencing run if washing is planned to increase yield. For instance, when planning two wash steps, we recommend allocating 50%, 30%, and 20% of the total library for the initial run and subsequent washes, respectively. This decreasing allocation accounts for the reduction in active pores after each wash. Adding excess material beyond the capacity of available pores will not further increase sequencing yield (see Figs 2C and 5).
Adaptive sampling
Adaptive sampling allows for real-time selection or rejection of DNA fragments based on their sequence content, enabling targeted enrichment or depletion of specific genomic regions, organisms, or sequence types during a sequencing run (https://nanoporetech.com/document/adaptive-sampling) [79]. For example, adaptive sampling can enrich gene panels, CpG islands, or exomes [80], eliminate unwanted sequences such as highly repetitive regions or contaminants, and target specific organisms or deplete host in metagenomic studies [26, 81, 82]. This approach enhances coverage of targeted genes or regions without requiring complex library preparation steps like amplification or hybridization probes.
As DNA molecules pass through a nanopore, the device generates real-time sequence data (Fig. 6A). This requires a GPU and a fast basecalling model for adaptive sampling. The system quickly compares each fragment’s partial sequence (~400 bases, corresponding to one second of data) to a reference genome or user-defined targets. When enriching for regions of interest, the sequencing continues only if the sequence matches the region of interest; otherwise, the DNA fragment is ejected, conserving flow cell resources and improving yield for desired regions. The depletion mode works vice versa.
A) The basic principle of adaptive sampling: For each read, the first ~400 bases are mapped to a given reference, and sequencing continues only if they match a region of interest [26]. About 450 bases are sequenced per second for R9 nanopore flow cells [2]. Rejecting a read takes ~0.5 s, with additional time required to capture the next read. (B) In order not to miss the reads where the region of interest is not at the beginning of the read, the target region should be defined with an extended buffer (e.g. 2000 nt) on both sides in the configuration .bed file. (C) Adaptive sampling was used to enrich over 30 000 CpG islands (CGIs) to increase their coverage. To increase enrichment, each CGI region of interested was extended by a 500 nt buffer region and by a 2000 nt buffer region on both sides.
As the acceptance or rejection of a read is based solely on its first chunk, strands may be rejected, if the region of interest is not located at the beginning of the read. To address this, the target region of interest should be extended by adding a ‘buffer’ on both sites in the configuration file to accommodate sequencing in either direction, as sequencing can initiate from either DNA strand (Fig. 6B). Thus, for adaptive sampling, the buffer describes a certain number of nucleotides by which the region of interest can be extended in the bed file (a text file format used to store genomic regions).
Applying adaptive sampling may yield a higher number of short reads (reducing N50 and median read length) and necessitate re-basecalling with a higher accuracy model post-sequencing to enhance the data quality.
We applied adaptive sampling to enrich CpG islands in the human genome, see Supplementary Table S7. We targeted ~30 000 CpG islands, with an average length of 777 nt (IDs 153.1–178.1). We applied adaptive sampling without adding a buffer region, using a 500 nt buffer region, and a 2000 nt buffer region on both sides of the target regions. We achieved up to 17 × mean sequencing depth on CpG islands, representing a fourfold increase compared to the mean sequencing depth of the whole genome, when applying 2000 nt buffer (Fig. 6C). According to ONT, the theoretical maximum increase in sequencing depth achievable through adaptive sampling is ~5–10-fold (https://nanoporetech.com/document/adaptive-sampling). In summary, adaptive sampling can effectively enhance sequencing depth in specific genomic regions or target species, albeit with increased time requirements for post-processing, such as re-basecalling and filtering out short, unwanted reads.
Challenges in analyzing ONT data
Building an ONT pipeline
Bioinformatics analysis of ONT sequencing data is based on modular pipelines that can be assembled and utilized effectively, even by users with limited computational backgrounds (Fig. 7). Instead of relying solely on standard, one-size-fits-all pipelines, it is critical to carefully select tools that best align with specific experimental goals and data characteristics. The following pipeline steps are very similar between DNA and RNA sequencing with ONT up to and including the mapping step. Afterwards, the processing steps usually diverge and become more specific [83].
The standard pipeline for nanopore sequencing data involves basecalling of the raw data, followed by quality control (QC) and filtering. The reads can then be analyzed with specialized long-read tools for mapping, polished assembly, or other workflows.
A well-constructed ONT analysis pipeline includes several steps, which can be addressed by different tools:
Basecalling is the transformation of raw signal data, or “squiggles,” into nucleotide sequences using neural networks. Typically, the basecalling tools have a pre-processing step, such as removing the adapters and normalizing individual signals. The following three tools are all provided by ONT. Currently, Dorado (https://github.com/nanoporetech/dorado) is the state-of-the-art software for basecalling on ONT platforms that supports the older fast5 and the new pod5 format, which is a more compact format enabling faster and more efficient file access. Dorado includes basecalling models for all ONT kits, predicts modified bases, and has built-in demultiplexing capabilities for barcode detection from ONT barcoding kits. Bonito (https://github.com/nanoporetech/bonito) is an alternative tool that allows users to train their own neural networks for basecalling. Remora (https://github.com/nanoporetech/remora) adds an additional layer, enabling nucleotide modification calling by training neural networks along the standard bases. Alternative tools for nucleotide modification prediction are discussed in detail below. Basecalling results are typically saved in fastq format for subsequent analysis steps.
After basecalling, QC is essential to evaluate the reliability of the sequencing data and to identify any potential issues in the dataset, such as degradation, contamination, and read length distribution. One of the most comprehensive QC tools for ONT data is pycoQC (https://github.com/a-slide/pycoQC)) [84], which provides an interactive html overview that includes information on read quality, read length, sequence coverage, active channels, and quality over sequencing time. Other QC tools, such as NanoPlot (https://github.com/wdecoster/NanoPlot) [85] and nanoQC (https://github.com/wdecoster/nanoQC), provide a restricted functionality, but with a different representation of the data.
Following QC, filtering is often necessary, especially for experiments requiring specific read quality or length thresholds. Filtering tools help to remove low-quality reads, adapters, barcodes, or reads outside the desired range, increasing the overall quality and relevance of the dataset. Chopper (https://github.com/wdecoster/chopper) from the NanoPack suite [85] or Filtlong (https://github.com/rrwick/Filtlong) are widely used for filtering ONT reads. Dorado also provides parameters to trim adapters and barcodes.
Once high-quality reads are obtained, subsequent analyses such as mapping and assembly can be performed using specialized tools designed for long-read data. For read mapping, minimap2 (https://github.com/lh3/minimap2) [86] is widely used. Winnowmap (https://github.com/marbl/Winnowmap) [87] is an alternative mapping tool that has been specifically optimized for complex genomes. It has demonstrated superior performance when mapping both simulated PacBio and ONT data for human genomes [87]. For assembly, Flye is one of the most widely used tools, demonstrating good performance in terms of contig length and error rate in bacteria [88]. It includes polishing and can be told how many iterations of polishing should be performed. Alternative tools, such as Shasta (https://github.com/paoloshasta/shasta) [89], have been shown to perform faster, while NextDenovo (https://github.com/Nextomics/NextDenovo) [90] can generate longer contigs, particularly in the case of mollusks [91]. Error correction (polishing) can be subsequently utilized to refine the assembly. Medaka (https://github.com/nanoporetech/medaka) is aN ONT-specific neural networks, trained on typical ONT errors to correct these errors effectively. Medaka has proven particularly effective in reducing deletion errors [88]. Additionally, Racon (https://github.com/isovic/racon) [92] is a consensus caller using high quality reads from technologies such as Illumina to reduce single nucleotide variants and insertion errors, enhancing the overall accuracy of the assembled data.
Hidden treasures in raw data
Beyond the standard pipeline of ONT tools, the raw data from ONT sequencing contains information about nucleotide modifications. ONT provides basecalling models to detect specific modifications using the Dorado basecaller, such as 4mC, 5mC, 5hmC, and 6mA for DNA, and m5C, m6A, Ino, psU and 2′-O-me modifications (mC, mG, mA, and mU) for RNA; see Supplementary Table S2.
To detect other modifications, using modification detection tools or analysis of the raw ONT data is needed. Analyzing the raw signal requires significant self-implementation and detailed knowledge. Here, a demonstration of the fundamentals necessary to fully exploit the potential of ONT by using DNA and RNA modification detection as an example is given. This description will pave the way to complement existing tools for nucleotide modification analysis, which are currently designed for only a small fraction of modifications. To achieve this, we focus on processing the raw data, which can be divided into five distinct phases: (i) Accessing raw data: in this initial step, the raw signal data from the nanopore sequencer is imported into the computational analysis. This is particularly important because it serves as the foundation for all subsequent analyses. (ii) Improving Signal Segmentation: is applied to refine the process of identifying individual signals corresponding to nucleotides. Improving segmentation can help to enhance the accuracy of the subsequent basecalling step by ensuring that the software can effectively distinguish between different signal components. (iii) DNA Modification Detection: Once the signal data is well-segmented, this step focuses on identifying specific modifications to DNA (e.g. methylation). Detection of DNA modifications is important for understanding epigenetic regulation and gene expression, providing insights into cellular functions and biological processes. (iv) RNA Modification Detection: Similar to DNA modification detection, this step involves identifying modifications in RNA sequences. RNA modifications can influence gene expression and stability, making their detection crucial for studying transcriptomics and understanding post-transcriptional regulation. (v) Differential Signal Detection: Alternatively to predicting modification types directly, the differences in signal patterns between conditions or samples can be analyzed to reveal a general change of the nucleotide.
Preprocessing raw data
ONT stores the “squiggles” as integers [93] to save storage space, which can be converted back to pA; see Supplementary Equation S1. Normalization of the raw ONT signals is essential for reliable comparisons within and across sequencing experiments. Signal variations caused by differences in flow cells, nanopores, and sensors can introduce biases (Supplementary Fig. S10), making normalization (Fig. 8A) critical for accurate and reproducible analyses (Supplementary Equation S2). Outliers in ONT signals, as in Fig. 8B, are often caused by faulty sensor measurements and can distort downstream analyses. To ensure data quality, these outliers should be identified and filtered prior to further signal processing. The ONT signal, comprising a continuous series of data points, may occasionally include multiple (adjacent) sequencing reads, as shown in Fig. 8C. The reason for this phenomenon is unresolved. To ensure accurate and reliable downstream analyses, such signals must be split into individual reads using the coordinates provided by Dorado.
Overview of signal data preprocessing steps. (A) To standardize the data, the unnormalized time series (left) are converted (arrow) into the corresponding normalized signal (right, mostly ranging from −4 to 4). (B) Detected outliers (pink circles) indicate points that deviate significantly from the expected signal pattern, likely due to sensor errors. (C) A single signal can contain more than one read. Here indicated with signal containing three recurring components for “Read 1” and “Read 2” specific to RNA sequencing: “adapter,” “poly A,” and “transcript.”
The python package read5 (https://github.com/rnajena/read5) is a wrapper that unifies access to fast5, pod5, and slow5 data formats and provides functions for signal conversion (Supplementary Equation S1) and normalization (Supplementary Equation S2). Each of these file formats has its own API using different function names for the same functionality, which often requires to write specific code to handle these inconsistencies.
Improving signal segmentation
The analysis of raw data plays a crucial role for detecting chemical nucleotide modifications. This is particularly relevant for the vast number of around 170 RNA modifications for which no basecalling models currently exist. These modifications can produce characteristic signals when translocating through the nanopore. Analyzing raw signals for modification detection requires improved signal segmentation, known as ‘resquiggling.’ This process involves realigning the raw current signal to the nucleotides, see Fig. 9. This segmentation is crucial because the motor protein that moves the nucleotides through the pore does not operate at a constant speed. The time a nucleotide spends in the pore (‘dwell time’) varies depending on the specific nucleotide [94].
Overview of signal processing and modification calling. The raw ONT signal can be segmented and corrected for basecalling errors by resquigglers using the raw signal and basecalls. Some resquiggling tools also require a reference sequence and a mapping.
Several tools have been developed to facilitate the resquiggling process and generally require (i) the raw ONT signal (fast5, slow5, or pod5) files and (ii) the basecalls generated by the basecalling process (fastq). Optionally, (iii) a reference sequence (fasta) and (iv) the alignments of the basecalls to the reference (bam) can be added. The most widely used tool f5c [95] is a GPU-accelerated, multi-threaded re-implementation of Nanopolish Eventalign. It offers significant reductions in runtime while maintaining the known output formats. Notably, f5c also supports the newer pores, such as R10 and the new RNA pore (RP4). For instance, a sequencing run of a 32 000 bp Coronavirus genome (sample ID 13.1) with 1.3 Gb across 857 000 reads produced a 181.5 GB output file using Nanopolish Eventalign, which was reduced to 84 GB with f5c. Tombo has not been supported by ONT since 2020 and is compatible only with older single-read fast5 formats. Despite its limitations, Tombo remains in use today for specific applications.
Future developments in resquiggling tools are expected to further enhance compatibility with emerging nanopore technologies and data formats.
Detecting DNA modifications
The direct detection of DNA modifications without amplification or pre-treatment of the DNA is an exciting feature of nanopore sequencing, potentially providing insights into epigenetic regulation and genomic variability. DNA modifications alter the electrical signal as the DNA passes through the nanopore resulting in a differing picoampere signal for k-mers (k = 5 for R9; k = 9 for R10) containing modified bases compared to those consisting of only unmodified bases, see Fig. 10. Currently, 17 types of DNA modifications have been identified in the genomes of bacteria and eukaryotes [96], with the most well-characterized being methylation of cytosine (i.e. 5-methylcytosine, 5mC, and 5-hydroxymethylcytosine, 5hmC [97]) and adenine (i.e. N6-methyladenine, 6mA [98]).
Modified nucleotides have their own characteristic signal. Machine learning models can detect this signal directly or by comparing it to the unmodified counterpart. The red and blue lines represent the mean signal of 200 reads, respectively, while the vertical bulges indicate the signal distribution per base following a normal distribution. Modifications can lead to error patterns in the reads that can be detected by tools.
Several computational tools have been developed to identify modifications from nanopore data, with a primary focus on 5mC and 6mA, using different computational approaches, in particular Hidden Markov Models, statistical tests, or neural networks, see Supplementary Table S2. The required input data for the methylation calling tools are usually the raw or basecalled fast5/pod5 files.
The DNA modification calling was traditionally performed after basecalling using tools like Nanopolish [99] or ONT’s Megalodon (https://github.com/nanoporetech/megalodon). Currently, ONT’s Dorado (https://github.com/nanoporetech/dorado) detects CpG methylations directly during basecalling if specified, and provides greater accuracy than tools like Nanopolish [100], which has difficulty when nearby CpG sites differ in methylation [99]. The performance of modification-detection tools varies by slight over-prediction (e.g. Nanopolish) or under-prediction (e.g. DeepMod) as described in several reviews [25, 100–103]. Validation of modifications remains essential, although tools like Nanopolish, Megalodon, DeepSignal, and Dorado already show high correlation with the gold standard whole-genome bisulfite sequencing (WGBS) [100, 101]. Meta approaches improve modification detection accuracy by combining the outputs of multiple tools. For example METEORE combines results from up to six tools using a random forest model, significantly improving accuracy but at the cost of increased runtime [25].
Detecting RNA modifications
The detection of RNA modifications using nanopore sequencing is less developed compared to DNA modifications, possibly because RNA is known to contain over 170 distinct modifications [104, 105]. RNA modifications are of the utmost importance for RNA function and regulation.
Computational tools exist only for detection of a handful of RNA modifications, see Supplementary Table S2. Aside from these downstream analysis tools, ONT’s Dorado directly enables modification detection by specifying specific basecalling models. Users can further customize the process by manually disabling additional filters as needed. Excitingly, basecallers like Bonito and Remora allow researchers to train custom models for detection of modifications of any kind. These models can be created from scratch or fine-tuned from pre-trained versions, offering flexibility for different datasets and research objectives. The models can perform binary classification to predict a certain modification, or directly call modified bases alongside the standard nucleotide sequence.
Detecting differential signals
Comparing raw ONT signals between different samples provides a powerful approach for detecting differential signals, which can point to mutations, modifications, and isotopic labels [106]. The goal of this approach is not necessarily to identify the exact type of modification but rather to detect the presence of a modification. The signal may differ in its measured pA range, including mean, variance, skewness, kurtosis, and dwell time. The calculation of these measures can be applied to full-length nanopore reads. To minimize normalization biases, comparative analyses should ideally be performed using the same flow cell. When comparisons across multiple flow cells are unavoidable, further normalization steps are needed.
On the other hand, for example, in the case of modifications such as deuterium in the DNA/RNA backbone [106], normalization can inadvertently obscure biologically meaningful information. This is particularly true when the signal-to-noise ratio—where the signal represents the modification and the noise stems from instrumental bias—is low. To address these challenges, carefully controlled experimental designs are essential to ensure that observed differences in the nanopore signal reflect genuine modifications rather than artifacts of the sequencing process.
Conclusion
In summary, this study has yielded critical insights into refining nanopore sequencing using the MinION and GirdION sequencing platforms, with far-reaching implications for research and application. We displayed suggestions of nanopore sequencing for the user in the three categories: (i) library preparation tips; (ii) technical tips; and (iii) analyzing ONT data, see a summary in Fig. 11.
TIPS & TRICKS mentioned within the manuscript are summarized here for a better overview.
We reported general statistics for the number of sequenced bases after 12 h (EB12) on both R9 and R10 flow cells. As expected, the EB12 depended on the initial number of pores and the pore half time, which surprisingly varied depending on the sample type. In contrast, EB12 showed no correlation with the amount of input material, read length, or flow cell age. Additionally, we demonstrated that flow cells can still perform well after their expiration date. Meanwhile, we could also verify most of the observations on R10 flow cells. However, for the current study the sample size is too small for statistically robust conclusions.
With this publication, correlations and assertions from the ONT community have been statistically substantiated and are now citable for future work.
Supplementary Material
gkag023_Supplemental_File
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Laszlo AH, Derrington IM, Ross BC et al. Decoding long nanopore sequencing reads of natural DNA. Nat Biotechnol. 2014;32:829–33. 10.1038/nbt.2950.24964173 PMC 4126851 · doi ↗ · pubmed ↗
- 2Wang Y, Zhao Y, Bollas A et al. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39:1348–65. 10.1038/s 41587-021-01108-x.34750572 PMC 8988251 · doi ↗ · pubmed ↗
- 3Petersen LM, Martin IW, Moschetti WE et al. Third-generation sequencing in the clinical laboratory: exploring the advantages and challenges of nanopore sequencing. J Clin Microbiol. 2019;58:e 01315–19. 10.1128/JCM.01315-19.31619531 PMC 6935936 · doi ↗ · pubmed ↗
- 4Si HQ, Wang P, Long F et al. Cancer liquid biopsies by Oxford Nanopore Technologies sequencing of cell-free DNA: from basic research to clinical applications. Mol Cancer. 2024;23:265. 10.1186/s 12943-024-02178-6.39614371 PMC 11605934 · doi ↗ · pubmed ↗
- 5Meyer D, Hennig A, Hums AB et al. Nanopore sequencing-derived methylation biomarker prediction for methylation-specific PCR in patients with head and neck squamous cell carcinoma. Clin Epigenetics. 2025;17:149. 10.1186/s 13148-025-01960-7.40946119 PMC 12433006 · doi ↗ · pubmed ↗
- 6Nurk S, Koren S, Rhie A et al. The complete sequence of a human genome. Science. 2022;376:44–53. 10.1126/science.abj 6987.35357919 PMC 9186530 · doi ↗ · pubmed ↗
- 7Nicholls SM, Quick JC, Tang S et al. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience. 2019;8:giz 043. 10.1093/gigascience/giz 043.31089679 PMC 6520541 · doi ↗ · pubmed ↗
- 8Overholt WA, Hölzer M, Geesink P et al. Inclusion of Oxford Nanopore long reads improves all microbial and viral metagenome-assembled genomes from a complex aquifer system. Environ Microbiol. 2020;22:4000–13. 10.1111/1462-2920.15186.32761733 · doi ↗ · pubmed ↗
