Continuous, Low Latency Estimation of the Size and Shape of Single Proteins from Real-Time Nanopore Data
Yuanjie Li, Cuifeng Ying, Michael Mayer

TL;DR
A new method enables real-time analysis of nanopore data to instantly determine the size and shape of single proteins during experiments.
Contribution
The novel Two-Sliding Window algorithm allows real-time processing of nanopore data with millisecond response times for protein characterization.
Findings
The TSW algorithm processes data at 40 MB/s on a 1.8 GHz CPU, enabling real-time analysis.
The TSW algorithm accurately detects events with dwell times greater than 1.5 × (fc)−1.
Integration of a Naïve Bayes classifier allows real-time classification of protein mixtures for immediate feedback control.
Abstract
Existing approaches for nanopore sensing typically analyze resistive pulses from the translocation of individual proteins through the nanopore after completing the experiment. This approach foregoes instantaneous protein identification and precludes real-time experimental control. Here, we introduce a method for the analysis of real-time nanopore data capable of characterizing the size and approximate shape of proteins within a millisecond response time during data acquisition. The implemented real-time Two-Sliding Window (TSW) peak detection algorithm makes it possible, for the first time, to filter data, process baselines, and extract resistive pulse information during nanopore recordings using a stream of single data points. This approach achieves a computational throughput of 40 MB/s on a 1.8 GHz laptop CPU. We compared the accuracy of dwell time determination of the TSW algorithm…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
1
2
3
4
5
6- —Academy of Medical Sciences10.13039/501100000691
- —Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung10.13039/501100001711
- —Adolphe Merkle Foundation10.13039/501100003071
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNanopore and Nanochannel Transport Studies · Ion-surface interactions and analysis · Genomics and Phylogenetic Studies
Introduction
Nanopore-based resistive pulse sensing, inspired by the principle of the Coulter Counter for blood cell counting,? has shown remarkable progress in the characterization of unlabeled proteins in solution. ?−? ? ? ? ? ? ? ? ? Briefly, a thin insulating membrane separates two compartments of electrolyte solution, with a single nanopore serving as the only connection. Two Ag/AgCl electrodes apply a constant potential difference between the two reservoirs and measure the ionic current through the nanopore via a low-noise amplifier. ?,? The passage of an insulating particle through the nanopore transiently replaces the electrolyte solution, producing a resistive pulse. These resistive pulses, also called events, can reveal various parameters of the transiting particles, including size, shape, charge, dipole moment, and rotational diffusion coefficient. ?,?−? ? ?
Analyzing resistive pulse data usually consists of four steps: correction for drifts in the recorded current baseline, low-pass data filtering, peak detection, and peak analysis. ?−? ? ? ? ? ? ? ? Low pass-filters, such as the Gaussian filter ?,? or Butterworth filter, ?−? ? are usually employed to remove the high-frequency noise. Although peak detection methods vary between research groups, they generally rely on threshold algorithms to extract events from filtered data. ?,?,?,?,? A common approach uses a threshold of five times the standard deviation of the recorded current baseline, also called root-mean-square (RMS) to distinguish resistive pulses with a 99.99994% confidence level from baseline fluctuations due to random noise.? This approach starts with baseline detection using a moving average of over 100 to 30,000 data points to account for long-term drift and signal fluctuations. ?−? ? ?,? Several peak detection methods have been developed to improve the accuracy of event determination. Plesa et al.? developed an iterative baseline approach to refine the baseline. Gu et al.? introduced a second-order differential-based calibration (DBC) method to determine the dwell time and magnitude of blockades of resistive pulses. Furthermore, various fitting methods, such as Adaptive Time-Series Analysis (ADEPT),? Cumulative Sum (CUSUM), ?,? AutoStepfinder,? and Rissanen’s Minimum Description Length (MDL)? have been proposed or explored to analyze resistive pulses with short and long residence times or to detect step signals.
These established data analysis methods have in common that they typically require data to be collected over extended periods prior to analysis, thereby foregoing the possibility of real-time feedback for experimental control during recording. For example, estimating a protein’s shape requires resistive pulses with extended duration to sample various orientations. For instance, at a bandwidth of ∼ 50 kHz, we showed that resistive pulses should last for at least 150 μs to increase the probability that several protein orientations can be sampled.? Without knowing the number of captured resistive pulses that exceed this minimum-duration threshold during a recording, it is challenging to decide when to stop the experiment.? Wang et al.? developed an adapted finite impulse response filter for real-time threshold searching during data acquisition. However, the threshold searching algorithm usually required additional baseline correction and resistive pulse fitting steps to accurately extract the duration and amplitude of current blockades. These processing steps are not only time-consuming but also dependent on contextual data collected during the experiment. The high sampling rates during nanopore recordings, such as 500 kHz, and the recently developed parallel recordings? make low-latency, high-precision data analysis particularly challenging. Therefore, developing a robust, real-time processing method that enables accurate event detection during nanopore experiments is needed.
Here, we adopt a two-sliding window (TSW) algorithm to detect resistive pulses from translocation events accurately and instantly, making it possible to characterize the volume and approximate shape of proteins while they are still inside the nanopore or shortly thereafter. This TSW approach applies the 5 × RMS and the z-test? criteria to detect the start and end of resistive pulses and, hence, their duration and amplitude. We implemented the approach with incremental one-pass algorithms,? which enable real-time detection of resistive pulses with a time cost for computation linear to the number of data points. Using simulated square pulses, we demonstrate that the TSW algorithm performs as well as the established threshold searching algorithm (TS) for the detection of resistive pulses and provides improved accuracy in determining dwell times of long resistive pulses or resistive pulses with large intraevent current modulations. We implemented the TSW algorithm into the real-time protein characterization method and used it for experiments to determine the shape and volume of two natively folded proteins with nonspherical shape, Immunoglobulin G (IgG) and Thyroglobulin (Tg). We demonstrate that this approach achieves accurate estimates of protein volume and shape during the recording while the protein is still in the pore or less than 1 ms after exiting the pore. Furthermore, integrating a Naïve Bayes classifier into the TSW algorithm enables real-time classification of protein mixtures as well as low-latency voltage triggering for possible protein trapping or sorting in the future. This work, therefore, introduces a robust and low-latency method to estimate the shape and volume of individual, natively folded protein molecules while they translocate through a nanopore.
Baseline Determination and Time Requirement of the TSW Algorithm
Figure presents an overview of the analysis software (called PyDAQ, see Figure S1) that we developed to collect data and determine real-time estimates of the size and shape of proteins. As proteins translocate through a nanopore, the ionic current exhibits resistive pulses, each corresponding to the presence of a protein in the pore (FigureA). These detected resistive pulses, highlighted by the green vertical lines in FigureB, provide the opportunity to estimate the protein’s volume and shape in real time, based on the observation that nonspherical proteins produce orientation-dependent current modulations as opposed to spherical proteins, which produce resistive pulses that do not depend on protein orientation in the electric field (FigureC).? FigureD shows that as the number of detected resistive pulses from a solution of pure protein accumulates, the estimated shape and volume converge to a stable, accurate value after an accumulated resistive pulse time of 18 ms. The results demonstrate that extending the recording time of protein translocation beyond this time will bring no further benefit regarding the accuracy of volume and shape determination. This real-time feedback on protein parameters minimizes the duration of nanopore recording needed for accurate analysis, thereby improving the efficiency of sensing.
Estimating the volume and shape of proteins during nanopore data acquisition. (A) Schematic illustration of the nanopore recording setup and of the real-time data analysis approach for protein characterization. The electrolyte contains 2 M KCl with 10 mM HEPES buffered at pH 7.4. Two Ag/AgCl electrodes apply a potential difference of −100 mV across a nanopore with a diameter of 20 nm and a length of 30 nm (with negative polarity applied to the top). (B) Representative current recording of protein translocations through a nanopore (top), with resistive pulses detected instantaneously by the TSW algorithm as indicated by green pulses (bottom). (C) Principle of determining shape and volume of proteins from I min and I max. Top: translocation of different proteins with their shapes represented as a sphere (streptavidin), oblate (IgG), and prolate (Tg). Bottom: Representative current pulses generated from protein translocations and their corresponding histogram, where I min and I max are the minimum and maximum current blockades of the single resistive pulse. The ratio between the magnitude of I min and I max determines the shape m of proteins. Here, m is defined as the axis ratio b/a of an ellipsoid of revolution with semiaxes a, a, b; m < 1 corresponds to an oblate shape, while m > 1 indicates a prolate shape. (D) Estimation of shape (m) and volume (V) from the cumulative residence time of detected resistive pulses during recording for a protein modeled as an oblate shape IgG. The arrows represent that the estimated volume and shape stabilize at V ref = 332 nm3 and m ref = 0.46 after a cumulative residence time of 18 ms. Here, only the resistive pulses with dwell times greater than 150 μs were analyzed. Data were acquired at a sampling rate of 500 kHz and a bandwidth of 50 kHz. This measurements were performed independently of those in Figure C.
The approach for real-time protein characterization introduced here consists of four independent processes, which are illustrated in the flowchart in Figure. These processes include data acquisition, low-pass filtering with event detection, data visualization with data storage, and estimation of protein size and shape. The algorithm continuously acquires the data at the maximum sampling rate of the recordings of 500 kHz and temporarily stores them in a hardware buffer with a size of 10k samples. Simultaneously, the TSW algorithm performs low-pass filtering with a real-time Butterworth filter and peak detection on the data stream (right inset of Figure). Another computing process stores and visualizes the data, while a new process estimates protein size and shape according to the peak detection results of the TSW in real time.
Flowchart illustrating the process of real-time determination of the size and shape of proteins from nanopore recordings. Data is collected from a data acquisition card every 20 ms in each loop. Concurrently, data are processed in real time via a low-pass Butterworth filter and the TSW algorithm (inset). Detected events are visualized in real time and stored on disk by a different task process. Concurrently, within 20 ms, an independent thread estimates the protein parameters by continuously receiving the detected resistive pulses, calculating the shape and volume, and feeding this information back to a thread for visualization and file storage.
Briefly, the TSW algorithm uses two moving windows, w 1, and w 2, to accurately localize the start and end of a resistive pulse, where w 1 is a variable-length window for estimating baseline parameters, and w 2 is a fixed-length window for estimating parameters of resistive pulses. These windows update the median, mean, standard deviation, minimum, and maximum value for the measured current using established incremental one-pass algorithms ?−? ? (see Supporting Note 1). FiguresA and S2 visualize how a parameter called sigma level (n-sigma, red) and a parameter called *z-*score (z, blue) change over time during nanopore recordings. These two parameters facilitate the identification of the start and end of resistive pulses. The sigma level is the ratio between the amplitude of the resistive pulses and the standard deviation of the baseline (n-sigma = (I – I 0)/σ). The TSW algorithm detects the start of a resistive pulse when n-sigma is larger than 5, i.e., I – I 0 > 5σ, where σ is the moving standard deviation, I 0 is the average or median current, I is the current of pulses in window w 1. Although reducing the n-sigma value enhances sensitivity to low signal-to-noise ratio events, it also introduces a higher false positive rate (see Supporting Figure S3). Unlike the global noise used by our group and many others previously, the moving standard deviation used here can improve the adaptability of the peak detection algorithm in order to analyze unstable current baselines during experiments. To reduce the errors for the determination of the end of resistive pulses caused by single-point-based criteria, the TSW algorithm computes the z-score (z) for each data point and applies a z-test threshold (z < 1.96) to determine the end of resistive pulses, as described in eq.
Here, z is the z-score, , are the mean currents and σ_1_, σ_2_ are the standard deviations calculated in w 1, w 2. The values of N 1 and N 2 represent the initial size in terms of the number of data points within w 1 and w 2. Only data points with z-score of less than 1.96 (at the confidence level of 99.9%) indicate the end of a resistive pulse. We recommend setting N 1 within the range of fs/1000 to fs/10 and N 2 between 10 and 100, depending on the specific nanopore type and its noise characteristics. We also constrain the maximum and minimum values of the baseline within the window w 2 to remain within the range I 0 ± 5σ_1_ (Figure S2), which is useful to detect resistive pulses with non-Gaussian distribution of the blockade current, such as the current blockade from the translocation of nonspherical proteins.
Principle and time complexity of the TSW algorithm for real-time analysis of resistive pulses data. (A) Signal-to-noise ratio (red) and z-score (blue) as a function of time during the analysis of a resistive pulse. The red dashed line shows the reference value of n-sigma of 5, which has to be exceeded to detect the start of a resistive pulse. The blue dashed line shows the threshold z-score of 1.96 for detecting the end of a resistive pulse at a confidence level of 99.9%. (B) Experimental current recording of resistive pulses from protein translocations through a nanopore. The orange curve represents the baseline determined using the TSW algorithm, and the green curves indicate the identified resistive pulses using the TSW algorithm. (C) Zoomed-in region from B (light green shading) showing a detected short resistive pulse with start (red) and end (blue) points, along with an arrow indicating a detected long resistive pulse. (D) Time of CPU processing during analysis as a function of data acquisition time. Red circles represent the time consumption for data updated every 20 ms, with a sampling rate of 500 kHz. Spikes in CPU time are due to the process scheduling by the operating system. (E) Accumulated CPU time for processing a current trace with a duration of 20 s.
The determined baseline, represented by the orange curve in FigureB, agrees well with the expected baseline throughout the entire recording, even in regions with high event frequency or events that last for unusually long times (FigureC). In contrast, the baseline determined by the established threshold searching (TS) algorithm is strongly influenced by the event frequency and duration (see Figure S4). As opposed to the established TS with a fixed size of the moving analysis window, ?,?,? the TSW algorithm varies the window size during analysis: the window size increases or remains unchanged when there are no resistive pulses and stops updates of its size when a resistive pulse occurs. Therefore, the TSW algorithm can effectively eliminate the influence of resistive pulses on the baseline. In addition to an accurate determination of the baseline, the TSW algorithm minimizes the effects of the filter rise time on the determination of dwell times by defining the start of the rising edge as the start of a resistive pulse and the start of the falling edge as the end of a resistive pulse (FigureC). The start and end of resistive pulses detected by the TSW algorithm are comparable to those determined by the TS algorithm (see Figure S4). The TSW algorithm offers two optional approaches to calculate the baseline, moving median, and moving average (see Supporting Note 1, algorithm 1). These two approaches for baseline determination proceed with different speeds since the analysis time of the moving average is proportional to the number of data points, n, but the analysis time of the moving median is proportional to *n *log (w 1), where n is the number of points and w 1 is the size of the w 1 window. Although the moving median algorithm is slightly slower than the moving average algorithm, it provides a more accurate baseline determination due to its robustness against outliers. Besides, the program provides a convenient approach to calculate a reference current of target molecules and it includes a flush function to manually reset the baseline during the recording if an artifact has been identified. The ability to distinguish long resistive pulses from step-like baseline changes is a key limitation of the established TS method or other offline approaches. The real-time recording introduced here makes it possible to consider prior knowledge about the analyte to distinguish between pore-clogging and long resistive pulses during the recording and it enables online intervention and artifact removal while the recording is still ongoing. Offline analysis may reveal artifacts only after the recording has already been completed.
The computational efficiency of peak-finding algorithms is essential for achieving instant analysis. FigureD plots the CPU time required to process each 20 ms iteration as the data are continuously streamed into the analysis program. For protein detection, we collect data at a high sampling rate of 500 kHz using a data acquisition (DAQ) card with a specific buffer size of 10,000 data points. When the buffer accumulates 10,000 data points (every 20 ms), the TSW algorithm, instantly, processes the collected data. It is important to note that although we analyzed the data stream in each 20 ms loop, the TSW algorithm analyzes the data streams by receiving a single point as input, and therefore, the analysis proceeds as each data point is acquired and, hence, is made available and processed in real-time. Moreover, smaller buffer sizes than 10,000 data points are feasible when using a real-time operating system such as FreeROTS.? The results in FigureD demonstrate that the TSW algorithm processes every 20 ms segment of data within 1 ms on average, hence keeping up with the analysis of data while it is being recorded. FigureE suggests that the overall time consumption of the TSW algorithm scales linearly with the size of input data, corresponding to a time complexity of O(n). For processing a current trace (i.e., 20 s) on a laptop CPU with a base clock frequency of 1.8 GHz, the TSW algorithm only takes 0.87 s, corresponding to an analysis speed of 40 MB/s for the data stream. We note that changes in baseline window size, w 1, do not affect the processing speed of the TSW algorithm (see Figure S7). The results reveal that the time consumption of the TSW algorithm is influenced solely by the number of recording data with no significant impact from other parameter settings.
Comparison of the TSW Algorithm and the Threshold Searching
Algorithm for the Detection of Resistive Pulses
To evaluate the performance of the TSW algorithm for the detection of resistive pulses (see Figure) and to compare it with the commonly used TS algorithm,? we simulated various current blockades by adding Gaussian noise to ideal square wave pulses. We generated these blockades at a frequency of 1 Hz and a sampling rate of 500 kHz, followed by filtering with a Gaussian low-pass filter with a cutoff frequency of 50 kHz. We compare the performance of resistive pulse detection for two different scenarios: various dwell times with the same baseline noise and constant standard deviation of intraevent modulations (FigureA), versus constant dwell time and baseline noise but various standard deviations of intraevent modulations (FigureB).
Comparison of the TSW and TS algorithms for dwell time analysis of resistive pulses. (A) Ideal square waves of varying pulse width overlapped with their generated blockades (black). The green line represents the ideal square wave with input dwell times ranging from 10 μs to 128 ms and a current amplitude of 500 pA. The black curve represents the generated blockades by adding Gaussian noise with a standard deviation of 50 pA to the ideal square wave. The dwell time t d is defined between the detected start and end point (red and blue), identified by the TSW or TS algorithms. (B) Generated blockades with different intraevent modulations. The green shaded regions indicate the current baseline with a noise of 50 pA and the standard deviation of intraevent modulations varying from 50 to 330 pA. All the generated blockades have a sampling rate of 500 kHz, followed by low-pass filtering with a Gaussian filter with a cutoff frequency of 50 kHz. (C) Median dwell time evaluated by the TSW (blue) and TS (red) algorithms as a function of input dwell time. The black line represents perfect agreement between the input dwell time and the determined dwell time. Inset: The input dwell time of generated blockades varies from 10 to 200 μs. (D) Median dwell time evaluated by the TSW (blue) and TS (red) algorithms as a function of standard deviation of intraevent modulations. The black dashed line represents the input dwell time of 100 μs for all the generated blockades. (E) Relative error and precision of detected blockades from the TSW (blue) and TS (red). The parameter n total represents the overall number of events that are detected. The parameter n correct represents the detected events that belong to the ideal square waves. (F) Box chart of dwell time evaluated by the TSW (blue) and TS (red) from the generated blockades of varying standard deviation of intraevent modulations. The black dashed line represents the input dwell time of 100 μs for all the generated blockades.
When the standard deviation of intraevent modulation is the same as that of baseline noise, both the TSW and TS algorithms evaluate the dwell time accurately, as shown in FigureC. However, the TS algorithm fails to detect long blockades with dwell times greater than 60 ms, whereas the TSW algorithm consistently detects these long events with accurate dwell times. FigureE shows that compared to the TS algorithm, the TSW algorithm exhibits a 10% higher relative error of dwell time for the resistive pulses with dwell times shorter than 30 μs, indicating that the TSW algorithm may be slightly less reliable than the TS algorithm for resistive pulses with t_d_ < 1.5 × (fc)^−1^. This error arises from noise fluctuations, which are challenging to distinguish from short events. The TS algorithm employs an additional fitting step on the rising edge and falling edge to accurately recover the entire short resistive pulses. However, as long as t d > 1.5 × (fc)^−1^, the errors of the TSW algorithm are comparable to those of the TS algorithm. Additionally, the precision of the TSW algorithm is, on average, 25% better than that of the TS algorithm (FigureE, bottom). To assess if the peak detection algorithms have selectively chosen only high-quality events, we statistically compared the detection rate of the TS and TSW algorithms. Figure S8 shows that the TS algorithm exceeds detection rates of 100%, while the TSW algorithm remains close to 100%. These results indicate that the TS algorithm misinterprets some noise spikes as translocation events, whereas the TSW algorithm detects all actual translocation events. These results demonstrate the robustness of the TSW algorithm in handling noise, which we attribute to the benefit conferred by the *z-*test criterion on determining the end of resistive pulses.
Protein translocation events exhibit significant fluctuations within each resistive pulse, so-called intraevent modulations, as shown in FigureC. These modulations arise from the rotation of particles with nonspherical shape in a nanopore and can be used to estimate the shape of proteins.? To verify the effectiveness of the TSW algorithm in handling resistive pulses with large intraevent modulation, we generated a series of blockades with standard deviations of intraevent modulations ranging from 50 to 350 pA (FigureB). FigureD shows that as intraevent modulations increase, the median dwell time determined by the TSW algorithm closely aligns with the input dwell time of 100 μs. In contrast, the dwell time determined by the TS algorithm gradually deviates by approximately 50% from the reference value when the standard deviation of intraevent modulation exceeds 250 pA. The box plot in FigureF further illustrates that the dwell times detected by the TSW algorithm remain consistent with the reference dwell time of 100 μs across different standard deviations of intraevent modulations. In contrast, the detected dwell time evaluated by the TS algorithm exhibits a broader distribution than the dwell time distribution evaluated by the TSW algorithm, and this distribution becomes further distorted with increasing standard deviation of intraevent modulation. The TS algorithm struggles to identify the correct duration of resistive pulses in the presence of large intraevent modulations, as these modulations significantly diminish the determination of event endpoints based on the single-point criterion of the TS algorithm. In contrast, the TSW algorithm effectively extracts the complete blockades with these intraevent modulations by using a *z-*test. This capability is crucial for protein characterization, as it enables accurate extraction of all current modulations induced by protein rotations or conformational changes.?
Next, we explored the performance of the TS and TSW algorithms on simulated protein translocation data (Supporting Note 2, Figures S6 and S9). The TSW algorithm shows a lower standard deviation of 0.04% on determining ΔI/I 0 values compared to 0.08% when using the TS algorithm. We evaluated the agreement between reference dwell times and determined dwell times as detected by TS and TSW using the sum of squared residuals (SSR). The SSR of the TS and TSW algorithms were 4.8 × 10^–6^ and 17.0 × 10^–6^, respectively, suggesting that the TSW algorithm has a slightly larger error in dwell time determination for protein translocation data. This increased error is due to the reduced accuracy of the TSW algorithm when determining dwell times of very short resistive pulses with dwell times close to the bandwidth limitation (t d ≤ 1.5 × (fc)^−1^). However, this limitation does not affect the characterization of individual proteins, as only events with dwell times greater than 2.5 × (fc)^−1^ were considered in the estimation of protein shape and volume. ?,? The results reveal that the detection of resistive pulses by the TSW algorithm performs as well as the previously employed TS algorithm for events with dwell times greater than 1.5 × (fc*)* ^–1^, while providing a 0.04% improvement in the standard deviation of ΔI/I 0 values compared to the TS algorithm under these conditions.
Instantaneous Determination of the Size and Shape of Single
Proteins and Their Classification
With instantaneous and accurate identification of resistive pulses from the incoming stream of data, we explored the application of the TSW algorithm for determining the shape and volume of proteins from experiments. Previously, we demonstrated the estimation of shape, volume, and other parameters of proteins from their free translocation through nanopores.? Due to bandwidth limitations when applying a low-pass filter with a cutoff frequency of 50 kHz, only resistive pulses longer than 50 μs can be used for population-based analysis,? and only resistive pulses longer than 150 μs are suitable for the analysis of protein shape from individual resistive pulses. Obtaining a sufficient number of long events for precise protein characterization using conventional offline peak detection methods remains challenging, and the required recording time is typically estimated empirically to be on the order of several minutes. The TSW algorithm offers the flexibility to terminate data acquisition once a sufficient number of analyzable events has been collected.
To test the usefulness of the TSW algorithm for real-time estimation of protein shape and volume, we conducted translocation experiments with Immunoglobulin G (IgG, oblate shape, m = 0.4) and thyroglobulin (Tg, prolate shape, m = 2.1) using solid-state nanopores with diameters of 25 and 35 nm and with a coating of PMOXA polymer. Following the flowchart in Figure, we detected translocation events and determined the shapes and volumes of the two separate proteins in real time using PyDaq (Figure). FigureA,B depict the current recording of IgG and Tg translocating through nanopores, respectively, with detected resistive pulses indicated by orange dots. The passage of individual IgG or Tg proteins resulted in characteristic resistive pulses, as indicated at the bottom of FigureA,B. To estimate the shape and volume of the proteins, we fit the cumulative distribution of the current within individual resistive pulses whose t d values exceeded 150 μs to a convolution model developed by Yusko et al.? FiguresC,D display the estimated length-to-diameter ratio (m) and volume (V) for IgG and Tg as a function of the cumulative residence time of detected resistive pulses. When the cumulative residence time of resistive pulses is less than 20 ms, the determined shape and volume of IgG fluctuate, with m between 0.2 and 0.8, and V between 200 and 360 nm^3^. These values then converge to m = 0.5 and V = 350 nm^3^ (m ref = 0.46, V ref = 332 nm^3^) once the cumulative number of resistive pulses exceeds 20 ms with a total of ∼50 resistive pulses (FigureC). The estimated shapes and volumes of Tg show larger uncertainty, with m ranging from 1.2 to 3.1 and V ranging from 500 to 3000 nm^3^ during the initial 40 ms, corresponding to the first 58 resistive pulses. Subsequently, these estimates converge to a value of m = 2.1 and V = 1500 nm^3^, which is close to the expected reference values of m ref = 1.9 and V ref = 1247 nm^3^. (FigureD). These results highlight that the estimated shape and volume remain unchanged over time once the cumulative residence time reaches a specific threshold. The presence of impurities in protein samples may prolong the time to establish a stable estimated value for shape and volume since the convergence of the estimated values is influenced by the cumulative residence time of resistive pulses from the protein and from the impurities. The TSW algorithm demonstrates its ability to extract events during data acquisition and is compatible with protein characterization methods for shape and volume estimation during experiments.
Real-time estimation of protein length-to-diameter ratio and volume from nanopore recordings using the TSW algorithm. (A, B) Representative current recordings of IgG (A) or Tg (B) translocations through nanopores, along with three examples of individual resistive pulses. Orange circles represent resistive pulses detected by the TSW algorithm. (C, D) Real-time measurements of the length-to-diameter ratio m, and volume V for IgG (C) and Tg (D), plotted as a function of cumulative residence time. The blue and red dotted lines mark the reference length-to-diameter ratio m (IgG = 0.46, Tg = 1.9) and volume V (IgG = 332 nm3, Tg = 1247 nm3) as estimated by minimum volume enclosed ellipsoids fitting based on solvent-excluded surface. Only resistive pulses with dwell times greater than 150 μs were included in the cumulative residence time and used for the shape and volume estimation. The overall recording times were 80 s for IgG and 210 s for Tg. (E) Crystal structures of IgG (PDB: 1HZH) and Tg (PDB: 6SCJ) overlaid with their best-fit ellipsoidal models (orange shadow).
Building on the capability of the TSW algorithm for instantaneous and accurate identification of resistive pulses, we integrated a Naïve Bayes Classifier (NBC)? into the TSW framework (TSW-NBC, Supporting Note 3 and Figure S10) to explore real-time protein classification and instantaneous feedback control. We evaluated the classification performance of the TSW-NBC using simulated translocation traces from a 'mixture' of five-proteins (FigureA). This mixture was designed so that three proteins shared the same volumes and three shared the same shapes (Supporting Figure S11). The resulting precision and recall matrices demonstrated robust performance of the classifier, with average values exceeding 90% across all classes (FigureB,C). Although the theoretical probability density function of the intraevent current of these five simulated proteins has a strong overlap (see Supporting Figure S11), the TSW-NB classifier nonetheless distinguished them with an accuracy of 97%. Notably, due to the computational simplicity of NBC, the overall time consumption of TSW-NBC increases only marginally (<3%) compared to the TS algorithms, thereby preserving the high performance of the original TSW algorithm (Supporting Figure S12).
Real-time classification of simulated protein mixtures using TSW-NBC. (A) A representative 20-s current trace of simulated translocations of five proteins in a mixture through a nanopore with a diameter of 20 nm and a length of 30 nm. The red line represents the actual position of simulated resistive pulses. For a realistic representation of the simulated currents, experimental noise was added. These resistive pulses are detected in real time and indicated with five differently colored triangles representing the class labels. Proteins are defined by shape (m) and volume (V): (1) m = 0.3, V = 150 nm3, (2) m = 0.5, V = 150 nm3, (3) m = 0.8, V = 150 nm3, (4) m = 0.8, V = 225 nm3, and (5) m = 0.8, V = 300 nm3. (B, C) The precision (B) and recall (C) matrices were calculated from classification results, quantifying the classification performance of the TSW-NBC algorithm for each category. The accuracy of this classification is 97%. See Supporting Video S1 and Figure S13 for the program’s runtime performance.
We implemented the Naïve Bayes Classifier during data acquisition by coupling the TSW-NBC with PyDAQ as a new submodule (Supporting Figure S13). Furthermore, we demonstrated the potential for low-latency triggering of an applied voltage based on the resulting classification labels, enabling immediate physical responses (see Supporting Figure S13 and Video S1).
A key advantage of this approach lies in the flexibility of the NBC parameters, which can be derived either experimentally from pure protein samples or theoretically from the probability density function of the magnitude of current blockade levels based on established physical models and protein crystal structures (i.e., data-driven versus physics-driven). This physics-driven approach is particularly valuable for analyzing samples that are highly heterogeneous and difficult to separate, such as amyloid oligomers. This new application illustrates one essential step toward single-molecule sorting as it provides instantaneous particle characterization while the particle is still in the nanopore or shortly after it exits the pore. This information may be combined with fast electronics that are capable of low-latency triggering of an applied voltage to sort particles in the future.
Conclusion
In conclusion, the work presented here introduces a real-time data analysis method for characterizing the shape and volume of proteins during data acquisition, providing instantaneous results within a few milliseconds and therefore before the next resistive pulse is recorded. The two sliding window (TSW) algorithm introduced here for analysis of resistive pulses enables instant event analysis with a constant speed of 40 MB/s using a CPU of 1.8 GHz base clock frequency, while the data acquisition had a speed of 2 MB/s at 500 kHz sampling rate. Using a variable-length window, the TSW algorithm determines the baseline in real time and remains unaffected by high event frequency, large intraevent modulations, or long-lasting resistive pulses. This algorithm provides more accurate baseline values compared to those determined by the threshold searching (TS) algorithm. Additionally, we incorporate the z-test criterion to determine the end of resistive pulses in the TSW algorithm, enhancing the robustness of peak detection for resistive pulses with large current modulations. Using the TSW algorithm, we developed an instantaneous protein characterization method enabling real-time estimation of the shape and volume of proteins during nanopore experiments. Furthermore, the integration of a Naïve Bayes classifier enables real-time protein classification with 97% accuracy based on the test of our simulation data sets. Crucially, this approach paves the way for a new application of single-molecule sorting through the real-time triggering of reverse bias upon classification. This method, hence, has the potential for real-time feedback control and enhanced storage efficiency, and may, in the future, enable sorting of individual molecules based on instantaneous analysis of the incoming data stream during their translocation through a nanopore.
Materials and Methods
Materials
The polymer poly(acrylamide)-g–PEG-PMOXA for coating solid-state nanopores was obtained from SuSoS AG, Switzerland. Thyroglobulin, human (T6830–1MG) was purchased from Sigma-Aldrich. Immunoglobulin G, human plasma (340–21) was purchased from Lee Biosolutions. Anotop 10 mm Syringe Filters with 20 nm pore size (6809–1002) were purchased from Fisher Scientific. Nanopores with a diameter of 25 or 35 nm in the freestanding 30 nm-thick silicon nitride membrane were ordered from Norcada Inc., Canada.
Surface Coating
Nanopores were coated using Poly(acrylamide)-g–PEG-PMOXA, as described by Awasthi et al.? After oxygen plasma cleaning (Diener Nano, 30% power, O_2_ atmosphere, 0.3 mbar) for 30 s on each side, the nanopore chips were immersed in 1 mg/mL Poly(acrylamide)-g–PEG-PMOXA polymer in a buffer solution (1 mM HEPES pH 7.4) and incubated for 1 h at room temperature. These chips were rinsed with ultrapure water three times and dried under a stream of N_2_ gas.
Electrical Recordings
We used Ag/AgCl pellet electrodes (Warner Instruments) to monitor the ionic currents during nanopore experiments using a patch-clamp amplifier (AxonPatch 200B, Molecular Devices) in voltage clamp mode with a 100 kHz lowpass Bessel filter. The data were acquired using a data acquisition card (NI PCI 6281, National Instrument) and our custom control software, PyDAQ, at a sampling rate of 500 kHz. Nanopore resistance was calculated by the slope of the ionic current at various applied voltages in the range of ±0.5 V. All experiments were performed in a recording buffer consisting of 2.0 M KCl, 10 mM HEPES, pH 7.4. This recording buffer was filtered by a membrane filter with a pore size of 20 nm prior to use.
Determination of Protein Volume and Shape
We converted the PBD structure of proteins to the solvent-excluded surface? using our custom depth-first search algorithm. Protein volumes were then calculated based on the solvent-excluded surfaces, employing a water probe with a diameter of 0.28 nm. To estimate the ellipsoid parameters (a, a, b), we used the minimum volume enclosing ellipsoids (MVEE) algorithm,? which optimizes the semiaxes based on the solvent-excluded surface to determine the ellipsoid shape. We provide an executable file to perform the shape and volume fitting for the PDB file upon request.
Computing Platform and Software
The TS algorithm is provided from a MATLAB script based on the method proposed by Pedone et al.? We wrote the TSW algorithm with C*++* and compiled it to a dynamic link library for usage in Python-based data acquisition software (PyDAQ, Figure S1). The control software, PyDAQ, was written with Python based on a Python library nidaqmx (National Instruments) and has been highly optimized for visualization. It includes a fast real-time algorithm for down-sampling that is integrated into the TSW algorithm, as well as image reuses, and multi-thread rendering. We implemented the simulation of ellipsoid translocations with C++ as discussed in Supporting Note 2. The simulation program and the TSW algorithm were tested on a laptop equipped with a 1.8 GHz Intel Core i5 CPU and 16 GB of memory. The associated software is available on our GitHub or upon request.
Supplementary Material
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Coulter, W. H. Means for Counting Particles Suspended In A Fluid. US Patent US 2656508 A, 1953.
- 2Yusko E. C.Bruhn B. R.Eggenberger O. M.Houghtaling J.Rollings R. C.Walsh N. C.Nandivada S.Pindrus M.Hall A. R.Sept D.Li J. L.Kalonia D. S.Mayer M.Real-time shape approximation and fingerprinting of single proteins using a nanopore Nat. Nanotechnol.201712436036710.1038/nnano.2016.26727992411 · doi ↗ · pubmed ↗
- 3Houghtaling J.Ying C. F.Eggenberger O. M.Fennouri A.Nandivada S.Acharjee M.Li J. L.Hall A. R.Mayer M.Estimation of Shape, Volume, and Dipole Moment of Individual Proteins Freely Transiting a Synthetic Nanopore ACS Nano 20191355231524210.1021/acsnano.8b 0955530995394 · doi ↗ · pubmed ↗
- 4Lucas F. L. R.Versloot R. C. A.Yakovlieva L.Walvoort M. T. C.Maglia G.Protein identification by nanopore peptide profiling Nat. Commun.2021121579510.1038/s 41467-021-26046-934608150 PMC 8490355 · doi ↗ · pubmed ↗
- 5Waduge P.He R.Bandarkar P.Yamazaki H.Cressiot B.Zhao Q.Whitford P. C.Wanunu M.Nanopore-Based Measurements of Protein Size, Fluctuations, and Conformational Changes ACS Nano 20171165706571610.1021/acsnano.7b 0121228471644 · doi ↗ · pubmed ↗
- 6Schmid S.Stommer P.Dietz H.Dekker C.Nanopore electro-osmotic trap for the label-free study of single proteins and their conformations Nat. Nanotechnol.202116111244125010.1038/s 41565-021-00958-534462599 · doi ↗ · pubmed ↗
- 7Sandler S. E.Horne R. I.Rocchetti S.Novak R.Hsu N. S.Cruz M. C.Brotzakis Z. F.Gregory R. C.Chia S.Bernardes G. J. L.Keyser U. F.Vendruscolo M.Multiplexed Digital Characterization of Misfolded Protein Oligomers via Solid-State Nanopores J. Am. Chem. Soc.202314547257762578810.1021/jacs.3c 0933537972287 PMC 10690769 · doi ↗ · pubmed ↗
- 8Soni N.Freundlich N.Ohayon S.Huttner D.Meller A.Single-File Translocation Dynamics of SDS-Denatured, Whole Proteins through Sub-5 nm Solid-State Nanopores ACS Nano 2022167114051141410.1021/acsnano.2c 0539135785960 PMC 7613183 · doi ↗ · pubmed ↗
