Beyond the Alphabet: Deep Signal Embedding for Enhanced DNA Clustering
Hadas Abraham, Barak Gahtan, Adir Kobovich, Orian Leitersdorf, Alex M. Bronstein, Eitan Yaakobi

TL;DR
This paper introduces a deep neural network that directly embeds raw Nanopore sequencing signals for DNA clustering, improving accuracy and speed over traditional basecalling-based methods in DNA data storage.
Contribution
The work presents a novel deep signal embedding approach for DNA clustering that operates directly on raw sequencing signals, bypassing basecalling.
Findings
Superior clustering accuracy compared to traditional methods
Reduced computational time in DNA clustering process
Effective use of raw Nanopore signals for DNA storage applications
Abstract
The emerging field of DNA storage employs strands of DNA bases (A/T/C/G) as a storage medium for digital information to enable massive density and durability. The DNA storage pipeline includes: (1) encoding the raw data into sequences of DNA bases; (2) synthesizing the sequences as DNA strands that are stored over time as an unordered set; (3) sequencing the DNA strands to generate DNA reads; and (4) deducing the original data. The DNA synthesis and sequencing stages each generate several independent error-prone duplicates of each strand, which are then utilized in the final stage to reconstruct the best estimate for the original strand. Specifically, the reads are first clustered into groups likely originating from the same strand (based on their similarity to each other), and then each group approximates the strand that led to the reads of that group. This…
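The clustering step in the abstract groups reads by their similarity to each other. The sketch below is only an illustration of that idea, not the authors' method: it assumes every read has already been mapped to a fixed-length embedding vector, uses cosine similarity (the measure named in the reviews further down) as the similarity, and assigns reads greedily against one representative per cluster. The function names, the toy vectors, and the 0.9 threshold are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_cluster(embeddings, threshold=0.9):
    """Assign each embedding to the most similar existing cluster representative.

    A read joins the best-matching cluster if its similarity to that cluster's
    representative (the first read placed in it) is at least `threshold`;
    otherwise it starts a new cluster. Returns a list of lists of read indices.
    """
    representatives = []  # one embedding per cluster
    clusters = []         # read indices per cluster
    for idx, emb in enumerate(embeddings):
        best_cluster, best_sim = None, threshold
        for c, rep in enumerate(representatives):
            sim = cosine_similarity(emb, rep)
            if sim >= best_sim:
                best_cluster, best_sim = c, sim
        if best_cluster is None:
            representatives.append(emb)
            clusters.append([idx])
        else:
            clusters[best_cluster].append(idx)
    return clusters


# Toy data (hypothetical): reads 0 and 1 point one way, reads 2 and 3 another.
toy_embeddings = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.95, 0.0],
]
clusters = greedy_cluster(toy_embeddings, threshold=0.9)
print(clusters)
```

A greedy pass like this compares each read only against cluster representatives rather than against every other read, which is one common way to keep similarity-based clustering cheap; the quality of the result depends on the embedding and on the chosen threshold.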
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
This paper studies an important problem in DNA storage. It has long been an open question whether systems can utilize information in the raw signals to better improve the analysis and downstream use of Nanopore data. The authors show the effectiveness of taking a pre-trained model (Dorado) for Nanopore basecalling and repurposing it for a downstream application of the raw signals. This is pretty interesting because it is good to know the pre-trained model can be useful, and researchers do not h
* A major weakness of the approach seems to be that the model needs to be trained with an output dimension equal to the number of clusters. However, the number of clusters in real settings can be very large. E.g., for a stored file of 100 MB, there could easily be over 1M clusters. I don't think network training will scale well in this case. So the experiments on 500 clusters are not representative of a real setting. Concretely, can the authors address how their approach might
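To put the reviewer's scaling concern in rough numbers, the following back-of-the-envelope sketch counts the parameters of a single linear output layer whose width equals the number of clusters. The layer structure and the embedding width of 512 are assumptions made purely for illustration; neither is stated in the source.

```python
def linear_head_parameters(embedding_dim, num_clusters):
    """Parameter count of one dense layer mapping an embedding to per-cluster scores."""
    weights = embedding_dim * num_clusters
    biases = num_clusters
    return weights + biases


EMBEDDING_DIM = 512  # hypothetical width, not taken from the paper

small = linear_head_parameters(EMBEDDING_DIM, 500)        # experimental scale cited in the review
large = linear_head_parameters(EMBEDDING_DIM, 1_000_000)  # scale the reviewer expects for ~100 MB

print(small, large, large // small)
```

Under these assumptions the output layer grows linearly with the number of clusters, so moving from 500 to 1M clusters multiplies its size by 2000.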
Originality: The paper addresses a novel problem by introducing raw signal clustering in DNA storage, which could set a new direction for handling sequencing data. Relevance: With data storage demands growing exponentially, improving DNA storage efficiency and accuracy is highly relevant. Computational Efficiency: The use of deep embeddings and cosine similarity yields significant computational improvements, making the method scalable and suitable for high-throughput sequencing. Clarity in
Limited Generalizability Due to Custom Dataset: The dataset is highly customized, relying on specific design files and synthetic DNA sequences generated by a particular synthesis provider (Twist Bioscience). Since real-world DNA samples often include much higher variability, especially in natural genomic data, the results from this dataset may not generalize well to other applications or to DNA data with biological origins rather than synthetic sources. Fixed Threshold for Edit Distance (k=20)
- The core contribution is a model, based on the model in the Dorado basecaller, that performs clustering on these raw signals before basecalling. To my knowledge, this signal-based approach is novel and promising in the field of DNA storage. - The authors evaluate their method on several DNA datasets. Their results show that signal-based clustering outperforms existing methods in terms of both time and accuracy.
Major: - An important point that the authors should address is that it is uncertain whether the clustering gains translate into significant improvements in the final data retrieval phase. Minor: - Line 40: The description 'a "retrieval" stage where reads are decoded back to binary data files while correcting any errors using the chosen coding methods' is inconsistent with Figure 1, where the decoding is shown to occur after the retrieval stage. - Line 80: "edits" should be replaced by "substitu
The approach of commencing the analysis from raw Nanopore signals, rather than relying on pre-processed discrete DNA sequences, represents a novel direction in the field. Sequence clustering is a challenging problem, especially when dealing with a large number of sequences. The proposed method may have significant impact on the DNA storage community.
There are several weaknesses that the authors may need to address. 0. The authors should pay more attention to typos, grammar, etc., e.g., line 125: ", while those that are not are for apart. " 1. There are existing works that employ deep learning techniques to assess the similarity between sequences (DNA sequences), which are very closely related to this manuscript but are omitted from it, e.g. ``` 1. "Convolutional embedding for edit distance." SIGIR 2020 2. "Neural distance embeddings
Taxonomy
Topics: Gene expression and cancer classification · Fractal and DNA sequence analysis · Machine Learning in Bioinformatics
