TL;DR
This paper applies hyperbolic CNNs to genome modeling, outperforming equivalent Euclidean models on 37 of 42 benchmark datasets by capturing evolutionary structure without explicit phylogenetic mapping.
Contribution
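The work presents a novel application of hyperbolic CNNs to genomic sequences. The models outperform their Euclidean equivalents on most benchmarks and surpass state-of-the-art methods on seven GUE datasets, while using far fewer parameters than DNA language models and no pretraining.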
Findings
Outperforms Euclidean models on 37/42 genome datasets
Surpasses state-of-the-art on 7 GUE benchmark datasets
Introduces the Transposable Elements Benchmark dataset
Abstract
Current approaches to genomic sequence modeling often struggle to align the inductive biases of machine learning models with the evolutionarily-informed structure of biological systems. To this end, we formulate a novel application of hyperbolic CNNs that exploits this structure, enabling more expressive DNA sequence representations. Our strategy circumvents the need for explicit phylogenetic mapping while discerning key properties of sequences pertaining to core functional and regulatory behavior. Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark…
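The intuition behind using hyperbolic space for tree-structured data can be illustrated with a small sketch. The example below uses the Poincaré ball model purely for illustration; this is an assumption, and the paper's own models may use a different formulation of hyperbolic space. The function names and sample points are hypothetical. It compares the direct distance between two "leaf" points near the boundary with the distance of the path through a "root" at the origin.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


def euclidean_distance(u, v):
    """Ordinary straight-line distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical points: a "root" at the origin and two "leaves" on different branches.
root = (0.0, 0.0)
leaf_a = (0.95, 0.0)
leaf_b = (0.0, 0.95)

# Ratio of the direct leaf-to-leaf distance to the path through the root.
# A ratio close to 1 means the geometry behaves like a tree, where the only
# route between two leaves runs through their common ancestor.
hyperbolic_ratio = poincare_distance(leaf_a, leaf_b) / (
    poincare_distance(leaf_a, root) + poincare_distance(root, leaf_b)
)
euclidean_ratio = euclidean_distance(leaf_a, leaf_b) / (
    euclidean_distance(leaf_a, root) + euclidean_distance(root, leaf_b)
)

print(f"hyperbolic ratio: {hyperbolic_ratio:.3f}")
print(f"euclidean ratio:  {euclidean_ratio:.3f}")
```

In this sketch the hyperbolic ratio is noticeably closer to 1 than the Euclidean one, which is the sense in which hyperbolic space is often described as a natural fit for hierarchical data.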
Peer Reviews
Decision: ICLR 2025 Poster
Hyperbolic architectures are a promising avenue for training models on genomic sequences, since these sequences are inherently related through phylogenetic tree structures. This paper provides interesting initial results in this direction.
- How does the CNN/HCNN performance scale with the number of parameters? Figure 5 seems to indicate that performance gets worse with increasing hidden dimension. Additionally, Figure 5 should include error bars. The cumulative improvement y-axis is also unclear; it should just be the average MCC value.
- Figure 1 shows that sequences which obey a phylogenetic tree structure are embedded into a hyperbolic space which learns this structure. However, I do not see any results which indicate that the phylogene
1. The paper is well-written, with good background and clear visualizations.
2. The introduced Transposable Elements Benchmark is valuable to the research community.
3. The hyperbolic CNN demonstrates consistent improvements over the Euclidean CNN. It could serve as a baseline choice for genome classification models.
1. Comparing the number of parameters between CNN-based models and Transformer-based models is not convincing. Because the two architectures differ fundamentally, fewer parameters do not guarantee efficiency. A more suitable comparison would be the time and memory usage of training the models on the same dataset.
2. There is a lack of baselines. As a CNN-based model, the model should also be compared with HyenaDNA and Caduceus.
3. The empirical study is not well-presented. In
- The authors analyze an existing problem through a novel lens, suggesting a new possible inductive bias for an important problem.
- The authors perform a comprehensive evaluation across a wide variety of tasks.
- The authors propose a new dataset that can evaluate the ability of models to classify transposable elements. This class of sequences is widely abundant; however, its biological significance compared to sequences encoding mature RNAs is unclear.
- This work would benefit from additional justification for why the hierarchical tree structure is a good inductive bias for genomic data. Although the original data-generating process can take on a tree structure, is it necessarily the case that we would want to add this as an inductive bias to the model?
- Is there an experiment that the authors can do that would confirm the validity of this inductive bias? Do the authors find it strange that the synthetic task that was explicitly designed to te