What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses
Ihor Kendiukhov

TL;DR
This study investigates the geometric and topological structure that biological foundation models learn when processing single-cell gene expression. Large-scale hypothesis testing indicates that this structure is genuine, shared across models, and tissue-specific.
Contribution
It introduces an autonomous AI-driven framework for hypothesis screening and demonstrates that models learn meaningful, shared geometric structures with tissue-specific localization.
Findings
Models learn significant topological structures in gene embeddings.
The global shape of gene space is shared across models, but the placement of individual genes differs.
Robust signals are tissue-specific, especially in immune tissues.
Abstract
When biological foundation models such as scGPT and Geneformer process single-cell gene expression, what geometric and topological structure forms in their internal representations? Is that structure biologically meaningful or a training artifact, and how confident should we be in such claims? We address these questions through autonomous large-scale hypothesis screening: an AI-driven executor-brainstormer loop that proposed, tested, and refined 141 geometric and topological hypotheses across 52 iterations, covering persistent homology, manifold distances, cross-model alignment, community structure, and directed topology, all with explicit null controls and disjoint gene-pool splits. Three principal findings emerge. First, the models learn genuine geometric structure. Gene embedding neighborhoods exhibit non-trivial topology, with persistent homology significant in 11 of 12…
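The abstract mentions persistent homology tested against explicit null controls, but does not describe the pipeline. As a rough illustration of that idea only, the sketch below computes 0-dimensional persistence (the death times of connected components in a Vietoris-Rips filtration, which equal the edge lengths of a Euclidean minimum spanning tree) and compares total persistence against a coordinate-shuffle null. Everything here is an assumption made for illustration: the function names, the toy two-cluster "embedding", the choice of H0 rather than higher-dimensional homology, the shuffle null, and the test statistic are all hypothetical and are not taken from the paper.

```python
import math
import random


def h0_death_times(points):
    """Death times of H0 classes in a Vietoris-Rips filtration.

    These equal the edge lengths of a Euclidean minimum spanning tree,
    computed here with a simple O(n^2) Prim's algorithm.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from the tree to each point
    best[0] = 0.0
    deaths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the starting vertex contributes no edge
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths)


def shuffle_coordinates(points, rng):
    """Null control (assumed): permute each coordinate independently.

    This keeps every marginal distribution but destroys joint structure.
    """
    columns = [list(col) for col in zip(*points)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(row) for row in zip(*columns)]


def null_p_value(points, n_null=200, seed=0):
    """One-sided empirical p-value for total H0 persistence.

    A small value means the observed points are more compactly organised
    (shorter spanning tree) than the shuffled nulls.
    """
    rng = random.Random(seed)
    observed = sum(h0_death_times(points))
    null_stats = [
        sum(h0_death_times(shuffle_coordinates(points, rng)))
        for _ in range(n_null)
    ]
    hits = sum(1 for s in null_stats if s <= observed)
    return observed, (hits + 1) / (n_null + 1)


def make_two_clusters(n_per_cluster=20, seed=1):
    """Hypothetical toy stand-in for gene embeddings: two tight 2-D clusters."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0)]
    return [
        (cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1))
        for cx, cy in centers
        for _ in range(n_per_cluster)
    ]


if __name__ == "__main__":
    embedding = make_two_clusters()
    total_persistence, p = null_p_value(embedding)
    print(f"total H0 persistence = {total_persistence:.3f}, empirical p = {p:.4f}")
```

For real embeddings one would more likely use a dedicated library that also computes higher-dimensional homology; the pure-Python version is kept only so the example stays self-contained.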
Taxonomy
Topics: Single-cell and spatial transcriptomics · Genomics and Chromatin Dynamics · Bioinformatics and Genomic Networks
