Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations
Ihor Kendiukhov

TL;DR
This study reveals that single-cell transformer models encode biological knowledge in a structured, interpretable geometric space, reflecting cellular organization, protein interactions, and gene regulation.
Contribution
It systematically decodes the spectral geometry of transformer representations, uncovering biologically meaningful axes and structures within the model.
Findings
Genes organized by subcellular localization along spectral axes
Intermediate layers transiently encode mitochondrial and ER compartments in an order that mirrors the secretory pathway
Model distinguishes transcription factors from target genes with AUROC 0.744
Abstract
Single-cell foundation models such as scGPT learn high-dimensional gene representations, but what biological knowledge these representations encode remains unclear. We systematically decode the geometric structure of scGPT internal representations through 63 iterations of automated hypothesis screening (183 hypotheses tested), revealing that the model organizes genes into a structured biological coordinate system rather than an opaque feature space. The dominant spectral axis separates genes by subcellular localization, with secreted proteins at one pole and cytosolic proteins at the other. Intermediate transformer layers transiently encode mitochondrial and ER compartments in a sequence that mirrors the cellular secretory pathway. Orthogonal axes encode protein-protein interaction networks with graded fidelity to experimentally measured interaction strength (Spearman rho = 1.000…
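As a rough illustration of what a "dominant spectral axis" separating two gene classes could mean in practice, the sketch below takes the top singular vector of a centered embedding matrix, projects each gene onto it, and scores the separation with AUROC. This is not the paper's pipeline: the embeddings are synthetic, the two-class labels stand in for something like secreted versus cytosolic, and the function names `dominant_axis_scores` and `auroc` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def dominant_axis_scores(embeddings):
    """Project each row (gene) onto the top singular vector of the centered matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Thin SVD; rows of vt are the spectral axes, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Pairwise comparison is O(n_pos * n_neg), which is
    fine for a small illustration.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))


# Synthetic stand-in for gene embeddings (assumption: 200 genes, 16 dimensions).
rng = np.random.default_rng(0)
n_per_class, dim = 100, 16
labels = np.repeat([True, False], n_per_class)

# Hidden direction along which the two classes are shifted apart.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
shift = np.where(labels, 2.0, -2.0)[:, None] * direction
embeddings = rng.normal(size=(2 * n_per_class, dim)) + shift

scores = dominant_axis_scores(embeddings)
raw = auroc(scores, labels)
# The sign of a singular vector is arbitrary, so report the orientation-free value.
separation = max(raw, 1.0 - raw)
print(f"AUROC along dominant axis: {separation:.3f}")
```

Because the sign of a singular vector is arbitrary, the sketch reports `max(raw, 1 - raw)`; which pole a given class lands on carries no meaning by itself.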
Taxonomy
Topics: Single-cell and spatial transcriptomics · Cell Image Analysis Techniques · Gene Regulatory Network Analysis
