Sparse autoencoders reveal organized biological knowledge but minimal regulatory logic in single-cell foundation models: a comparative atlas of Geneformer and scGPT
Ihor Kendiukhov

TL;DR
This study uses sparse autoencoders to analyze the single-cell foundation models Geneformer and scGPT. It finds that the models encode organized biological knowledge but little causal regulatory logic, and it makes the resulting feature atlases publicly available.
Contribution
First systematic application of sparse autoencoders to biological foundation models, uncovering their internal biological organization and minimal causal regulatory information.
Findings
Models encode rich biological pathways and modules.
Features show hierarchical and co-activation structures.
Limited causal regulatory logic detected in model features.
Abstract
Background: Single-cell foundation models such as Geneformer and scGPT encode rich biological information, but whether this includes causal regulatory logic rather than statistical co-expression remains unclear. Sparse autoencoders (SAEs) can resolve superposition in neural networks by decomposing dense activations into interpretable features, yet they have not been systematically applied to biological foundation models.
Results: We trained TopK SAEs on residual stream activations from all layers of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512), producing atlases of 82,525 and 24,527 features, respectively. Both atlases confirm massive superposition, with 99.8 percent of features invisible to SVD. Systematic characterization reveals rich biological organization: 29 to 59 percent of features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST,…
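To make the TopK SAE idea in the abstract concrete, here is a minimal NumPy sketch of a single forward pass: encode a dense activation vector, keep only the k largest pre-activations, and decode back. It is an illustration under stated assumptions, not the paper's implementation. The dictionary size `m`, the sparsity `k`, the random weights, and the function name `topk_sae_forward` are all hypothetical; only the width d=512 (scGPT) comes from the abstract.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One TopK sparse-autoencoder forward pass (illustrative sketch).

    x: (n, d) dense activations; W_enc: (d, m); W_dec: (m, d).
    Returns sparse codes (n, m) and reconstructions (n, d).
    """
    # Encoder pre-activations, after centering by the decoder bias.
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    # Keep only those k entries (clipped at zero); all others stay zero.
    codes = np.zeros_like(pre)
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    # Linear decoder maps the sparse codes back to activation space.
    recon = codes @ W_dec + b_dec
    return codes, recon

rng = np.random.default_rng(0)
d, m, k = 512, 2048, 32  # d from the abstract (scGPT); m and k are assumed
x = rng.standard_normal((4, d))  # stand-in for residual stream activations
W_enc = rng.standard_normal((d, m)) / np.sqrt(d)
W_dec = W_enc.T.copy()  # tied initialization, chosen here for simplicity
b_enc = np.zeros(m)
b_dec = np.zeros(d)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `codes` has at most `k` nonzero entries, which is what lets individual features be inspected and annotated one at a time. In training, the weights would be fit to minimize the reconstruction error between `recon` and `x`.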
Taxonomy
Topics: Single-cell and spatial transcriptomics · Pluripotent Stem Cells Research · Gene Regulatory Network Analysis
