Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals
Ihor Kendiukhov

TL;DR
This paper identifies a biologically meaningful hematopoietic manifold inside the scGPT foundation model and shows how to extract and validate a compact, interpretable algorithm from it. The extracted algorithm is reported to outperform existing methods in pseudotime ordering and cell subtype classification.
Contribution
The paper introduces a three-stage extraction method that derives a compact, interpretable hematopoietic algorithm from scGPT, and validates it across multiple datasets and benchmarks.
Findings
The extracted algorithm achieves better pseudotime ordering than the benchmarked baselines.
It outperforms baseline methods on key cell subtype endpoints.
Mechanistic interpretability analysis reveals the core gene programs behind the algorithm.
Abstract
We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and…
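The three stages named in the abstract (operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: random matrices stand in for scGPT's frozen weights, the adaptor is a plain ridge regression, the readout is an argmax over class scores, and the names export_operator, fit_adaptor, and readout are hypothetical.

```python
import numpy as np

def export_operator(gene_emb, W_q, W_k):
    """Stage 1: build a fixed gene-by-gene operator from frozen weights."""
    q = gene_emb @ W_q
    k = gene_emb @ W_k
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def fit_adaptor(features, labels, n_classes, ridge=0.1):
    """Stage 2: small linear adaptor (ridge regression onto one-hot labels)."""
    targets = np.eye(n_classes)[labels]
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)

def readout(features, adaptor):
    """Stage 3: task-specific readout, here the highest-scoring class."""
    return np.argmax(features @ adaptor, axis=1)

rng = np.random.default_rng(0)
n_cells, n_genes, d_model, d_head, n_classes = 300, 40, 16, 8, 3

# Hypothetical stand-ins for frozen model parameters (NOT real scGPT weights).
gene_emb = rng.normal(size=(n_genes, d_model))
W_q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

# Synthetic expression matrix: each class has its own mean profile plus noise.
labels = rng.integers(0, n_classes, size=n_cells)
class_means = rng.normal(size=(n_classes, n_genes))
X = class_means[labels] + 0.5 * rng.normal(size=(n_cells, n_genes))

A = export_operator(gene_emb, W_q, W_k)   # frozen after export
features = X @ A.T                        # apply the operator to every cell

# Fit the adaptor on reference cells only, then apply it unchanged to held-out cells.
train, test = np.arange(200), np.arange(200, 300)
adaptor = fit_adaptor(features[train], labels[train], n_classes)
pred = readout(features[test], adaptor)
accuracy = float((pred == labels[test]).mean())
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The point of the layout is that only the adaptor has trainable parameters; the operator stays fixed after export and the held-out cells are scored without any refitting, which mirrors the "without target-dataset retraining" property in spirit only.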
Taxonomy
Topics: Single-cell and spatial transcriptomics · Cell Image Analysis Techniques · Domain Adaptation and Few-Shot Learning
