The cell as a token: high-dimensional geometry in language models and cell embeddings

William Gilpin

PMC · DOI:10.1093/bioinformatics/btaf595·October 30, 2025

The cell as a token: high-dimensional geometry in language models and cell embeddings

William Gilpin

PDF

Open Access

TL;DR

This paper compares how language models and cell embeddings use high-dimensional spaces, suggesting insights from language models can improve cell analysis.

Contribution

It introduces a novel perspective by linking language model advancements to single-cell data analysis techniques.

Findings

01

Token context influences embedding space geometry in both language and cell data.

02

Low-dimensional manifolds affect the robustness and interpretation of embedding spaces.

03

Language model techniques like interpretability probes can enhance virtual cell model training.

Abstract

Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently-proposed virtual cell models extend this concept, enriching cells’ representations based on patterns learned from pretraining on vast cell atlases. This review explores how advances in understanding the structure of natural language embeddings informs ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space’s robustness and interpretation. We highlight how new developments in foundation models for language, such as interpretability probes and in-context reasoning, can inform efforts to construct cell atlases and train virtual…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes2

CD4 CD8A

Proteins2

Species3

Danio rerio(leopard danio · species)Homo sapiens(human · species)Mus musculus(house mouse · species)

Diseases1

inflammation

Figures3

Click any figure to enlarge with its caption.

Low-rank structure in high-dimensional embeddings. (A) An embedding of the full text of the novel Blood Meridian (McCarthy 1985) using a word2vec model originally trained on a dataset of 1011 words drawn from Google News articles (Mikolov et al. 2013a). Vectors are clustered using K-means partitioning, and then summarized into metagroups with a topic embedding model (colors and annotations). (B) An embedding of 6×104 human peripheral blood mononuclear cells based on single-cell RNA sequencing of 1.6×104 genes. Colors correspond to immune cell subtypes, as determined by marker genes for characteristic cell surface proteins like CD4, CD8, etc.

Analogies and low-dimensional manifolds. (A) Embeddings of particular sequences of tokens using the model of Fig. 1, with examples of escalating manifolds (red and blue lines), which overlap in regions with similar meaning (weak polysemy). A token with strong polysemy appears at an intermediate location (purple circle). An example of an analogy relationship encoded as nearly-congruent difference vectors (turquoise arrows). While nonlinear embedding methods like UMAP distort the local metric over large scales (Chari and Pachter 2023), the nearby position of the two analogy vectors’ heads and tails protects their congruency. (B) RNA Velocity applied to developing endocrine cells in the pancreas (Bastidas-Ponce et al. 2019; La Manno et al. 2018). Vectors correspond to development direction, and color corresponds to pseudotime assigned via diffusion components. Cell types along the differentiation axis are overlaid.

Mechanistic interpretability in single-cell foundation models. (A) Common architectural features and target tasks for single-cell foundation models. (B) Mechanistic interpretability methods for single cell embeddings. (Left) Intrinsic dimensionality may be calculated directly from expression profiles, or from internal activations of the model. Inset shows the intrinsic dimensionality of staged expression profiles from developing mice. Panel adapted from Ref. (Biondo et al. 2024). (Right) Sparse autoencoders are trained in an unsupervised manner to reconstruct internal activations of foundation models, by mapping activations to sparse combinations of features in a latent dictionary. Inset shows application of sparse autoencoders to the activations of the Universal Cell Embedding model on a dataset of human bone marrow. The left subpanel corresponds to annotated cell types, while the right corresponds to the decoding of a single latent unit. Panels adapted from Ref. (Schuster 2024).

Funding4

—Chan Zuckerberg Initiative DAF
—Silicon Valley Community Foundation10.13039/100000923
—NSF DMS
—NSF CMMI

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsCell Image Analysis Techniques · Cellular Mechanics and Interactions · Single-cell and spatial transcriptomics

Full text

1 Introduction

Modern single-cell experiments decompile the cell—abstracting it away from its squishy context, and rendering it as a single point in a high-dimensional vector space. Computational workflows attempt to invert this process: spatial transcriptomics recovers information about a cell’s position, while lineage tracing reconstructs developmental stages. Recent efforts to construct virtual cells—massive machine learning models built upon language model architectures—represent the next step of this process. Trained on vast amounts of genomic data, these models aim to provide richer, more informative representations than the raw count matrices produced by single cell experiments.

How do we know if the representations learned by virtual cell models are meaningful? If a single-cell embedding exactly matches known regulatory and developmental mechanisms, an embedding space may be accurate yet uninformative for making new discoveries. Conversely, if this space fails to recapitulate known relationships, it is difficult to attribute this to novelty or inaccuracy. A similar problem arises in statistical learning. Large language models are trained on vast, unannotated volumes of text. To represent diverse text sources consistently, these models initially split input text into discrete tokens: minimal units consisting of words or word fragments. They then convert these tokens into vectors in a high-dimensional space, a representation that enables further processing by modern, continuously-valued learning models.

The success of language models stems, in part, from the ability of language embeddings to accurately encode syntactic and semantic structure in high-dimensional spaces. The unique properties of high-dimensional geometry allow embeddings to effectively encode the semantic structure of language along low-dimensional manifolds, mirroring findings from single-cell biology in identifying developmental pathways and rare cell types. What insights can single-cell embeddings gain from the statistical learning community? Here, we review recent developments and commonalities between these fields, highlighting general principles of language embeddings that may inform ongoing work in single-cell genomics.

2 Context shapes the geometry of embeddings

Modern large language models owe their scale to self-supervised training, which obviates the need to collect expensive labeled training data. Given a sequence of words, ”The parliamentarian led the assembly.”, a single word is masked and the model is trained to fill in the blank: ”The [TOK] led the assembly.” When trained at scale, models learn to group certain words (e.g. ”leader,” ”president,” ”speaker”) that appear in similar contexts. Theoretical motivation for context masking comes from the distributional hypothesis, which equates distances between vector representations of different words in embedding space, with distances between distributions of co-occuring tokens within the training corpus (Harris 1954; Firth 1957). The distributional hypothesis typically describes co-occurrence statistics: words like ”president” or ”parliamentarian” often appear with similar other words. However, the distribution may be further conditioned on language type, historical era, domain-specific register, or other latent variables that modulate word usage. In systems biology, the distributional hypothesis motivates efforts to train self-supervised foundation models from single cell data, often termed ”virtual cells” (Bunne et al. 2024). An implicit assumption of such approaches is that models can learn informative, predictive knowledge purely from training to be self-consistent. Such approaches inherently invoke a distributional hypothesis—that cells occurring in the same tissues, interactions, or regulatory roles ought to retain that similarity when represented in a single-cell workflow, and that this similarity can be exploited for self-supervised training.

Predating modern large language models, word2vec language embeddings introduced an early notion of context to word representations. During training, word2vec directly optimizes an objective function motivated by the distributional hypothesis, producing an embedding that maximizes the posterior probability of word-context pairs seen in the corpus, while minimizing the probability of randomly-generated pairs (Mikolov 2013). This approach represents contrastive learning, which allows an embedding space to be constructed for data that otherwise lacks a well-defined distance metric. A typical fully-trained word2vec model maps each of $[eqn]$ distinct words to a point in a 300-dimensional continuous vector space. Because the distributional objective is optimized only during training on the text corpus, word2vec produces static embeddings: after training, any appearance of a given token always maps to the same point in embedding space.

Cell gene expression profiles lack an an obvious distance metric, and the results of computational workflows like cell type clustering vary depending on the choice of cell-cell distance metric such as Euclidean distance, correlation, or t-statistic (Ji et al. 2023). Raw expression profiles are typically context-independent. After isolation, sequencing, and demultiplexing, a cell becomes a collection of RNA transcripts, each of which may be considered a vector approximating the transcript counts per gene per cell (Stuart and Satija 2019). The expression levels of each gene thus uniquely determine the embedding, decoupling a given cell’s representation from others. Thus, in principle, raw count data do not invoke the distributional hypothesis: a cell’s embedding is an innate property, rather than a property relative to a corpus of cells. Many preprocessing schemes applied to count matrices—such as batch or cell cycle correction—enforce static, context-free structure in embedding space (Korsunsky et al. 2019). However, data reduction methods like principal component analysis for visualization, or unsupervised clustering for cell type identification, produce context-dependent representations that depend on relative differences among cells. Context-dependence also arises when multiple datasets are merged, or when end-to-end embedding models are trained across many datasets. However, such approaches stop short of invoking the distributional hypothesis, because they do not enforce a notion of context tied to co-occurrence statistics. In contrast, the extensive pretraining used in modern single-cell foundation models aims to learn a distance metric among expression profiles based on statistical patterns in expression across the training data (Heimberg et al. 2025).

3 The geometry of embedding spaces

Theoretical analysis of word2vec and its variants shows that these methods, in practice, factorize a matrix representing the mutual information between the distribution of each token across the corpus, and the distribution of its context (Fig. 1A) (Levy and Goldberg 2014). Linguistic structures, such as synonym clusters, lead to low-rank structure in this matrix (Dhillon et al. 2015; Allen et al. 2019), similar to the low-rank structure that forms in single-cell count matrices due to statistical similarities in the expression vectors of cells belonging to the same type (Fig. 1B) (Nitzan et al. 2019). Low-dimensional manifolds in single-cell embeddings typically arise from highly-coordinated biological processes, such as differentiation, which exhibit predominantly deterministic dynamics. In language embeddings, low-rank structure arises due to overparametrization—the highest-rank word-context matrix would simply represent an isotropic Gaussian distribution. Anisotropy in high-dimensional embeddings thus indicates structure in the underlying generative process, whether linguistic or biological.

Low-rank structure in high-dimensional embeddings. (A) An embedding of the full text of the novel Blood Meridian (McCarthy 1985) using a word2vec model originally trained on a dataset of 1011 words drawn from Google News articles (Mikolov et al. 2013a). Vectors are clustered using K-means partitioning, and then summarized into metagroups with a topic embedding model (colors and annotations). (B) An embedding of 6×104 human peripheral blood mononuclear cells based on single-cell RNA sequencing of 1.6×104 genes. Colors correspond to immune cell subtypes, as determined by marker genes for characteristic cell surface proteins like CD4, CD8, etc.

A key limitation of static language embeddings stems from polysemy, in which the same token has multiple meanings (Garí Soler and Apidianaki 2021; Liu et al. 2020). For example, ”bank” may refer to the side of a riverbed, or to a financial institution. Static word embeddings like word2vec place polysemous tokens at intermediate positions in embedding space, between positions associated with their divergent meanings. Such compromises distort and curl embedding space, reducing the space’s ability to represent large-scale structure by making distances between vectors less meaningful (Jakubowski et al. 2020; Neelakantan et al. 2014; Goel et al. 2022). As a result, static embedding models tend to underestimate differences among strongly-distinct concepts, limiting their ability to recognize hierarchies among words (Nickel and Kiela 2017). In gene expression, curvature due to polysemy may arise due to biological processes, rather than artifacts. Cellular differentiation datasets exhibit low curvature in regions associated with stereotyped cell states, punctuated by high-curvature regions associated with transitions (Sritharan et al. 2021). These transition states, such as differentiating stem cells, occupy intermediate locations in embedding space (Wang et al. 2020). However, spurious polysemy can also arise due to technical errors, leading to unresolved cell subtypes or tissue groups. This effect becomes more pronounced at low read depth, resulting in missing genes or greater sampling error in counts. For example, blood vascular endothelial cells share relatively consistent transcriptional profiles, due to their similar structural roles across different tissues (Kalucka et al. 2020). Endothelial cells from different tissues often map to the same area in embedding space, despite their anatomical separation. Resolving this ambiguity either requires additional marker genes and greater sequencing resolution, or assays that barcode transcripts with additional information. For example, cell painting produces a high-dimensional vector of morphological features extracted from fluorescence microscopy (Bray et al. 2016), while CITE-Seq augments each transcript with information about cell surface proteins—thus avoiding cellular polysemy (Stoeckius et al. 2017).

Contemporary language models use dynamic token embeddings, in which a given token’s embeddings varies based on its context after training (Devlin et al. 2019; Liu et al. 2020; Reisinger and Mooney 2010). The standard mechanism, self-attention, combines a token’s static representation, neighboring context tokens, and a positional encoding (Vaswani et al. 2017). The resulting joint representation thus varies even after training when it appears in new contexts. Thus, while static embeddings associate each token with a single embedding point, dynamic embeddings map each token to a cloud of points capturing the diverse contexts in which it appears. The distance between the same token in different contexts is smaller than the distance between tokens, consistent with low-dimensional, anisotropic structure (Ethayarajh 2019).

In large-scale gene expression datasets like cell atlases, dynamic cell embeddings improve the global structure of representations. Spatial transcriptomics augments each transcript with information about the cell’s absolute spatial position, or relative position among neighboring cells. As a result, embeddings learned by these methods encode an underlying metric, and so both local and global distances are meaningful (Tian et al. 2023; Nitzan et al. 2019). More abstract context information, such as organ group or tissue annotations, improves embeddings by disambiguating similar transcriptional profiles arising in distinct contexts (Xu et al. 2021). Conceptually, these approaches resemble language models that combine tokenization with queries to an external database that provides richer context (Borgeaud et al. 2022; Khandelwal et al. 2020). This can include structured sources of information about tokens, like encyclopedias or human-curated concept maps (Speer et al. 2017; Zhang et al. 2019). The resulting models capture global relationships among concepts, without requiring substantial additional training. For single-cell data, similar approaches enrich transcript information with tissue or preexisting cell type annotations (Brbić et al. 2020; Lin et al. 2022; Lotfollahi et al. 2022), gold-standard experimental associations or transcription factors (Lee et al. 2024), gene ontologies (Yuan et al. 2024), or even topic information from scientific literature databases (Zhao et al. 2021; Istrate et al. 2024). Other efforts pair each cell with gene-level context (such as sequence position or chromatin accessibility) to highlight mechanistic relationships (Chen et al. 2024; Fu et al. 2025).

When single cell foundation models are trained using self-supervision, their internal representations can be extracted and used as dynamic embeddings of expression vectors (Fang et al. 2024; Song et al. 2021; Eraslan et al. 2019; Fu et al. 2025; Zhao et al. 2021; Istrate et al. 2024). Multimodal models produce richer representations by training on both expression data and external information, such as known regulatory hierarchies (Zhao et al. 2021; Istrate et al. 2024; Bunne et al. 2024; Lopez et al. 2018). Many such approaches use self-attention to dynamically process tokens, as well as contrastive learning, producing representations with an underlying similarity metric—representing a form of distributional hypothesis for expression vectors (Cui et al. 2024; Theodoris et al. 2023; Vaswani et al. 2017; Han et al. 2022).

4 Are cells or genes the “words” of single-cell biology?

Many large-scale pretrained models for genomic data directly adapt language architectures, treating the genome as a large body of text, with nucleotides acting as an alphabet and genes as words (Consens et al. 2025; Ji et al. 2021; Levine et al. 2024; Rizvi et al. 2025; Rosen et al. 2023; Cui et al. 2024). Genes thus may seem to be a more natural analogue to words in statistical learning frameworks. However, this equivalence has limits: genes do not recur within a single genome, and so identifying variations in their function across cells or individuals requires expression information, a quantity without an obvious analogue in natural language. Instead, from the perspective of the distributional hypothesis, cells, not genes, represent minimal tokens, because similarity among cells can be inferred from recurring patterns across different biological contexts. As a result, many proposed applications of virtual cell models, such as cell type identification or lineage tracing, implicitly treat cells as words (Bunne et al. 2024; Pearce et al. 2025). Context thus arises from neighboring cells, tissue microenvironments, or developmental stages, while genes represent the fixed vocabulary describing each cell token. Ambiguities about the correct unit of tokens also exist in language models: while many language models treat words as tokens, others use finer-grained units such as characters (Boukkouri et al. 2020) or even raw byte sequences (Xue et al. 2022). Similarly, in biological settings, the choice of ”token” is not fixed a priori, but should be defined at the level where meaningful context recurs.

5 Analogies as manifolds in embedding space

Effective language embeddings encode semantic relationships as distances (Mikolov et al. 2013b). For example, the vector from ”Sacramento” to ”California” in embedding space may match the vector from ”Austin” to ”Texas.” As a result, vector arithmetic in word2vec solves unseen analogy problems from college admissions exams, even without retraining (Fig. 2A)(Liu et al. 2017; Turney and Littman 2005). Embedding space thus unfolds computation into a higher-dimensional space in which reasoning coincides with distances (Ushio et al. 2021). In this sense, early word embeddings foreshadowed modern works on in-context learning and zero-shot inference, phenomena in which sufficiently-large models are able to perform new tasks not seen during training (Kojima et al. 2022).

Analogies and low-dimensional manifolds. (A) Embeddings of particular sequences of tokens using the model of Fig. 1, with examples of escalating manifolds (red and blue lines), which overlap in regions with similar meaning (weak polysemy). A token with strong polysemy appears at an intermediate location (purple circle). An example of an analogy relationship encoded as nearly-congruent difference vectors (turquoise arrows). While nonlinear embedding methods like UMAP distort the local metric over large scales (Chari and Pachter 2023), the nearby position of the two analogy vectors’ heads and tails protects their congruency. (B) RNA Velocity applied to developing endocrine cells in the pancreas (Bastidas-Ponce et al. 2019; La Manno et al. 2018). Vectors correspond to development direction, and color corresponds to pseudotime assigned via diffusion components. Cell types along the differentiation axis are overlaid.

Organismal cell atlases exhibit well-defined clusters associated with cell and tissue types. However, cells with different compositions but similar functions can nonetheless occupy similar relative locations in embedding space. For example, immune cells form subtypes within different organ groups, such as Kupffer cells in the liver or microglia in the brain. In whole-organism cell atlases, these cells typically appear in separate clusters associated with their primary organ groups. However, within each organ cluster, they occupy similar positions relative to other cells, underscoring their analogous roles (Wang et al. 2022; Suo et al. 2022; Gautier et al. 2012).

Language embeddings also capture continuous relationships among tokens. For example, escalating sequences like ”good,” ”better,” or ”best” map to linear sequences in word2vec (Mikolov et al. 2013b). Generally, word embeddings exhibit high anisotropy (Mimno and Thompson 2017; Ethayarajh 2019), with embeddings spanning low-dimensional manifolds within the higher-dimensional representation space. Depending on the language, this manifold has effective dimensionality $[eqn]$ , even when the feature dimension is $[eqn]$ (Mu et al. 2018). These manifolds capture gradations in meanings among similar words, shifts in a word’s meaning over time, singular-plural pairs, or even groups of synonyms (Hamilton et al. 2016). For example, in dynamic embeddings produced by large language models, days of the week and calendar months map onto circular manifolds, while colors and years map onto linear manifolds (Engels et al. 2025; Modell et al. 2025). Consistent with these manifolds representing informative subspaces, the performance of embeddings in downstream tasks initially increases with the embedding dimension, but it eventually plateaus at a fixed multiple of the manifold dimension (Yin and Shen 2018).

Similar low-dimensional manifolds arise in cell embeddings. Across different datasets, cell replication cycles and circadian rhythms form rings (Kowalczyk et al. 2015), spatially-extended tissues form grids (Adler et al. 2019; Nitzan et al. 2019), and cell differentiation hierarchies form branches (Fig. 2B) (Paul et al. 2015). In one well-known case, populations of blood cells of mixed maturity form a pitchfork in embedding space, illustrating a continuous progression from stem cells to different types of blood cells (Paul et al. 2015). Gene expression manifolds have a typical intrinsic dimensionality $[eqn]$ , compared to the $[eqn]$ genes measured in a typical single-cell experiment (Sritharan et al. 2021). Just as word embeddings trace their properties to low-rank structure in the word-context mutual information matrix, biological processes confer low-rank structure on count matrices (Nitzan and Brenner 2021; Thibeault et al. 2024).

Theoretical models of word embeddings frame text generation as a stochastic dynamical system, with sentence formation as a random walk in token embedding space (Arora et al. 2016, 2015; Hashimoto et al. 2016). Under this framework, semantic manifolds act as kinetic traps for the walk, with synonymous tokens acting as basins, and connective phrases acting as bridges. In single-cell analysis, diffusion maps simulate the action of many random walks through expression space (Coifman and Lafon 2006; Wolf et al. 2019). These methods represent a standard approach to calculating pseudotime, which orders unsorted cell embeddings to identify temporal progressions of cells (like developmental stages) (Haghverdi et al. 2016; Setty et al. 2019). In dynamic word embeddings, or in static embeddings trained on corpora from different historical periods, the relative positions of words gradually shift over time. This semantic drift may be quantified using a calculation resembling pseudotime (Bamler and Mandt 2017).

6 Cross-lingual embeddings

Many features of natural language, such as parts-of-speech, intensifiers, and modifiers, recur across languages. For example, while English and Sanskrit have different inflections and character sets, they exhibit similar verb conjugations and noun declensions. Machine translation models must disentangle these distinctions to construct maps between different languages’ embedding spaces. A common approach is an encoder-decoder translation model, which trains a model to map sentences onto a universal representation in a latent space (Wu et al. 2016). For example, a Sanskrit encoder maps a sentence into the latent space, and an English decoder then translates it. Syntax information, such as word ordering or inflection, is typically distinct among languages and thus not necessarily preserved in the latent space. In contrast, manifolds associated with semantic content remain conserved, and the low-dimensional latent coordinates capture information such as token positions and parts-of-speech (Artetxe and Schwenk 2019; Chang et al. 2022). Taking this approach even further, cross-lingual translation models construct a single shared latent space from many languages. These models typically outperform single-pair translation models, particularly for languages with less available training data, like Swahili or Urdu (Conneau 2019).

Could the same effect hold for rare cell types? One analogy for cross-lingual latent spaces is shared embeddings of cell types across distinct organisms. Statistical alignment methods may be used to combine cell type populations across species with similar tissue groups (Fig. 3A) (Butler et al. 2018; Stein-O’Brien et al. 2019; Song et al. 2023; Tarashansky et al. 2021; Kriebel and Welch 2022; Yang et al. 2024; Pearce et al. 2025). Like understudied languages, rarer cell types benefit from integrated analysis; for example, in a joint embedding of human and mouse pancreatic cells, a combined embedding better resolves subpopulations associated with stress response during protein assembly (Butler et al. 2018). Recent works extend this concept by proposing universal cell embeddings, in which a single foundation model is trained on data spanning subjects, species, and even sequencing modalities (Rosen et al. 2023; Lopez et al. 2018; Rosen et al. 2024). The resulting embedding exhibits emergent properties, including zero-shot embedding of new species or tissues without retraining.

Mechanistic interpretability in single-cell foundation models. (A) Common architectural features and target tasks for single-cell foundation models. (B) Mechanistic interpretability methods for single cell embeddings. (Left) Intrinsic dimensionality may be calculated directly from expression profiles, or from internal activations of the model. Inset shows the intrinsic dimensionality of staged expression profiles from developing mice. Panel adapted from Ref. (Biondo et al. 2024). (Right) Sparse autoencoders are trained in an unsupervised manner to reconstruct internal activations of foundation models, by mapping activations to sparse combinations of features in a latent dictionary. Inset shows application of sparse autoencoders to the activations of the Universal Cell Embedding model on a dataset of human bone marrow. The left subpanel corresponds to annotated cell types, while the right corresponds to the decoding of a single latent unit. Panels adapted from Ref. (Schuster 2024).

As single-cell foundation models become more contextual and high-dimensional, the geometry of their embeddings may encode nontrivial biological structure, such as indirect regulatory grammars. One approach to probing embedding geometry is topological data analysis, a set of tools for analyzing high-dimensional point clouds. In language models, these methods detect cusp-like singularities that form due to polysemy (Jakubowski et al. 2020). On gene expression data, topological methods quantify the degree to which processes like developmental lineages produce low-dimensional manifolds or branches (Palande et al. 2023; Korem et al. 2015). Newly-introduced robust statistical estimators of the intrinsic dimensionality of point clouds may be used to probe internal representations in artificial neural networks (Fig. 3B, left) (Facco et al. 2017). Recent results use these estimators to show that the intrinsic dimensionality of gene expression correlates with pluripotency, across diverse taxa ranging from mice to zebrafish (Biondo et al. 2024). Several approaches directly constrain representations to enforce particular topological features. For example, imposing hyperbolic structure on embeddings improves resolution of branching processes associated with differentiation (Kuang et al. 2025; Schlüter and Uhler 2025; Klimovskaia et al. 2020; Zhou and Sharpee 2021; Ding and Regev 2021; Bhasker et al. 2025).

The emerging field of mechanistic interpretability examines the reasoning and internal representations of large language models. One such approach, linear probes, trains small linear regression models to predict particular linguistic features, like parts-of-speech or subject-verb agreement, from latent states or internal activations of layers (Alain and Bengio 2016; Mamou et al. 2020; Belinkov and Glass 2019). This approach quantifies how explicitly different features are represented, and can identify where semantic versus grammatical information resides within the model. In single-cell foundation models, linear probes identify gene families that the model weighs particularly highly in making predictions, such as by highlighting inflammation and heatshock genes in immune cell datasets (Pedrocchi et al. 2024) However, linear probes typically require a supervision signal, such as a ground truth dataset showing known effects of knockdowns, motivating the need for unsupervised methods. Sparse autoencoders train shallow, wide neural networks to encode the activations of individual layers of large models (Fig. 3B, right). The width of the latent space, coupled with a strong sparsity penalty, encourages sparse autoencoders to map single concepts onto each latent dimension, thus unfolding polysemous activations in the original large model (Gao et al. 2025). In single-cell foundation models, sparse autoencoders isolate cell types that otherwise would be difficult to distinguish in embeddings (Schuster 2024).

7 Task-independence and amortization of reasoning

Word embeddings derive their utility from their independence from downstream tasks. Training frontier models typically requires access to large amounts of computing resources, with contemporary models like RoBERTa-large optimizing as many as $[eqn]$ parameters over $[eqn]$ language tokens (Liu et al. 2019). However, once trained, these models may be used as a preprocessing step for downstream tasks like sentiment classification. Single-cell technologies share a goal of identifying general representations that foreground relevant biological variables, while removing uninformative variation like batch or technical effects (Rosen et al. 2023). Embeddings thus represent one motivation for the emerging foundation model paradigm in both language modelling and single-cell analysis, which argues that large-scale pretraining on diverse datasets leads to simpler starting representations for smaller-scale tasks. Task-independent embeddings thus serve to amortize computation.

Large-scale pretrained models exhibit inference-time computation, in which they spontaneously solve new tasks without additional training (Kojima et al. 2022). For example, large language models can be prompted to produce poetry with meter and scansion that are unseen in their training corpus (Walsh et al. 2024). The underlying mechanism, in-context learning, exploits the emergent ability of large models not only to retrieve, but also process, information during inference. Inference-time symbolic reasoning appears to improve with model scale, with language models recently advancing from solving elementary-school word problems to standardized mathematics exams for undergraduates (Wu et al. 2024; Liu et al. 2024; Ahn et al. 2024). Achieving similar results for biological datasets represents a frontier for single-cell foundation models. Several recent models exhibit forms of inference-time reasoning, such as zero-shot embedding of novel cell types, prediction of protein interactions, and anticipation of responses to genetic perturbations (Cui et al. 2024; Theodoris et al. 2023). However, these tasks have unclear difficulty compared to language modeling tasks like standardized exams, leading to conflicting results regarding the efficacy of current single-cell foundation models (Kedzierska et al. 2023; Csendes et al. 2025; Ahlmann-Eltze et al. 2025; Wenteler et al. 2024). A better test may be the ability of large models to decipher the indirect, multiscale, and highly nonlinear logic of many regulatory circuits. For example, the immune system implements elaborate combinatorial receptor-ligand interactions, phosphorylation cascades, and feedback loops, in order to discriminate self from non-self antigens (Germain 2001; Chakraborty and Weiss 2014). Parsing these logical circuits is akin to solving a complex mathematical reasoning problem, requiring models that can effectively process symbolic information.

8 Conclusion: limitations of the analogy

Drawing parallels between language models and single-cell embeddings reveals shared principles in how high-dimensional spaces encode structured, context-dependent information. However, the analogy between cells and word tokens has natural limits, presaging potential limitations of foundation models for single-cell genomics.

In natural languages, ”context” arises from discrete, ordered sequences, where exact position and co-occurrence of tokens convey meaning (Harris 1954; Firth 1957). In contrast, a cell’s context arises from a web of spatial relationships, signaling interactions, lineage history, and environmental conditions—most of which are not naturally represented as ordered sequences. No two cells are exact replicates, and their surrounding biochemical and environmental context can never be fully reproduced (Stuart and Satija 2019; Hicks et al. 2018). Furthermore, unlike words in a corpus, cells cannot be resampled from the same underlying distribution without perturbing the system, limiting the robustness and stationarity assumptions of statistical analogies. A key challenge for emerging virtual cell models will thus be their ability to distill informative context in order to resolve polysemy in cell states while still finding concise representations. Other challenges include integrating diverse experimental contexts without loss of biological specificity, and capturing nonlinear regulatory logic within embedding spaces (Fang et al. 2024; Eraslan et al. 2019; Fu et al. 2025; Cui et al. 2024).

In contrast to language, where token context is explicit and uniformly structured, biological context is multiscale, incomplete, and often indirect. Moreover, while neighboring words directly inform language token context, a cell’s relevant ”neighbors” may be defined in multiple, potentially conflicting ways (physical proximity, developmental stage, functional similarity). Resolving this ambiguity requires contextual cell embeddings, integrating heterogeneous modalities such as spatial transcriptomics, proteomics, chromatin accessibility, or lineage tracing to derive a unified representation. Truly multimodal foundation models offer a potential solution, by treating auxiliary information—like gene ontologies, medical literature, or known regulatory hierarchies—on an even footing with expression data, thus decoupling modality-specific factors from informative biological variation (Hu et al. 2025; Levine et al. 2024; Rizvi et al. 2025; Theodoris et al. 2023). However, combining modalities at scale raises technical challenges: batch effects, inconsistent coverage among different modalities, and the difficulty of defining context windows across different samples (Armingol et al. 2021; Bastidas-Ponce et al. 2019). Even if relevant auxiliary information is available, its incorporation into embeddings can amplify biases in the experimental design, leading to overfitting to specific tissue types, organisms, or experimental protocols (Rosen et al. 2023; Cui et al. 2024). Identifying such effects will be necessary in future virtual cell models, and represents an area where mechanistic interpretability and low-dimensional manifold discovery may prove informative.

The cell token analogy also breaks down when considering the dynamical nature of biological systems. In languages, dynamic embeddings model variability in token meaning without altering the underlying corpus. Yet in biology, a cell’s ”meaning” irreversibly changes over time through differentiation, signaling, senescence, and adaptation (La Manno et al. 2018; Haghverdi et al. 2016; Bergen et al. 2020). Capturing these processes requires embedding models that are temporally aware, capable of representing continuous trajectories, and robust to sparse or noisy longitudinal data. Moreover, truly contextual embeddings for cells must incorporate causal relationships—distinguishing between correlation and regulatory influence—a level of mechanistic grounding without an obvious equivalence in grammatical rules. Improved benchmarks, which test the ability of foundation models to parse complex and indirect regulatory logic, will help help transform cell representations from descriptive maps into predictive, reasoning-ready representations for biology.

Bibliography130

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Adler M , Kohanim YK, Tendler A et al Continuum of gene-expression profiles provides spatial division of labor within a differentiated cell type. Cell Syst 2019;8:43–52.e 5.30638811 10.1016/j.cels.2018.12.008 · doi ↗ · pubmed ↗
2Ahlmann-Eltze C , Huber W, Anders S. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nat Methods 2025;22:1657–61.40759747 10.1038/s 41592-025-02772-6PMC 12328236 · doi ↗ · pubmed ↗
3Ahn J , Verma R, Lou R et al Large language models for mathematical reasoning: Progresses and challenges. ar Xiv preprint. ar Xiv: 2402.00157, 2024, preprint: not peer reviewed.
4Alain G , Bengio Y. Understanding intermediate layers using linear classifier probes. In: The Fourth International Conference on Learning Representations (ICLR), 2016.
5Allen C , Balazevic I, Hospedales T. What the vec? towards probabilistically grounded embeddings. Adv Neural Inf Process Syst 2019:32.
6Armingol E , Officer A, Harismendy O et al Deciphering cell–cell interactions and communication from gene expression. Nat Rev Genet 2021;22:71–88.33168968 10.1038/s 41576-020-00292-x PMC 7649713 · doi ↗ · pubmed ↗
7Arora S , Li Y, Liang Y et al Random walks on context spaces: towards an explanation of the mysteries of semantic word embeddings. ar Xiv preprint. ar Xiv: 150203520, 2015:385–399, preprint: not peer reviewed.
8Arora S , Li Y, Liang Y et al A latent variable model approach to pmi-based word embeddings. TACL 2016;4:385–99.