# The cell as a token: high-dimensional geometry in language models and cell embeddings

**Authors:** William Gilpin

PMC · DOI: 10.1093/bioinformatics/btaf595 · Bioinformatics · 2025-10-30

## TL;DR

This review compares how language models and single-cell embeddings use high-dimensional vector spaces, and argues that advances in understanding language embeddings can inform single-cell analysis.

## Contribution

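It treats cells like tokens: both fields partition unstructured data into units embedded in a high-dimensional vector space, so methods developed to study language embeddings can be carried over to cell atlases and virtual cell models.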

## Key findings

- Token context influences embedding space geometry in both language and cell data.
- Low-dimensional manifolds affect the robustness and interpretation of embedding spaces.
- Developments in language foundation models, such as interpretability probes and in-context reasoning, can inform the construction of cell atlases and the training of virtual cell models.
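
As a rough illustration of the interpretability-probe idea in the last finding, the sketch below fits a linear probe to synthetic embeddings. Everything in it is an assumption made for illustration: the embeddings are random vectors rather than outputs of any model discussed in the review, the binary label stands in for a hypothetical cell property, and `fit_linear_probe` and `probe_accuracy` are names invented here, not functions from the paper's code.

```python
import numpy as np

def fit_linear_probe(embeddings, labels):
    """Fit a least-squares linear probe mapping embeddings to +/-1 targets."""
    targets = 2.0 * labels - 1.0  # recode {0, 1} labels as {-1, +1}
    weights, *_ = np.linalg.lstsq(embeddings, targets, rcond=None)
    return weights

def probe_accuracy(weights, embeddings, labels):
    """Fraction of embeddings whose probe output has the correct sign."""
    predictions = (embeddings @ weights > 0).astype(int)
    return float(np.mean(predictions == labels))

# Synthetic stand-in for an embedding space (assumption: 16 dimensions,
# with the label encoded along one hidden linear direction).
rng = np.random.default_rng(0)
dim = 16
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

train = rng.normal(size=(400, dim))
held_out = rng.normal(size=(200, dim))
train_labels = (train @ direction > 0).astype(int)
held_out_labels = (held_out @ direction > 0).astype(int)

weights = fit_linear_probe(train, train_labels)
accuracy = probe_accuracy(weights, held_out, held_out_labels)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy from a probe this simple suggests the property is linearly readable from the embedding. With real embeddings, a baseline probe trained on shuffled labels would help separate genuine structure from probe capacity.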

## Abstract

Single-cell sequencing technology maps cells to a high-dimensional space encoding their internal activity. Recently proposed virtual cell models extend this concept, enriching cells’ representations based on patterns learned from pretraining on vast cell atlases.

This review explores how advances in understanding the structure of natural language embeddings inform ongoing efforts to analyze single-cell datasets. Both fields process unstructured data by partitioning datasets into tokens embedded within a high-dimensional vector space. We discuss how the context of tokens influences the geometry of embedding space, and how low-dimensional manifolds shape this space’s robustness and interpretation. We highlight how new developments in foundation models for language, such as interpretability probes and in-context reasoning, can inform efforts to construct cell atlases and train virtual cell models.
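
The notion of a low-dimensional manifold inside a high-dimensional embedding space can be made concrete with a small sketch. It is not taken from the paper's repository. It assumes synthetic "cells" whose 50 measured features are driven by only 3 latent factors plus small noise, and it uses a PCA variance threshold as one simple (linear) proxy for dimensionality. The function name `effective_dimension` and all parameter values are hypothetical.

```python
import numpy as np

def effective_dimension(X, variance_threshold=0.95):
    """Number of principal components needed to reach the variance threshold."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    # First index at which cumulative explained variance reaches the threshold.
    return int(np.searchsorted(explained, variance_threshold) + 1)

# Synthetic data (assumption): 500 cells, 50 features, 3 latent factors.
rng = np.random.default_rng(0)
n_cells, n_features, latent_dim = 500, 50, 3

latent = rng.normal(size=(n_cells, latent_dim))
# Orthonormal columns embed the latent factors into feature space.
basis, _ = np.linalg.qr(rng.normal(size=(n_features, latent_dim)))
X = latent @ basis.T + 0.01 * rng.normal(size=(n_cells, n_features))

estimated = effective_dimension(X)
print(f"ambient dimension: {X.shape[1]}, estimated dimension: {estimated}")
```

PCA only detects linear structure, so a curved manifold would typically need more components than its intrinsic dimension. Nonlinear estimators are the usual alternative in that case.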

Code is available at https://github.com/williamgilpin/celltoken.

## Full-text entities

- **Genes:** CD4 (CD4 molecule) [NCBI Gene 920] {aka CD4mut, IMD79, Leu-3, OKT4D, T4}, CD8A (CD8 subunit alpha) [NCBI Gene 925] {aka CD8, CD8alpha, IMD116, Leu2, p32}
- **Diseases:** inflammation (MESH:D007249)
- **Species:** Danio rerio (leopard danio, species) [taxon 7955], Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12619638/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12619638/full.md

## References

130 references — full list in the complete paper: https://tomesphere.com/paper/PMC12619638/full.md

---
Source: https://tomesphere.com/paper/PMC12619638