# OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Graph Language Foundation Modeling

**Authors:** Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Guntaas Shergill, Nicholas Hadas, Lars Schimmelpfennig, Levi Kaster, Di Huang, Guangfu Li, S. Peter Goedegebuure, David DeNardo, Li Ding, Ryan C. Fields, J Philip Miller, Pirooz Eghtesady, Carlos Cruchaga, William Buchser, Jonathan Cooper, Marco Sardiello, Patricia Dickson, Yixin Chen, Michael Province, Philip Payne, Fuhai Li

PMC · DOI: 10.21203/rs.3.rs-8774770/v1 · Research Square · 2026-02-17

## TL;DR

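This paper introduces OmniCellTOSG, a dataset of Text-Omic Signaling Graphs, and CellTOSG-FM, a multimodal graph language foundation model. Together they combine biomedical text, single-cell omic data, and signaling networks to support biomedical research and precision medicine.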

## Contribution

The novel Text-Omic Signaling Graph (TOSG) unifies textual knowledge, omic data, and signaling networks for foundation modeling.
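To make the idea of a single structure holding all three modalities concrete, here is a minimal sketch in Python. It is not the authors' schema: the class names `Node` and `TextOmicSignalingGraph`, the field names, and the toy values are all assumptions made for illustration. Each node carries a text description and an omic value, and directed edges carry the signaling relation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """One entity in a hypothetical TOSG (names and fields are assumptions)."""
    name: str     # entity identifier, e.g. a gene symbol
    text: str     # human-readable description (textual modality)
    value: float  # quantitative measurement (omic modality)


@dataclass
class TextOmicSignalingGraph:
    """Toy container: nodes hold text + omic data, edges hold signaling links."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Each edge is (source, target, relation); the relation label is free text here.
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, name: str, text: str, value: float) -> None:
        self.nodes[name] = Node(name, text, value)

    def add_edge(self, source: str, target: str, relation: str) -> None:
        # Reject edges that point at entities with no text/omic record.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, target, relation))

    def neighbors(self, name: str) -> List[str]:
        """Return downstream targets of `name` in insertion order."""
        return [t for s, t, _ in self.edges if s == name]


# Illustrative toy values only; they are not taken from the dataset.
g = TextOmicSignalingGraph()
g.add_node("EGFR", "Epidermal growth factor receptor", 2.5)
g.add_node("KRAS", "KRAS proto-oncogene, GTPase", 0.8)
g.add_edge("EGFR", "KRAS", "activates")
```

Keeping the text and the numeric value on the same node means a model that consumes this structure can look up both modalities for any endpoint of a signaling edge without a separate join.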

## Key findings

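- OmniCellTOSG comprises approximately half a million meta-cell TOSGs derived from around 80 million single-cell and single-nucleus RNA-seq profiles across organs and diseases.
- CellTOSG-FM outperforms existing omic FMs across diverse downstream tasks and provides interpretable insights into disease-associated targets and signaling pathways.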

## Abstract

With the rapid growth of large-scale single-cell omic datasets, omic foundation models (FMs) have emerged as powerful tools for advancing research in life sciences and precision medicine. However, most existing omic FMs rely primarily on numerical transcriptomic data by sorting genes as sequences, while lacking explicit integration of biomedical prior knowledge and signaling interactions that are critical for scientific discovery. Here, we introduce the Text-Omic Signaling Graph (TOSG), a novel data structure that unifies human-interpretable biomedical textual knowledge, quantitative omic data, and signaling network information. Using this framework, we construct OmniCellTOSG, a large-scale resource comprising approximately half a million meta-cell TOSGs derived from around 80 million single-cell and single-nucleus RNA-seq profiles across organs and diseases. We further develop CellTOSG-FM, a multimodal graph language FM, to jointly analyze textual, omic, and signaling network context. Across diverse downstream tasks, CellTOSG-FM outperforms existing omic FMs and provides interpretable insights into disease-associated targets and signaling pathways.
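The abstract says the TOSGs are built from meta-cells rather than individual cells, but this summary does not state how cells are grouped or pooled. The sketch below shows one common way such pooling can work, purely as an assumption: per-gene expression is averaged over the cells assigned to each group. The function name `aggregate_meta_cells`, the input layout, and the toy numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def aggregate_meta_cells(
    profiles: Dict[str, Dict[str, float]],
    group_of: Dict[str, str],
) -> Dict[str, Dict[str, float]]:
    """Mean-pool per-gene expression over the cells in each group.

    Assumption: a gene missing from a cell's profile counts as zero,
    so every sum is divided by the number of cells in the group.
    """
    sums: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    counts: Dict[str, int] = defaultdict(int)
    for cell_id, expression in profiles.items():
        group = group_of[cell_id]
        counts[group] += 1
        for gene, value in expression.items():
            sums[group][gene] += value
    return {
        group: {gene: total / counts[group] for gene, total in genes.items()}
        for group, genes in sums.items()
    }


# Toy input: three cells, two hypothetical meta-cell groups.
profiles = {
    "cell_1": {"EGFR": 2.0, "KRAS": 1.0},
    "cell_2": {"EGFR": 4.0},
    "cell_3": {"KRAS": 3.0},
}
group_of = {"cell_1": "meta_1", "cell_2": "meta_1", "cell_3": "meta_2"}
meta = aggregate_meta_cells(profiles, group_of)
```

Pooling of this kind reduces the number of graphs to build and smooths over the sparsity of individual single-cell profiles; the grouping strategy actually used for OmniCellTOSG is described in the full paper, not here.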

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12934919/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12934919/full.md

## References

65 references — full list in the complete paper: https://tomesphere.com/paper/PMC12934919/full.md

---
Source: https://tomesphere.com/paper/PMC12934919