OmniCellTOSG: The First Cell Text-Omic Signaling Graphs Dataset for Graph Language Foundation Modeling
Heming Zhang, Tim Xu, Dekang Cao, Shunning Liang, Guntaas Shergill, Nicholas Hadas, Lars Schimmelpfennig, Levi Kaster, Di Huang, Guangfu Li, S. Peter Goedegebuure, David DeNardo, Li Ding, Ryan C. Fields, J Philip Miller, Pirooz Eghtesady, Carlos Cruchaga, William Buchser

TL;DR
This paper introduces a new dataset and model that combine text, gene data, and signaling networks to improve biomedical research and precision medicine.
Contribution
The novel Text-Omic Signaling Graph (TOSG) unifies textual knowledge, omic data, and signaling networks for foundation modeling.
Findings
OmniCellTOSG includes half a million TOSGs from 80 million single-cell RNA-seq profiles.
CellTOSG-FM outperforms existing omic models in downstream tasks and provides interpretable insights.
Abstract
With the rapid growth of large-scale single-cell omic datasets, omic foundation models (FMs) have emerged as powerful tools for advancing research in life sciences and precision medicine. However, most existing omic FMs rely primarily on numerical transcriptomic data by sorting genes as sequences, while lacking explicit integration of biomedical prior knowledge and signaling interactions that are critical for scientific discovery. Here, we introduce the Text-Omic Signaling Graph (TOSG), a novel data structure that unifies human-interpretable biomedical textual knowledge, quantitative omic data, and signaling network information. Using this framework, we construct OmniCellTOSG, a large-scale resource comprising approximately half million meta-cell TOSGs derived from around 80 million single-cell and single-nucleus RNA-seq profiles across organs and diseases. We further develop…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSingle-cell and spatial transcriptomics · Bioinformatics and Genomic Networks · Biomedical Text Mining and Ontologies
Main
1
The human organism comprises ~37.2 trillion cells that arise from a single zygote and share a common genome, yet acquire specialized identities through context-dependent signaling. Such signaling is orchestrated by transcriptional programs, protein abundance and modification, and protein–protein interactions, and is further conditioned by age, sex, diet, environmental exposures, and disease state. Despite decades of discovery, major gaps persist: system-level, cell-resolved inventories of signaling entities and edges; quantitative models of network rewiring across lifespan and pathology; principled detection of disease-relevant subpopulations and their intercellular crosstalk; and actionable strategies to perturb these networks to prevent or reverse disease. Single-cell and single-nucleus RNA sequencing (sc/snRNA-seq) now provide transcriptome-wide measurements at cellular resolution, enabling delineation of cell types/subtypes in healthy and diseased tissues and the study of signaling interactions within niches or microenvironments. Large-scale efforts, such as the CZ CELLxGENE [1, 2], the Human Cell Atlas [3], the Brain Cell Atlas [4], and numerous disease-focused studies [5, 6], have generated hundreds of millions of profiles that support systematic interrogation of signaling. These resources make it feasible to ask not only which genes are active, but how groups of genes/proteins with distinct abundance levels coordinate to realize specific biological functions across diverse cellular contexts.
Currently, foundation models trained via self-supervised objectives have transformed representation learning. Most foundation single-cell models operate on expression vectors and typically do not incorporate explicit pathway structure, including SCimilarity [7], GeneFormer [8], scGPT [9], scFoundation [10], and scCello [11]. Despite recent progress, these approaches generally omit explicit modeling of signaling graphs, limiting inference of dysfunctional pathways and decoding of graph-structured signaling patterns across conditions. Recent progress on training graph foundation models has explored masked reconstruction objectives within masked graph modeling, with node-masking methods such as GraphMAE [12] as representative examples. Systematic analysis suggests that masking edges rather than nodes yields stronger performance on structure-sensitive tasks, including link prediction and topology recovery, and better captures relational patterns [13]. This is particularly salient for cellular signaling, where functional meaning arises from interaction topology rather than isolated node attributes. Interpreting these structure-dependent mechanisms typically requires both biomedical prior knowledge and topological information, yet existing models remain limited in both aspects: LLMs often struggle with domain-specific biomedical reasoning and may produce hallucinated or unreliable outputs [14], whereas GNNs can be limited in modeling complex higher-order relational structures [15, 16]. In addition to these limitations, purely numeric omicbased foundation models typically treat molecular measurements as isolated features and rarely incorporate human-interpretable textual biomedical priors and signaling-network context, which limits mechanistic interpretability and hypothesis-driven discovery. Prior work suggests that integrating biologically grounded knowledge graphs with quantitative omic features can improve predictive accuracy [17] and strengthen mechanistic reasoning by explicitly capturing cellular interactions [18]. Collectively, these challenges motivate the development of a unified representation that jointly incorporates textual priors, omic evidence, and signaling topology.
In this study, for the first time, we introduce (i) Text–Omic Signaling Graphs (TOSGs), a novel data format that unifies textual biological priors (e.g., gene/protein functions, mechanisms) with numerical omic data to support graph-based interpretation of cell signaling; (ii) OmniCellTOSG, a large-scale biomedical AI resource aggregating approximately 80 million sc/snRNA-seq profiles across tissues, cell types, diseases, ages, sexes, and related attributes, providing a comprehensive data foundation to support the development and benchmarking of next-generation AI foundation models for scientific discovery at an expert level; and (iii) CellTOSG Foundation Model (CellTOSG-FM), a multi-modal graph language foundation model that couples textual biological priors and numerical omic evidence with topological signaling network over TOSGs and cross-modalities encoders to augment graph representation learning and to support downstream tasks, including cell-type annotation, cell attribute classification, and signaling inference with interpretable graph rationales (iv) All the data and code are publicly accessible. OmniCellTOSG dataset is accessible at: huggingface.co/datasets/FuhaiLiAiLab/OmniCellTOSG_Dataset and CellTOSG-FM code is available at: github.com/FuhaiLiAiLab/OmniCellTOSG
Results
2
OmniCellTOSG ecosystem overview
2.1
We present an integrated ecosystem that couples a large, knowledge-grounded single-cell resource with reproducible tooling and a multi-modal graph language foundation model (see Figure 1). OmniCellTOSG aggregates approximately million) million single-cell and single-nucleus RNA sequencing profiles into million) representative meta-cells using the archetypal analysis framework implemented in SEACells [19]. This aggregation preserves biological diversity across tissues, diseases, age groups, and experimental conditions (Figure 1a). Based on the resulting meta-cell transcriptomic expression matrices, we next construct Text–Omic Signaling Graphs (TOSGs) by mapping the transcriptomic entities to transcript nodes and introducing corresponding protein nodes, with edges linking transcript entities to their associated proteins. The resulting graph therefore contains a total of nodes, where (see Section 4.1 for details). These graphs integrate quantitative omics measurements with curated biological knowledge from BioMedGraphica on the vertex set (see Section 4.2). Formally, the overall entity set is defined as , with , thereby linking molecular signals with established biomedical prior knowledge. TOSG supports both matched and virtual entities and records intra-cell and nucleus-level relations, enabling graph-structured signaling beyond expression vectors alone (see Figure 1b). For each entity, numerical omics features are used to form a unified representation. Transcript nodes contain measured transcriptomic expression values, while virtual protein nodes are zero-initialized because no proteomic measurements are available. These features are assembled into a global omics feature matrix . In addition to the transcriptomic and virtual proteomic features, we incorporate an auxiliary textual annotation dataset, , which provides complementary semantic information of entity name, description and biosequences (i.e., RNA sequences and protein sequences) for each node (see Section 4.2 and Figure 1c for details). Moreover, CellTOSG_Loader provides a NumPy-ready query–load–balance pipeline that constructs stratified, unbiased cohorts across user-specified facets (cell type, tissue, disease, data source, age/sex, etc.), mitigates class imbalance for pretraining and downstream evaluation. In addition, CellTOSG-FM integrates textual biological priors and omic features through cross-modal encoders. The graph encoder then performs message propagation within cells and across TOSGs to produce fused representations that enable cell-type annotation, disease classification, signaling-pathway inference, and drug-response prediction, together with interpretable subgraph rationales (see Section 4.4.1 and Figure 1c for details). Collectively, these components constitute a scalable, mechanism-focused framework that standardizes data ingestion, supports fair and reproducible experimentation, and facilitates knowledge-augmented modeling of cellular signaling at scale. The subsequent sections detail each component.
OmniCellTOSG Dataset
2.2
We introduce OmniCellTOSG, a large-scale single-cell resource that integrates transcriptomic profiles from CellxGene [1, 2], the Brain Cell Atlas [4], GEO [20], Single Cell Portal [21], and the Human Cell Atlas [22], paired with rich textual annotations spanning diverse tissues and disease states. Starting from 79,195,364 cells, we performed rigorous preprocessing—including quality control, normalization, and harmonization of organ/tissue and disease labels. Cells were aggregated into meta-cells using SEACells [19] and coupled with prior biological knowledge from BioMedGraphica [23] to assemble Text–Omic Signaling Graphs (TOSGs), yielding a curated set of 395,317 meta-cells. Attribute sets were standardized to the Cell Ontology [24] (766 cell types across 65 tissues) and disease annotations were mapped to the BioMedGraphica nomenclature (140 disease states), with remaining fields normalized for retrieval metadata. Following profile harmonization, transcriptomic data were linked to transcript entities and their downstream protein counterparts in BioMedGraphica to construct TOSGs with both matched and virtual entities, capturing nucleus-level and intra-cell relationships. In total, the graph comprises 533,458 entities and 16,637,405 relations (152,585 internal interactions and 16,484,820 protein–protein interactions). A high-level overview of the integrated resource, OmniCellTOSG, is presented in Figure 2b, and the full data-processing methodology is detailed in Section 4.1.
To ensure reproducibility and a model-ready data format, we release CellTOSG_Loader (Section 4.3), which transforms user-specified parameters into executable queries, loads matched subsets, and performs stratified cohort balancing to mitigate confounding. The loader further applies platform-aware and sc/snRNA-aware batch correction via ComBat–seq [25, 26] to reduce variance across data sources and profiling platforms (Figure S1). As an illustrative use case, for an Alzheimer’s disease (AD) versus control comparison, the loader matches the control cohort to the AD distribution over sex, age categories, and cell-type composition. Figure 2a outlines the balancing workflow, and Figure 2g shows before/after distributions for AD versus matched normals.
CellTOSG-FM Construction and Pretraining
2.3
We pretrained the model, , in a self-supervised manner on a subset , without using any metadata from the attribute set . As shown in Figure 3, sequence information is encoded with DNA-GPT [27] for RNA sequences (thymine substituted by uracil ) and ProtGPT2 [28] for protein sequences. These encoders provide high-capacity, transferable representations learned from large-scale genomic and proteomic corpora, and have demonstrated strong generalization on diverse downstream biological tasks. Because transcript and protein entities in BioMedGraphica are stable and reused across samples, we freeze the sequence encoders to eliminate redundant computation, reduce overfitting, and ensure reproducibility across runs; users may substitute alternative sequence language models if desired. A trainable omic encoder then maps numerical omic measurements into the same latent space, and a cross-modality encoder integrates the textual/sequence priors with the quantitative omic evidence to form unified entity representations (Figure 3a). This design allows the model to leverage complementary information sources, including semantics and biochemistry from sequences and context-specific variation from omics, within a single representational framework.
Furthermore, we incorporate topological structure by encoding the latent representations with graph encoders. Concretely, messages are propagated to protein entities in two coupled stages: an internal message-passing step that aggregates signals within transcript–protein pairs at the nucleus level, followed by a global propagation step that diffuses information within the cell via the protein–protein interaction topology. The latter is trained with stochastically masked edges (Figure 3b–c), optimizing a joint objective that combines edge reconstruction with a degree-regularization term to calibrate node centrality. Consistent, monotonic reductions in the training objective are observed, and held-out edge recovery improves throughout optimization; the degree-oriented auxiliary term further sharpens hub–periphery structure and stabilizes learning (Figure 3d). Using an edge-masking ratio of 10^−5^ for the self-supervised objective [13], the model reconstructs about 80% of masked edges, and attains an AUC near 0.85 when it converges (see Figure S2). Notably, these outcomes are achieved when pretraining on 5% of OmniCellTOSG for , underscoring the sample efficiency of the approach and its suitability under limited pretraining budgets. Additional details of the CellTOSG-FM pretraining protocol are provided in Section 4.4.
Downstream Tasks Based on CellTOSG-FM
2.4
After pretraining the foundation model CellTOSG-FM, we preserve the model architecture together with its pretrained parameters as a transferable initialization for downstream adaptation. For each downstream task, we employ the CellTOSG_Loader to extract and organize the task-specific datasets for fine-tuning. In our downstream tasks, lung adenocarcinoma (LUAD), atrial fibrillation (AF) and systemic lupus erythematosus (SLE) cohorts are sourced from OmniCellTOSG. We additionally evaluate on an external Alzheimer’s disease (AD) single-cell cohort from the GSE129308 project, which is available across both CELLxGENE and Brain Cell Atlas and was held out from OmniCellTOSG to enable an independent evaluation (see Section 5.4 for processing details). In general, the extracted samples/cells are first embedded using the pretrained foundation model , together with the downstream omic and text encoders, to generate integrated feature representations. Subsequently, gene-level information for each sample/cell is projected into a latent embedding space, denoted as , which serves as the input for task-specific predictors and facilitates efficient adaptation across diverse biological tasks (see Figure 4a and Section 4.5 for details).
CellTOSG-FM Improves Cell Type Annotation
2.4.1
Using cell-specific embeddings generated from CellTOSG-FM, downstream omic and text encoder and downstream message propagation via GNN layers, the downstream cell-type decoder will be applied to predict the cell-type (Figure 4a), and we evaluated annotation performance here on four disease cohorts (AD, LUAD, SLE, and AF). For each cohort, CellTOSG_Loader was used for cohorts sourced from OmniCellTOSG, while an analogous pipeline was applied to the external AD cohort, to construct balanced downstream datasets by matching the empirical distributions over cell types, thereby mitigating class-imbalance effects. To ensure consistent and comparable evaluation across tasks while reducing computational overhead, we further subsampled approximately 1,000 meta-cells for each task-specific dataset. To avoid potential data leakage, samples were partitioned into training and testing sets using donor identity as the splitting criterion, thereby ensuring donor-level independence between the two datasets. The resulting splits typically allocated approximately 20–40% of samples to the test datasets.
Across all four cohorts, CellTOSG-FM matched or exceeded strong baselines (including DNN [29, 30], GCN [31], GAT [32], UniMP [33], scGPT, scFoundation, and scCELLO) with consistent gains in every disease (see Figure 4b and Table S4 for overall performances and Figure 4d–g for more details). These improvements indicate that integrating biological textual priors with numerical omic evidence via a cross-modality encoder, together with knowledge-graph–grounded topology in the graph encoder, yields representations that transfer robustly to disease-specific annotation without task-specific tuning.
Heatmaps derived from the learned cell representations demonstrate clearer block structure and well separated manifold on cell types. Figure 4c illustrates the latent embedding spaces for AF and SLE. After restricting the analysis to the 5,000 highest-variance genes and arranging cells by cell type, distinct cell populations occupy clearly separated regions of the latent space in both datasets, demonstrating the effectiveness of CellTOSG-FM in capturing cell-type-specific differences. UMAP projections show that CellTOSG-FM produces compact, well-separated clusters whose boundaries closely align with reference labels, whereas alternative methods exhibit fragmented clusters and label mixing for several closely related types (Figure 4d–g). The accompanying confusion matrices display stronger diagonal dominance and fewer systematic off-diagonal errors for CellTOSG-FM. Collectively, these analyses demonstrate that the fused text–omic, graph-aware embeddings produced by CellTOSG-FM deliver higher annotation accuracy and cleaner class separability across diverse disease contexts than single-modality or topology-agnostic baselines.
CellTOSG-FM Enhances Cell Classification Accuracy
2.4.2
We extracted disease-specific datasets using CellTOSG_Loader for OmniCellTOSG cohorts and applying an analogous pipeline to the external AD cohort, balancing cohorts by matching empirical cell-type distributions and thereby mitigating class-imbalance effects. To test whether the fused text–omic, graph-aware representations from CellTOSG-FM support accurate prediction of cell-level attributes, we trained task-specific decoders on top of cell embeddings (Figure 4a). Two evaluation settings were considered: disease status (disease vs. normal) in AD, LUAD, SLE and AF sampled dataset; sex classification (male vs. female) in AD and AF sampled dataset. Across all cohorts, CellTOSG-FM achieved the highest accuracies relative to strong baselines (GCN, GAT, DNN, UniMP) (Figure 4b and Tables S5–S6). These gains were obtained without task-specific architectural changes, indicating that integrating biological textual priors with numerical omic evidence, together with knowledge-graph–grounded topology, yields representations that generalize effectively to diverse attribute-prediction tasks.
To assess model robustness, we examined sex classification within AF dataset, stratified by age groups and major cell types (cardiac muscle cell, cardiac blood vessel endothelial cell, fibroblast cell, adipocyte cell, mesothelial cell, and others) (see Table S7 and Table S8). Heatmaps of the learned gene hyperspace embeddings, restricted to the 5,000 highest variance genes, reveal consistent separation patterns that remain evident when cells are stratified by age group and cell type (Figure S3). Collectively, these analyses indicate that CellTOSG-FM enables accurate and robust predictions across cohorts and biological strata, while the embedding spaces exhibit clear and stable group-wise separation, providing additional structural evidence of model robustness.
CellTOSG-FM Is Interpretable to Rank Targets and Signaling Networks
2.5
To illustrate interpretability, signaling targets and pathways were inferred per sample and summarized at the cohort level (Figure 5). Latent representations, , were translated into pairwise affinities between protein entities and then constrained to the protein–protein interaction (PPI) network, yielding edge weights that respect known biology. Node importance was computed by aggregating the strengths of incident edges and modulating by corresponding molecular signals, producing saliency scores that reflect both connectivity and activity. Cohort-level subgraphs were formed from top-ranked genes and pruned to eliminate spurious star-like branches while preserving informative connections, resulting in compact, connected networks aligned to each disease context (see Figure 5a and Section 4.5 for details).
For each sample, directed edge weights between BioMedGraphica nodes were first aggregated and mapped to their corresponding gene names. Edge weights, collected from the attention-based model, sharing the same source–target gene pairs were summed within each sample, reciprocal directions were averaged to obtain undirected edge weights, and edges from each individual sample within the same group were combined by aggregating weights of identical gene pairs across samples (Figure 5a). From the attention-based undirected graph, we derived node attention by averaging the attention weights of all edges incident to each node. In parallel, node expression value was derived by selecting, for each gene, the transcript index with the highest mean (min–max normalized) expression across samples, and group-wise average expressions and Mann–Whitney U -values of comparing disease and control groups were then calculated for each node. Afterwards, the importance of nodes was ranked by using a node importance score [34] defined as the product of node attention and averaged expression of the corresponding group. Subsequently, the most important nodes per group were retained. Then, resulting signaling subgraph was refined by keeping only the largest connected component and pruning star-like leaf artifacts while preserving at least leaves per node, preferentially retaining those with significant -values or otherwise the highest-weight neighbors [35].
Fig 5a summarizes the pipeline from retrieved disease cohorts to targets and signaling networks subgraphs. Fig 5b–d show results for Alzheimer’s disease, where the inferred targets and signaling networks network recovers a compact set of connected signaling modules (red) with multiple statistically supported genes (purple outlines; ), and functional enrichment recapitulates key neurodegenerative processes (Figure 5c–d). Enrichment analysis highlights synapse-associated and cytoskeleton-related programs, including synapse and cytoskeletal protein binding, consistent with extensive evidence that early AD pathogenesis involves synaptic vulnerability and disrupted actin/cytoskeletal regulation [36–38]. Calcium-related molecular functions and pathways are also well represented, including calcium ion transmembrane transporter activity, high voltage-gated calcium channel activity, and Calcium signaling pathway, supporting the long-standing Ca^2+^ dysregulation framework in AD and implicating VGCC/NMDAR-linked signaling as a mechanistic contributor to synaptic failure and neurotoxicity [39–41]. Consistently, synaptic transmission terms (Transmission across Chemical Synapses and Unblocking of NMDA receptors, glutamate binding and activation) align with prior work linking aberrant glutamatergic/NMDAR signaling to AD synaptic dysfunction [42, 43]. Trafficking-related biology is strongly supported by enrichment of endocytosis, vesicle-mediated transport in synapse, Endocytosis, and Membrane Trafficking, supported by evidence that endocytosis and vesicle recycling defects are early and mechanistically relevant to AD [44, 45]. Moreover, enrichment of autophagy and lysosome aligns with reports of autophagy–lysosome dysfunction in AD [46, 47]. Finally, kinase-centric and signaling-related terms (kinase activity and intracellular signal transduction) are consistent with evidence that dysregulated neuronal signaling cascades contribute to AD progression [48, 49], suggesting that the inferred signaling pathways form a compact, connected network consistent with reference neurodegeneration/AD pathway frameworks.
Figure 5e–g show the inferred LUAD targets and signaling networks network and its functional enrichment. The enrichment results are dominated by antigen presentation and immune effector programs, with additional checkpoint and growth/angiogenesis signaling. Enrichment of antigen processing/presentation pathways and functional categories (e.g., peptide antigen binding, TAP binding, and Antigen processing and presentation) is consistent with established mechanisms by which altered HLA/APM function shapes tumor immune evasion and immunotherapy responsiveness in lung cancer [50, 51]. In parallel, enrichment for cytotoxic immune pathways (Natural killer cell mediated cytotoxicity and T cell receptor signaling) aligns with the central role of NK/T-cell–mediated tumor surveillance and effector function within the LUAD tumor microenvironment [52]. Checkpoint and cytokine signaling (PD-L1 expression and PD-1 checkpoint pathway in cancer together with Interferon gamma signaling) aligns with prior evidence that IFN- –driven activation programs can also promote adaptive immune resistance via PD-L1 upregulation [53–55]. Finally, enrichment of Signaling by Receptor Tyrosine Kinases and the VEGFA–VEGFR2 Pathway is compatible with canonical growth and angiogenesis programs in NSCLC, which are also known to interact with immune regulation in the tumor microenvironment [56–58].
Across diseases, the importance score distributions highlight biologically coherent modules, suggesting that the attention-derived, PPI-constrained geometry together with feature-scaled importance scores helps capture disease-associated signaling patterns that are compact, connected, and functionally interpretable.
Dicussion
3
Tissue-level and single-cell omic resources are being generated at unprecedented scale to interrogate disease pathogenesis—the core of precision medicine. Graph neural networks (GNNs) have been widely used to integrate molecular measurements with interaction knowledge for target identification and pathway inference [34, 59–61]. Nevertheless, despite strong predictive performance, prevailing graph-based approaches that operate on numeric, expression-centric signaling graphs capture only part of the scientific discovery workflow: they often underutilize the rich, human-interpretable priors encoded in biological text and curated knowledge bases. To address this gap, this work introduces a three-part ecosystem that unifies data, tooling, and modeling. First, OmniCellTOSG is a large-scale single-cell text–omic signaling graph resource whose Text–Omic Signaling Graphs (TOSGs) couple text-attributed biological knowledge with numerical gene/protein abundance, enabling graph-structured decoding of cellular signaling across tissues, diseases, ages, and conditions. Second, CellTOSG_Loader provides a NumPy-ready query–load–balance pipeline that constructs stratified, unbiased cohorts across user-specified facets (e.g., cell type, tissue, disease, data source, age/sex), mitigates class imbalance, and applies batch correction, standardizing ingestion and ensuring fair, repeatable experimentation. Third, the CellTOSG Foundation Model (CellTOSG-FM) integrates a graph-language architecture that jointly encodes biological textual priors, quantitative omic measurements, and signaling topology over TOSGs. By enabling message passing on knowledge-grounded graphs while aligning cross-modal representations, the model learns structure-aware embeddings that support downstream tasks including cell-type annotation, cell-attribute classification, and signaling inference with interpretable subgraph rationales.
These properties position TOSGs as a natural substrate for foundation-model training, enabling the learning of broadly transferable model of cellular signaling. Pretraining CellTOSG-FM on massive, heterogeneous TOSG corpora from OmniCellTOSG via self-supervised learning, which leverages edge-masked reconstruction that emphasizes signaling network structure, yields broadly transferable model of signaling patterns and provides robust bases for task-specific adaptation, outperforming disease- or dataset-specific pipelines that risk bias and overfitting. The OmniCellTOSG dataset is openly available in a PyTorch-friendly format, lowering barriers to reproducible benchmarking and catalyzing community development of graph language foundation models for precision medicine over cellular systems. Together, OmniCellTOSG, CellTOSG_Loader, and CellTOSG-FM establishes a scalable, mechanism-oriented framework for cell type annotation, disease classification, subtype delineation, and targets and signaling networks graph inference. Ongoing curation continues to expand its coverage across diseases, tissues, sex, age, and diverse experimental conditions, enabling improved interrogation of complex signaling programs and the prioritization of actionable perturbations, including candidate drugs and rational combinations that target dysfunctional nodes and pathways.
Methods
4
Data Collection and Preprocessing
4.1
The dataset was compiled primarily from three large sources, with additional cohorts added to broaden tissue and disease coverage (collection procedures in Section 5.1). From CellxGene, we obtained over 71 million single cells/nuclei across 65 human tissues and 125 disease studies in H5AD AnnData format [1, 2]; from the Brain Cell Atlas, over 7 million human brain single cells spanning 21 disease types [4, 62]; and from GEO, four studies contributing over 650,000 cells to fill underrepresented indications. We further integrated the Hepatitis Atlas to include hepatitis C virus infection data (over 7,000 cells) and the Human Cell Atlas to expand pancreas coverage with over 98,000 cells and three additional disease conditions. All datasets were converted to a unified H5AD schema to support a standardized preprocessing workflow. The resulting preprocessed resource comprises over 79 million high-quality cells covering 762 cell types, with samples organized and split by source, coarse- and fine-grained tissue labels, disease, and suspension type.
To mitigate the inherent sparsity and noise in sc/snRNA-seq data, we adopt a meta-cell strategy based on the SEACells algorithm[19]. Our approach is designed to ensure consistency across datasets from diverse sources by employing uniform preprocessing, feature selection, and dimensionality reduction procedures before meta-cell aggregation. Let the raw data be represented by , where denotes the cell, and is the number of cells collected from various data resources and is the number of elements in transcript entity set . For computational demands, raw data files (stored in H5AD format) are partitioned into subsets of no more than 50,000 cells. For datasets requiring normalization, we first apply total count normalization by scaling UMI counts of each cell to a fixed total of 10,000, followed by a log1p transformation to stabilize variance. In addition, Uniform feature selection is performed by identifying the top 1,500 highly variable genes from each dataset. We then apply Principal Component Analysis (PCA[63]) with 50 components to reduce dimensionality while preserving essential variance. Based on the PCA-reduced features, a K-Nearest Neighbor (KNN[64]) graph is constructed to maintain the underlying structural relationships among cells. Meta-cell generation is performed using the SEACells algorithm. With a fixed aggregation size of cells per meta cell, SEACells first measures cell-to-cell similarity and then decomposes the resulting structure via archetypal analysis. Cells near the convex hulls of the data distribution are grouped together, yielding a new set of meta cells denoted by , where represents a meta-cell.
Correspondingly, the associated attributes (e.g., sex, cell_type, development_stage, tissue, disease_status, etc.) for the meta cells are computed by aggregating the raw cell attributes through majority voting, resulting with (see Figure 1a). After all meta-cells are constructed, we group the data by source and tissue (coarse-grained), read the corresponding meta-cell H5AD files, and map gene identifiers to the BioMedGraphica framework (see Section 4.2). The expanded expression matrices are then serialized into NumPy shards of 10,000 samples per file. For each sample, we record the relative path of the NumPy matrix (matrix_file_path) and the corresponding row index in the matrix file (matrix_row_idx); these pointers are stored together with the H5AD .obs fields and saved in CSV format for downstream processing. Because the datasets originate from diverse sources, the nomenclature of attributes such as cell type, disease, development stage, and sex varies substantially. To standardize cell type, we built a mapping pipeline using the unique cell type values extracted from all datasets and a cell type mapping table (CMT) based on Cell Ontology (CL) database. All synonym fields in the cell type mapping table were expanded to generate a candidate dictionary linking each synonym to its CMT ID and the corresponding CL label. Before matching, anchor rules were defined to handle specific terms and generic placeholders such as “unknown” or “unclassified” were ignored. For every original cell type, exact matching was first attempted; if not found, fuzzy matching (token-sort ratio) was applied to compute the best-scoring candidate. The resulting pairs contained the original term, the matched CMT term, CMT ID, CL label, and the matching score. Terms with scores lower than 100 were manually reviewed and corrected, producing a curated mapping that unified all cell type under the Cell Ontology standard. The same procedure was used for disease terms to obtain BMG disease identifiers.
The development stage values were normalized by converting free-text descriptions into approximate numeric ages (in years) using regular-expression parsing of units such as years, months, weeks, days, and Carnegie stages, followed by categorization into MeSH-based age groups (e.g., infant, child, young adult, middle aged)[65]. Each entry was also assigned a coarse birth phase label (pre-birth, post-birth, or unknown). For sex, all terms were normalized through direct mapping (e.g., f for female, m for male), with unrecognized or empty values set to unknown. Finally, by integrating the metadata with the curated mapping results, we obtain the standardized attribute set (Table S2). This set serves as retrieval metadata for CellTOSG_Loader to locate and extract the corresponding cells (Section 4.3).
OmniCellTOSG Generation
4.2
With the preprocessed single-cell transcriptomic dataset denoted as , we integrate it into the BioMedGraphica framework together with the gene-regulatory network. Using the mapping table, the transcript features are mapped to transcript entities. Specifically, each transcript element in the set is mapped and extended to the transcript-entity set . By linking transcript nodes within the network to the protein–protein interaction (PPI) graph, proteins are treated as virtual nodes, yielding the additional entity set . The overall entity set is , with . Likewise, the feature set is generated, where , and correspond to the transcriptomic and proteomic feature sets, respectively.
From the perspective of single cell side, the multi-omics can be decomposed as , where each sample resides in . Additionally, the cell label matrices set , and given that the cell label set are consistent with label for meta cells, . Beyond transcriptomic features and virtual proteomic features, an auxiliary node textual information dataset, , is incorporated. Each of those entity textual information correpsonds to the node in entity set . The , representing the entity names (e.g., HGNC symbol, Ensembl ID), , representing the entity textual descriptions (e.g., Uniprot protein description), and , representing biochemical information (i.e., RNA sequences or protein sequences). Therefore, for any entity, , it has the textual information set . And the entity textual information dataset, , enhances the graph’s expressivity, facilitating the generation of a textual-attributed transcriptomic signaling knowledge graph.
Afterwards, to construct the text–omic signaling graph , we identify relations (edges) between entities. As noted above, the vertex set is . We consider two relation types: internal signaling and PPI-based gene-regulatory signaling. Accordingly, the graph decomposes into the internal-signaling subgraph , which captures the molecular flow from transcripts to proteins, and the PPI-regulatory subgraph , which captures protein–protein interactions, with the overall edge set . By construction, with , while .
Overall, the pipeline condenses 79,195,364 raw cells into meta-cells and aligns molecular entities (transcript/protein nodes) enriched with textual and topological information with internal signaling edges with and PPI-regulatory subgraph with . Building on these components, we fuse preprocessed single-cell transcriptomic profiles with prior gene–regulatory and signaling knowledge to construct TOSGs, and we release the dataset . The resulting TOSGs provide a unified, graph-structured substrate for foundation-model pretraining and downstream tasks by coupling numeric omics measurements with textual and topological knowledge, thereby enabling structure-aware learning and interpretable signaling analysis. A comprehensive summary of organ and disease coverage is provided in Table S3.
CellTOSG_Loader Package
4.3
To enable scalable access, the feature matrix in OmniCellTOSG is partitioned row-wise into fixed-size NumPy shards, each stored as an x.npy file with lightweight metadata recording global row indices. After downloading the dataset to a local root, users employ CellTOSG_Loader (Appendix 5.3.1), which discovers the relevant shards and materializes only the requested subset, thereby avoiding full-matrix loads. Cohorts are defined via standardized metadata filters conditions (e.g., {tissue_general: brain, disease_name: Alzheimer’s Disease}) expressed over the attribute set ; the supervised objective and target field are designated by task and label_column, respectively. Given these inputs, the loader deterministically translates user arguments into a formal query, extracts the feasible subset from , and optionally applies subsampling (to optimize memory usage), class balancing, and batch correction via ComBat–seq [25, 26]. At the core of retrieval is a two-phase Stratified Retrieval Algorithm (SRA) tailored to and its named attributes as meta data. In Phase I (query-constrained extraction), a user’s conjunctive query yields the subset by enforcing that all specified attribute constraints hold. In Phase II (task-aware balancing), a task configuration specifies the balance label and control value, the exact-match covariates, and an ordered age–stage key; cases are taken from (non-control label), controls are drawn by reapplying the same filters with the label fixed to the control value, and key-stratified matching is performed: non-age covariates must match exactly within each stratum, while age differences are bounded by a tolerance along the ordered age–stage axis (with optional upsampling and discarding infeasible strata). The outcome is a stratified cohort whose non-stage covariates are balanced by construction and whose age-stage offsets satisfy . For cell-type annotation, balancing is disabled and rare types ( samples) may be upsampled for training stability. The procedure simultaneously returns the label set aligned with . The Stratified Retrieval Algorithm is described in detail below.
In the Query-Constrained Extraction phase, we let be the set of samples and the set of attributes. For each with value space , define the attribute–evaluation map . Equivalently, collect these into a single evaluation map via . A user query is a finite set of attribute constraints with nonempty admissible sets and , interpreted by the conjunctive predicate
where equals 1 when the statement is true and 0 otherwise. The selection induced by the query is the feasible set
i.e., the set of samples that satisfy all attribute-wise constraints simultaneously.
During Phase II of Task-Aware Balancing, we Let the task configuration be , where is the balance field with control value , and is the ordered tuple of match keys with designated age–stage key for some index . Define the query with any constraint on removed by
The control pool applies the same non- filters and overwrites to its control value:
The case set is drawn from the query subset:
Let and define the key map
For any key tuple , define the strata
Endow with a rank map and induced distance
Given a tolerance , the offset-admissible control pool for stratum is
Matching proceeds per stratum by first taking exact-stage controls ; if insufficient, progressively admitting offsets according to the stage order, sampling without replacement within each offset layer, and finally (if enabled) upsampling with replacement from the collected controls to meet the stratum size. Strata with no admissible controls are discarded. Let and denote the retained case rows and matched controls for successful strata. The balanced output is
where is the set of strata that achieved feasible matching. Within each retained stratum, all non-stage keys in match exactly by construction, and stage differences satisfy . Finally, the task-specific labels are obtained by restricting the label map to the returned cohort, i.e., , which is one-to-one aligned with the rows. The details of the algorithm can be checked in the Appendix 5.3.2.
Graph Language Foundation Model
4.4
CellTOSG-FM pretraining
4.4.1
Given the integrated text–omic signaling graph dataset , which comprises a single-cell text–omic signaling graph together with text–omic feature sets and , we construct a self-supervised pretraining task by sampling a subset . For edge-masking pretraining, we draw an edge mask set over the protein–protein interaction subset , where denotes the masking ratio used to occlude signaling flow along PPI edges. The foundation model is then pretrained via
, where is the entity embeddings, and is the pre-trained foundation model. In details, to merge the text-omics feature sets into unified entity embeddings, bi-encoder framework was leveraged by
, where the is the linear transformation and the can be BERT-based or other language models (LMs) and , where are encoded as entity name, textual description and biochemical embeddings. The will fuse the omic embeddings and the textual embeddings to generate .
Afterwards, the internal signaling will be propagated by using graph encoder with
, where uses message propagation via GNN layers and . Finally, with the prepared entity embedding, the foundation model will be pretrained by masking nodes with
, where denotes global signaling message propagation via GNN layers and represents the multilayer perceptron to embed the node features before the final stage of pretraining.
To update the pretraining model parameters, we adopt a reconstruction objective. Specifically, we define the set of visible edges as . The structural decoder with parameters outputs the probability of an edge between nodes and as
where are the global node embeddings for nodes and in sample ; MLP denotes a multilayer perceptron and the element-wise product. In parallel, the degree decoder with parameters predicts the node degree via
We compute the pretraining reconstruction loss as the sum of an edge-reconstruction term and a degree-reconstruction term, with and a set of sampled negative edges drawn from pairs not in . The binary cross-entropy edge losses for sample are
and the degree-reconstruction loss is
where denotes the degree of node for sample cell in the graph .
With optional weights , the per-sample objective and the dataset-averaged pretraining loss are
This objective jointly optimizes the encoder and both decoders to reconstruct the masked PPI topology and node degrees from , thereby aligning the learned embeddings with the global signaling structure.
Model Downstream Tasks
4.5
Ultimately, the objective is to use the pretrained foundation model, , that synergistically integrates the task specific incoming feature set , node descriptions , and graph topology to predict cell-specific outcomes. As to the unsupervised task, the latent embedding for the incoming feature set will be generated by
, where . For supervised learning, the foundation model will predict the cell outcomes by
, where represents the predicted cellular states, which depends on specific downstream tasks (e.g., cell type annotations or celluar condition (normal vs. disease)) and is the downstream decoder, which contains following components of encoders,
, where and represent the downstream omic encoder and downstream cross-modality encoder, and are downstream internal signaling and global signaling message propagation via GNN layers and is the linear classifier. To infer the cell-specific targets and signaling networks network for sample , an affinity matrix for measuring the attention-based edge weight will be derived from , the latent space embeddings of sample , by
, where is an attention-based similarity function. To restrict the edge weights to the protein–protein interaction (PPI) topology, we use
, where is the transformation function to trun the adjacency matrix to edge pairs in dimensions of by 2 and are cell specific PPI edge weights for sample . To calculate the node importance score for certain node in the sample , we integrate the averaged bi-directional edge weights and omic values by
, where are node importance score for node in sample . Finally, a core-extraction routine will select a compact subgraph by
, where the operator ranks nodes by to retain the top nodes and applies a branch-pruning procedure [35] to keep the top edges in , yielding the core subgraph .
Supplementary Material
1
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Megill C., Martin B., Weaver C., Bell S., Prins L., Badajoz S., Mc Candless B., Pisco A.O., Kinsella M., Griffin F., : Cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. Bio Rxiv, 2021–04 (2021)
- 2Program C.C.S., Abdulla S., Aevermann B., Assis P., Badajoz S., Bell S.M., Bezzi E., Cakir B., Chaffer J., Chambers S., : Cz cellxgene discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Research 53(D 1), 886–900 (2025)
- 3Rood J.E., Wynne S., Robson L., Hupalowska A., Randell J., Teichmann S.A., Regev A.: The human cell atlas from a cell census to a unified foundation model. Nature, 1–2 (2024)
- 4Chen X., Huang Y., Huang L., Huang Z., Hao Z.-Z., Xu L., Xu N., Li Z., Mou Y., Ye M., : A brain cell atlas integrating single-cell transcriptomes across human brain regions. Nature Medicine 30(9), 2679–2691 (2024)
- 5Miller J.A., Hawrylycz M.J., Aitken M., Ariza J., Chakrabarty R., Ding S.-L., Ding Y., Ferrer R., Goldy J., Gratiy S., : Sea-ad: Scientific analysis and open access resources targeting early changes in alzheimer’s disease. Alzheimer’s & Dementia 19, 063478 (2023)
- 6Mathys H., Peng Z., Boix C.A., Victor M.B., Leary N., Babu S., Abdelhady G., Jiang X., Ng A.P., Ghafari K., : Single-cell atlas reveals correlates of high cognitive function, dementia, and resilience to alzheimer’s disease pathology. Cell 186(20), 4365–4385 (2023)37774677 10.1016/j.cell.2023.08.039PMC 10601493 · doi ↗ · pubmed ↗
- 7Heimberg G., Kuo T., De Pianto D.J., Salem O., Heigl T., Diamant N., Scalia G., Biancalani T., Turley S.J., Rock J.R., : A cell atlas foundation model for scalable search of similar human cells. Nature, 1–3 (2024)
- 8Theodoris C.V., Xiao L., Chopra A., Chaffin M.D., Al Sayed Z.R., Hill M.C., Mantineo H., Brydon E.M., Zeng Z., Liu X.S., : Transfer learning enables predictions in network biology. Nature 618(7965), 616–624 (2023)37258680 10.1038/s 41586-023-06139-9PMC 10949956 · doi ↗ · pubmed ↗
