# Embeddings from language models are good learners for single-cell data analysis

**Authors:** Tianyu Liu, Tianqi Chen, Wangjie Zheng, Xiao Luo, Yiqun Chen, Hongyu Zhao

PMC · DOI: 10.1016/j.patter.2025.101431 · Patterns · 2026-01-30

## TL;DR

scELMo uses text embeddings from large language models to analyze single-cell data, supporting tasks such as cell clustering, batch effect correction, cell-type annotation, and in silico treatment analysis with lower resource requirements.

## Contribution

scELMo uses LLMs to generate descriptions of metadata and embeddings of those descriptions, combines the embeddings with raw single-cell data in a zero-shot setting, and adds a fine-tuning framework for more challenging tasks.

## Key findings

- scELMo achieves cell clustering, batch effect correction, and cell-type annotation without training a new model.
- The method supports in silico treatment analysis and perturbation modeling through a fine-tuning framework.
- scELMo has a lighter structure and lower resource requirements, and offers a scalable, interpretable framework for single-cell data analysis.

## Abstract

Foundation models (FMs) have been built to analyze single-cell data with different degrees of success. Here, we present scELMo (single-cell embedding from language models), a method for analyzing single-cell data with the help of large language models (LLMs). LLMs can generate both the description of metadata information and the embeddings for such descriptions. We then combine the embeddings from LLMs with the raw data under the zero-shot learning framework, and further extend its function by using the fine-tuning framework to handle different tasks. We demonstrate that scELMo is capable of cell clustering, batch effect correction, and cell-type annotation without training a new model. Moreover, the fine-tuning framework of scELMo can help with more challenging tasks, including in silico treatment analysis or modeling perturbation. scELMo has a lighter structure and lower requirements for resources, suggesting a more promising path.
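The zero-shot step described in the abstract, combining LLM embeddings with the raw data, can be illustrated with a minimal sketch. It assumes the combination is an expression-weighted average of per-gene embeddings, which is one plausible reading and is not confirmed by this summary. The function name `cell_embeddings` and the random placeholder embeddings are hypothetical; in practice the gene embeddings would come from an LLM embedding of each gene's text description.

```python
import numpy as np


def cell_embeddings(expr, gene_emb):
    """Return one embedding per cell as an expression-weighted average.

    expr:     (n_cells, n_genes) non-negative expression matrix
    gene_emb: (n_genes, dim) matrix with one embedding per gene
    NOTE: the weighted average is an assumption made for illustration.
    """
    expr = np.asarray(expr, dtype=float)
    gene_emb = np.asarray(gene_emb, dtype=float)
    if expr.shape[1] != gene_emb.shape[0]:
        raise ValueError("expression columns must match embedding rows")
    totals = expr.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # cells with no counts stay as zero vectors
    return (expr @ gene_emb) / totals


# Placeholder for LLM-derived gene embeddings (hypothetical values).
rng = np.random.default_rng(0)
gene_emb = rng.normal(size=(4, 8))

# Three toy cells over four genes; the last cell has no counts.
expr = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
cells = cell_embeddings(expr, gene_emb)
```

Because no parameters are learned, the resulting cell embeddings can be passed directly to standard clustering or visualization tools, which matches the "without training a new model" framing of the abstract.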

- scELMo leverages large language models to integrate knowledge
- scELMo supports multiple settings
- scELMo achieves superior performance with fewer resources
- scELMo establishes a scalable and interpretable framework

Single-cell technologies allow scientists to measure the activity of thousands of genes in individual cells, revealing how tissues develop, age, and respond to disease. Yet, analyzing these massive datasets often demands substantial computing resources and specialized expertise. Our method, single-cell embedding from language models (scELMo), offers an accessible solution by harnessing large language models—the same artificial intelligence systems behind modern chatbots—to interpret biological information. Instead of training large models from scratch, scELMo uses pre-trained language models’ knowledge of gene functions and biological concepts to generate detailed numerical representations, or embeddings, of genes. These embeddings capture complex biological information and can be integrated with cellular data to facilitate tasks such as identifying cell types, understanding developmental processes, or exploring disease mechanisms. By uniting advances in computational linguistics and genomics, scELMo transforms language models into engines of biological discovery, expanding access to powerful single-cell analysis tools and accelerating the pace of biomedical insight.

scELMo introduces a way to analyze massive single-cell datasets by harnessing large language models to summarize biological knowledge about each gene. These summaries are transformed into mathematical embeddings that integrate with cell data, enabling efficient cell grouping, batch correction, and treatment prediction. By merging language understanding with biological data, scELMo reduces computational demands and democratizes advanced analysis—offering a faster, more accessible path to discoveries that could inform new therapies and deepen understanding of human health.
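To make the idea of identifying cell types from embeddings concrete, here is a minimal sketch of label transfer by nearest centroid under cosine similarity. This summary does not state which classifier scELMo uses, so the method, the function names `l2_normalize` and `annotate_by_centroid`, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return x / norms


def annotate_by_centroid(ref_emb, ref_labels, query_emb):
    """Give each query cell the label of the most similar reference centroid.

    NOTE: nearest-centroid matching is an illustrative assumption,
    not the classifier reported in the paper.
    """
    ref_emb = l2_normalize(np.asarray(ref_emb, dtype=float))
    query_emb = l2_normalize(np.asarray(query_emb, dtype=float))
    ref_labels = np.asarray(ref_labels)
    classes = np.unique(ref_labels)
    # One centroid per cell type, averaged over the labeled reference cells.
    centroids = np.stack(
        [ref_emb[ref_labels == c].mean(axis=0) for c in classes]
    )
    # Cosine similarity between every query cell and every centroid.
    sims = query_emb @ l2_normalize(centroids).T
    return classes[sims.argmax(axis=1)]


# Toy reference cells with known labels (hypothetical values).
ref_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_labels = ["T cell", "T cell", "B cell", "B cell"]
query_emb = [[0.8, 0.2], [0.2, 0.8]]

predicted = annotate_by_centroid(ref_emb, ref_labels, query_emb)
```

The only fitted quantities are per-type mean vectors, so the sketch stays within the low-resource setting that the summary emphasizes.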

## Full-text entities

- **Genes:** EXT1 (exostosin glycosyltransferase 1) [NCBI Gene 2131] {aka EXT, LGCR, LGS, TRPS2, TTV}, GSN (gelsolin) [NCBI Gene 2934] {aka ADF, AGEL, AMYLD4}, TTTY10 (testis expressed transcript, Y-linked 10) [NCBI Gene 246119] {aka NCRNA00133, TTY10, lnc-KDM5D-4}, ANKRD1 (ankyrin repeat domain 1) [NCBI Gene 27063] {aka ALRP, C-193, CARP, CVARP, MCARP, bA320F15.2}, CPA1 (carboxypeptidase A1) [NCBI Gene 1357] {aka CPA}, ATP6 (ATP synthase F0 subunit 6) [NCBI Gene 4508] {aka ATPase6, MTATP6}, GPT2 (glutamic--pyruvic transaminase 2) [NCBI Gene 84706] {aka ALT2, GPT 2, MRT49, NEDSPM}, PALM (paralemmin) [NCBI Gene 5064] {aka PALM1}, NPPB (natriuretic peptide B) [NCBI Gene 4879] {aka BNP, Iso-ANP}
- **Diseases:** OOD (MESH:D020243), PC (MESH:C566443), immune dysregulation (OMIM:614878), HCM (MESH:D000092183), cardiomyopathy (MESH:D009202), LLMs (MESH:D007806), ascending aortic aneurysm (MESH:D000094625), DCM (MESH:D002311), aneurysm (MESH:D000783), neurodegeneration (MESH:D019636), hallucinations (MESH:D006212)
- **Chemicals:** CINEMA-OT (-), OT (MESH:C013307)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** S13D

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12921509/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12921509/full.md

## References

115 references — full list in the complete paper: https://tomesphere.com/paper/PMC12921509/full.md

---
Source: https://tomesphere.com/paper/PMC12921509