ST-Align: A Multimodal Foundation Model for Image-Gene Alignment in Spatial Transcriptomics
Yuxiang Lin, Ling Luo, Ying Chen, Xushi Zhang, Zihui Wang, Wenxian Yang, Mengsha Tong, Rongshan Yu

TL;DR
ST-Align is a novel multimodal foundation model that aligns pathological images with genomic data in spatial transcriptomics by incorporating spatial context and multi-scale alignment, improving analysis and reducing costs.
Contribution
It introduces the first foundation model for spatial transcriptomics that deeply integrates image and gene data with spatial context through a novel pretraining framework.
Findings
Outperforms existing methods in zero-shot and few-shot tasks
Pretrained on 1.3 million spot-niche pairs
Enhances understanding of tissue architecture
Abstract
Spatial transcriptomics (ST) provides high-resolution pathological images and whole-transcriptomic expression profiles at individual spots across whole-slide scales. This setting makes it an ideal data source for developing multimodal foundation models. Although recent studies have attempted to fine-tune visual encoders together with trainable gene encoders at the spot level, the absence of a wider slide-level perspective and of intrinsic spatial relationships limits their ability to capture ST-specific insights effectively. Here, we introduce ST-Align, the first foundation model designed for ST that deeply aligns image-gene pairs by incorporating spatial context, effectively bridging pathological imaging with genomic features. We design a novel pretraining framework with a three-target alignment strategy for ST-Align, enabling (1) multi-scale alignment across image-gene pairs, capturing both spot- and…
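To make the idea of multi-scale image-gene alignment more concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss, applied once to spot-level pairs and once to niche-level pairs. It is an illustration only, not ST-Align's actual objective: the abstract excerpt above does not specify the loss, and the function names, the temperature of 0.07, and the weighting parameter `w_spot` are all assumptions introduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_contrastive_loss(img_emb, gene_emb, temperature=0.07):
    # Row i of img_emb and row i of gene_emb are assumed to be a matched pair.
    img = l2_normalize(img_emb)
    gene = l2_normalize(gene_emb)
    logits = img @ gene.T / temperature
    idx = np.arange(len(img))
    loss_i2g = -log_softmax(logits)[idx, idx].mean()    # image -> gene direction
    loss_g2i = -log_softmax(logits.T)[idx, idx].mean()  # gene -> image direction
    return 0.5 * (loss_i2g + loss_g2i)

def multi_scale_loss(spot_img, spot_gene, niche_img, niche_gene, w_spot=0.5):
    # Hypothetical combination: weighted sum of spot-level and niche-level terms.
    spot_term = symmetric_contrastive_loss(spot_img, spot_gene)
    niche_term = symmetric_contrastive_loss(niche_img, niche_gene)
    return w_spot * spot_term + (1.0 - w_spot) * niche_term
```

In this sketch, each spot's image and gene embeddings form a positive pair and all other spots in the batch act as negatives; the niche-level term does the same for embeddings that summarize a spot's spatial neighborhood. How those niche embeddings are built is left open here.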
Taxonomy
Topics: Single-cell and spatial transcriptomics
