Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology
Minghao Han, Dingkang Yang, Linhao Qu, Zizhi Chen, Gang Li, Han Wang, Jiacong Wang, Lihua Zhang

TL;DR
This paper introduces STAMP, a novel framework that integrates spatial transcriptomics with pathology images to improve multimodal learning, leveraging large-scale spatial gene expression data for better molecular and spatial understanding.
Contribution
The paper presents STAMP, a new spatially-aware multimodal learning method that combines gene expression profiles with pathology images, supported by the largest spatial transcriptomics dataset to date.
Findings
STAMP outperforms existing models on multiple datasets
Spatial context and multi-scale information improve performance
Gene-guided training enhances representation robustness
Abstract
Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest…
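To make the idea of a joint embedding of image patches and gene expression more concrete, the sketch below computes a symmetric contrastive (InfoNCE-style) loss between paired image and gene embeddings. This is an illustrative assumption, not STAMP's confirmed objective: the abstract does not state the loss, and the function names, the temperature of 0.07, and the toy embeddings are all hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, gene_emb, temperature=0.07):
    """Hypothetical image-gene alignment loss; NOT taken from the paper.

    Row i of image_emb and row i of gene_emb are assumed to come from the
    same spot; every other row in the batch acts as a negative.
    """
    img = [l2_normalize(v) for v in image_emb]
    gene = [l2_normalize(v) for v in gene_emb]
    # Cosine similarity between every image and every gene profile.
    logits = [
        [sum(a * b for a, b in zip(u, w)) / temperature for w in gene]
        for u in img
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-gene and gene-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy data: 8 spots with 16-dimensional embeddings (made-up numbers).
rng = random.Random(0)


def rand_vecs(n, d):
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]


gene_emb = rand_vecs(8, 16)
aligned_img = [[g + 0.01 * rng.gauss(0, 1) for g in v] for v in gene_emb]
random_img = rand_vecs(8, 16)

loss_aligned = symmetric_contrastive_loss(aligned_img, gene_emb)
loss_random = symmetric_contrastive_loss(random_img, gene_emb)
print(loss_aligned < loss_random)
```

Image embeddings that closely track their paired gene embeddings yield a lower loss than unrelated ones, which is the behavior an alignment objective of this kind rewards.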
Peer Reviews
Decision: ICLR 2026 Poster
- This is a good resource and benchmark paper for researchers working in pathology-genomics pretraining. SpaVis-6M is a good contribution, as are the pretrained weights and code used to pretrain on this dataset.
- Experimental design is overall strong, with diverse breadth of tasks and comparisons. Table 6 and Table 7 are important ablations, which respectively show (1) loss objectives used in STAMP improve multimodal pretraining performance, (2) performance gain of pretraining on SpaVis-6M
- What is the statistical significance of STAMP's improvement? In Table 2, the standard deviation of performance is enormous for the PSC, HHK, HER2+ tasks. For the HER2+ task, most models have an average MSE of ~0.9 with a standard deviation of ~0.45.
- Is there a reason why STAMP was not evaluated on the HEST benchmark?
- There are many works looking at multimodal alignment of pathology and ST. While STAMP was compared against BLEEP and mclSTExp, it is missing many other comparisons such as HisTo
I believe this is a novel training strategy for a patch-level encoder. The technique for providing summary tokenization for tissue regions is very interesting and likely to be a path forward for more ST-based vision models. While I was skeptical, the experiment showing that the model has comparable or superior performance as a feature extractor for slide-level biomarker prediction is fascinating. The imaging technique and quality from ST is quite different from standard WSI. While the LUAD mut
ST is not widely available, and different platforms likely have different artifacts. This manuscript, while a good start, does not conclusively determine that the performance gained from using data from this very spatially detailed assay outweighs the performance that can be gained on much more widely available data types. The encoder training is borrowed heavily from other VLM and VGM work, so it is not novel. Thus this is a smart re-implementation but not a novelty. In Figure 3, there is a large discrepanc
- The paper is well written and well structured.
- For both modalities, histopathology & (spatial) transcriptomics, a sufficient number of foundation models were benchmarked to set the STAMP model into context. When possible, STAMP was evaluated in both the uni-modal and the multi-modal setting to better compare it to current uni-modal FMs
- The dataset addresses current shortcomings of sparse and well structured, coherent ST data
- The training setup is sophisticated and the single loss te
Weaknesses/Questions:
- The Gene Encoder architecture is not explained in the main text. It is not directly clear how the Embedding(T_i) in Eq. 2 is being inferred or how it is masked.
- The drawbacks of imputing missing genes in the SpaVis-6M dataset are not discussed. To what extent would this affect the training of the models? Does this introduce batch effects w.r.t. the input of the Gene Encoder or the target predictions? How would this impact the evaluation if a gene is predicted wi
Taxonomy
Topics: Single-cell and spatial transcriptomics · AI in cancer detection · Domain Adaptation and Few-Shot Learning
