Semantic Context-aware mOdality fUsion Transformer (SCOUT): A Context-Aware Multimodal Transformer for Concept-Grounded Pathology Report Generation
Suryakant Singh, Saarthak Kapse, Joel Saltz, Prateek Prasanna

TL;DR
SCOUT is a novel multimodal transformer framework that integrates visual and semantic information for more accurate and clinically coherent pathology report generation from whole-slide images.
Contribution
SCOUT introduces a context-aware, concept-grounded multimodal approach that progressively refines image representations using global (slide-level) and semantic information during report generation.
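The progressive conditioning described above can be illustrated with a minimal sketch. This is not SCOUT's published architecture. The following are all assumptions made for illustration: the function names (`cross_attend`, `progressive_condition`), the two-stage order (slide-level context first, then concept tokens), the mean-pooled slide token, and the omission of learned query/key/value projections.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context, scale=None):
    """Single-head scaled dot-product cross-attention with a residual update.

    queries: (n, d) patch tokens; context: (m, d) conditioning tokens.
    Learned Q/K/V projections are omitted for brevity (an assumption).
    """
    if scale is None:
        scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T * scale, axis=-1)  # (n, m), rows sum to 1
    return queries + weights @ context  # residual: refine the tokens, don't replace them


def progressive_condition(patch_tokens, slide_token, concept_tokens):
    """Hypothetical two-stage refinement: global slide context, then semantic concepts."""
    h = cross_attend(patch_tokens, slide_token[None, :])  # stage 1: slide-level context
    h = cross_attend(h, concept_tokens)                   # stage 2: concept grounding
    return h


# Toy usage with random features (dimensions are arbitrary).
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 32))   # 16 patch embeddings of width 32
slide = patches.mean(axis=0)          # hypothetical mean-pooled slide token
concepts = rng.normal(size=(5, 32))   # 5 hypothetical concept embeddings
out = progressive_condition(patches, slide, concepts)
print(out.shape)  # (16, 32)
```

Running the concept stage after the global stage is one way to read the "progressive" wording. In a trained model, each stage would typically have its own learned projections, normalization, and feed-forward layers.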
Findings
SCOUT outperforms existing models on multiple datasets in terms of BLEU, METEOR, and ROUGE-L scores.
SCOUT integrates local histological features with slide-level context and semantic descriptors.
The framework produces more clinically coherent pathology reports.
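ROUGE-L, one of the metrics cited above, scores a generated report by the longest common subsequence (LCS) of tokens it shares with the reference. It combines the resulting precision and recall into an F-score. The following is a minimal sketch that assumes lowercase whitespace tokenization and β = 1. Published toolkits differ in tokenization, and some weight recall more heavily with a larger β, so numbers from this sketch are not comparable to those reported for SCOUT. The example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b (O(len(a)*len(b)) DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score.

    Assumes lowercase whitespace tokenization and beta=1.0; real toolkits
    tokenize differently and may use a larger beta to favour recall.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical report fragments, not taken from the paper's datasets.
reference = "invasive ductal carcinoma with focal necrosis"
candidate = "invasive carcinoma with necrosis"
print(round(rouge_l(candidate, reference), 4))  # 0.8 (LCS = 4 tokens: P = 1.0, R = 4/6)
```

In this example, the candidate omits the qualifiers "ductal" and "focal". That omission lowers recall but leaves precision at 1.0, because every candidate token appears in order in the reference.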
Abstract
Whole-slide images (WSIs) present a fundamental challenge for computational pathology due to their extreme resolution, multi-scale heterogeneity, and the requirement for clinically reliable interpretation. Although recent pathology foundation models have enabled fluent report generation, they often lack clinical grounding, failing to accurately represent key diagnostic concepts and relationships observed by pathologists. This limitation arises from the difficulty of integrating heterogeneous visual evidence spanning fine-grained cellular patterns, slide-level tissue architecture, and high-level diagnostic concepts, while maintaining interpretability and clinical coherence. Here we present SCOUT: Semantic Context-aware mOdality fUsion Transformer, a context-aware concept-grounded multimodal framework for pathology report generation that enables progressive conditioning of image…
