WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Yuci Liang; Xinheng Lyu; Wenting Chen; Meidan Ding; Jipeng Zhang; Xiangjian He; Song Wu; Xiaohan Xing; Sen Yang; Xiyue Wang; Linlin Shen

arXiv:2412.02141 · cs.CV · August 13, 2025

Open Access

TL;DR

This paper introduces WSI-LLaVA, a multimodal large language model designed for comprehensive analysis of whole slide images in pathology, supported by a new large-scale benchmark and specialized evaluation metrics.

Contribution

The paper presents WSI-LLaVA, a novel framework for gigapixel WSI understanding, along with WSI-Bench, a large-scale morphology-aware benchmark, and new metrics for pathological assessment.

Findings

01

WSI-LLaVA outperforms existing models in morphological analysis.

02

Experiments show a strong correlation between a model's morphological understanding and its diagnostic accuracy.

03

The introduction of WSI-Bench and specialized WSI metrics improves the evaluation of WSI models.

Abstract

Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs' understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two…
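The three-stage training order named in the abstract can be pictured as a simple sequential pipeline. The sketch below is illustrative only: the stage names are taken from the abstract, while `Stage`, `make_placeholder_stage`, `run_pipeline`, and the dictionary-based state are hypothetical and do not reflect the authors' code. Each placeholder stage only records that it ran; no training happens.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage names come from the abstract; everything else here is hypothetical.
STAGE_NAMES = (
    "WSI-text alignment",
    "feature space alignment",
    "task-specific instruction tuning",
)


@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]  # takes the current state, returns the updated state


def make_placeholder_stage(name: str) -> Stage:
    """Build a stage whose `run` only records that it was executed."""

    def run(state: dict) -> dict:
        # A real stage would train here; this placeholder just logs the name.
        completed = state.get("completed", []) + [name]
        return {**state, "completed": completed}

    return Stage(name=name, run=run)


def run_pipeline(stages: List[Stage], state: dict) -> dict:
    """Run the stages strictly in order, feeding each output to the next stage."""
    for stage in stages:
        state = stage.run(state)
    return state


stages = [make_placeholder_stage(n) for n in STAGE_NAMES]
final_state = run_pipeline(stages, {})
```

Keeping each stage as a function from state to state makes the ordering explicit: a later stage can only start from whatever the earlier stages produced.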

Taxonomy

Topics: Multimodal Machine Learning Applications