Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

Wenchuan Zhang; Jingru Guo; Hengzhe Zhang; Penghao Zhang; Jie Chen; Shuwan Zhang; Zhang Zhang; Yuhao Yi; Hong Bu

arXiv:2508.02258 · cs.CV · March 24, 2026


TL;DR

Patho-AgenticRAG introduces a multimodal retrieval-augmented generation framework for pathology VLMs, leveraging a database of pathology textbook embeddings to improve diagnostic accuracy and reduce hallucinations through joint text-image retrieval and reasoning.

Contribution

The paper presents a multimodal RAG system whose database of pathology textbook pages enables joint text-image search and reasoning, enhancing diagnostic performance in pathology VLMs.

Findings

01. Significantly outperforms existing models in pathology diagnosis tasks.

02. Supports joint text-image retrieval for better visual evidence utilization.

03. Improves accuracy in complex diagnostic scenarios.

Abstract

Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical…
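The joint text-image search over page-level embeddings described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional vectors, the function names `cosine` and `retrieve_pages`, and the weighted-sum fusion controlled by `text_weight` are all assumptions made for illustration. The only idea taken from the abstract is that each textbook page has one embedding and that a query carries both a text and an image component.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_pages(query_text_vec, query_image_vec, page_index, k=2, text_weight=0.5):
    """Rank pages by a weighted sum of text-to-page and image-to-page similarity.

    page_index maps a hypothetical page id to that page's single embedding.
    The weighted-sum fusion is an assumption, not the paper's stated method.
    """
    scored = []
    for page_id, page_vec in page_index.items():
        score = (text_weight * cosine(query_text_vec, page_vec)
                 + (1.0 - text_weight) * cosine(query_image_vec, page_vec))
        scored.append((score, page_id))
    # Highest fused score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page_id for _, page_id in scored[:k]]


# Toy page-level index: page_c carries both the "text" and the "image" direction.
page_index = {
    "page_a": [1.0, 0.0, 0.0],  # matches the text query only
    "page_b": [0.0, 1.0, 0.0],  # matches the image query only
    "page_c": [0.7, 0.7, 0.0],  # matches both partially
}

top = retrieve_pages([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], page_index, k=1)
print(top)
```

In this toy index the page that partially matches both the text and the image query outranks pages that match only one modality, which is the behavior a joint search is meant to provide. A real system would obtain the embeddings from a trained multimodal encoder and use an approximate nearest-neighbor index rather than a linear scan.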


Code & Models

Models

WenchuanZhang/Agentic-Router (Hugging Face model · 5 downloads · 1 like)
