HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks

Zhe Chen; Yusheng Liao; Zhiyuan Zhu; Haolin Li; Hongcheng Liu; Yanfeng Wang; Yu Wang

arXiv:2508.12778·cs.CL·May 5, 2026

TL;DR

HeteroRAG is a novel framework that improves medical vision-language models by effectively retrieving and integrating heterogeneous knowledge sources, significantly enhancing factual accuracy and reliability in clinical tasks.

Contribution

The paper introduces HeteroRAG, a new retrieval-augmented generation framework that leverages modality-specific retrieval and multi-source knowledge alignment for medical vision-language tasks.

Findings

01. HeteroRAG achieves state-of-the-art results on 11 datasets.

02. It significantly improves the factual accuracy and reliability of Med-LVLM outputs.

03. Effective retrieval across heterogeneous sources enhances clinical decision-making.

Abstract

Medical large vision-language models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation (RAG) has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. Irrelevant retrieved reports undermine the factuality of the analysis, while insufficient knowledge weakens the credibility of clinical decision-making. To bridge this research gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Building on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for tailoring queries…
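The idea of modality-specific report retrieval can be illustrated with a minimal sketch. This is not the authors' implementation: the modality names, report IDs, toy three-dimensional vectors, and the `retrieve_reports` helper are all hypothetical, and the hand-written vectors stand in for embeddings that a modality-specific CLIP encoder would produce. The sketch only shows the routing step, where a query is matched against the report index of its own imaging modality and ranked by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_reports(query_vec, modality, indexes, k=2):
    """Return the IDs of the k reports most similar to the query,
    searching only the index that belongs to the query's modality."""
    index = indexes[modality]  # hypothetical per-modality report index
    scored = [(cosine(query_vec, vec), report_id) for report_id, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [report_id for _, report_id in scored[:k]]


# Toy indexes: (report ID, embedding). Real embeddings would come from
# an encoder trained for that modality; these values are made up.
INDEXES = {
    "radiology": [
        ("rad-001", [1.0, 0.0, 0.0]),
        ("rad-002", [0.8, 0.6, 0.0]),
        ("rad-003", [0.0, 1.0, 0.0]),
    ],
    "pathology": [
        ("path-001", [0.0, 0.0, 1.0]),
        ("path-002", [0.0, 0.7, 0.7]),
    ],
}

top_radiology = retrieve_reports([0.9, 0.1, 0.0], "radiology", INDEXES, k=2)
print(top_radiology)
```

Keeping one index per modality means a radiology query is never ranked against pathology reports, which is one simple way to reduce the irrelevant retrievals the abstract describes.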
