Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

Xupeng Chen; Binbin Shi; Chenqian Le; Jiaqi Zhang; Kewen Wang; Ran Gong; Jinhan Zhang; Chihang Wang

arXiv:2604.27724 · cs.AI · May 1, 2026

TL;DR

MED-VRAG introduces an iterative multimodal retrieval-augmented generation system that leverages document page images and reasoning over visual content to improve medical question answering accuracy.

Contribution

MED-VRAG is presented as the first multimodal RAG framework for medical QA to retrieve and reason over full document page images rather than OCR'd text, scaling to ~350K pages.

Findings

01. Achieves 78.6% accuracy across four medical QA benchmarks.

02. Retrieval improves accuracy by 5.8 points over a no-retrieval baseline.

03. Iterative reasoning and the memory bank contribute additional performance gains.

Abstract

Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4xA100. Across four…
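
The coarse-to-fine Stage-1 retrieval described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: a brute-force scan over centroids stands in for the ANN index, plain one-directional MaxSim late-interaction scoring stands in for the "exact two-way scoring" (which the abstract does not define), a tiny spherical k-means produces the C=8 centroids per page, and synthetic vectors replace ColQwen2.5 patch embeddings. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def kmeans(vectors, c, iters=10, seed=0):
    """Tiny spherical k-means; returns c unit-norm centroids (needs len(vectors) >= c)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, c)
    for _ in range(iters):
        groups = [[] for _ in range(c)]
        for v in vectors:
            best = max(range(c), key=lambda j: dot(v, centroids[j]))
            groups[best].append(v)
        for j, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize([sum(col) for col in zip(*g)])
    return centroids


def maxsim(query_vecs, page_vecs):
    """Late-interaction score: each query vector takes its best-matching page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def build_index(pages, c=8):
    """Offline step: compress each page's patch embeddings to c centroids."""
    return {pid: kmeans(patches, c) for pid, patches in pages.items()}


def search(query_vecs, pages, index, shortlist_size=3, top_k=1):
    # Coarse pass: score pages by their centroids only.
    # Assumption: a brute-force scan here replaces the ANN lookup.
    coarse = sorted(index, key=lambda pid: maxsim(query_vecs, index[pid]), reverse=True)
    shortlist = coarse[:shortlist_size]
    # Fine pass: exact rescoring of the shortlist against all patch embeddings.
    fine = sorted(shortlist, key=lambda pid: maxsim(query_vecs, pages[pid]), reverse=True)
    return fine[:top_k]


def make_demo_pages(n_pages=5, n_patches=32, dim=16, seed=0):
    """Synthetic data: page i's patches cluster around basis direction i."""
    rng = random.Random(seed)
    pages = {}
    for i in range(n_pages):
        pages[f"page-{i}"] = [
            normalize([(1.0 if d == i else 0.0) + rng.gauss(0, 0.05) for d in range(dim)])
            for _ in range(n_patches)
        ]
    return pages


if __name__ == "__main__":
    pages = make_demo_pages()
    index = build_index(pages, c=8)
    query = [normalize([1.0 if d == 2 else 0.01 for d in range(16)])]
    print(search(query, pages, index))
```

The point of the two passes is that the coarse pass touches only C vectors per page, so the expensive exact scoring is confined to a shortlist of R pages. In a real deployment the sorted scan over centroids would presumably be replaced by an approximate nearest-neighbor index.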
