ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Zhuofan Shi, Peilun Jia, Baoqin Sun, Haiyang Shen, Sixiong Xie, Yun Ma, and Xiang Jing

arXiv:2605.13034 · cs.CV · May 14, 2026


TL;DR

ViDR is a multimodal deep research framework that grounds long-form reports in source figures, improving evidence integration and report verifiability.

Contribution

ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, and presents MMR Bench+ for evaluating visual evidence use.

Findings

01

ViDR improves report quality and evidence integration.

02

It improves verifiability and reduces hallucinated figures.

03

In experiments, ViDR outperforms strong baselines.

Abstract

Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with…
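As a rough illustration of what an evidence-indexed outline could look like, the sketch below links outline claims to source-figure records and flags cited figures that have no retrieved source. It is not the authors' implementation: the names `FigureAtom`, `OutlineClaim`, `route_figures`, and `unverifiable_figures` are hypothetical, and the word-overlap score is a toy stand-in for the outline-aware reranking and VLM-based analysis the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FigureAtom:
    """Hypothetical record for one source figure kept as evidence."""
    figure_id: str
    source_url: str
    caption: str


@dataclass
class OutlineClaim:
    """One outline claim plus the IDs of the evidence linked to it."""
    text: str
    text_evidence_ids: list[str] = field(default_factory=list)
    figure_ids: list[str] = field(default_factory=list)


def overlap_score(claim: str, caption: str) -> float:
    """Toy relevance score (assumption): Jaccard overlap of lowercase word sets."""
    a, b = set(claim.lower().split()), set(caption.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def route_figures(claims, figures, threshold=0.2):
    """Attach each figure to its best-matching claim if the score clears the threshold."""
    for fig in figures:
        best = max(claims, key=lambda c: overlap_score(c.text, fig.caption), default=None)
        if best is not None and overlap_score(best.text, fig.caption) >= threshold:
            best.figure_ids.append(fig.figure_id)
    return claims


def unverifiable_figures(cited_ids, figures):
    """Return cited figure IDs that match no retrieved source figure."""
    known = {f.figure_id for f in figures}
    return [fid for fid in cited_ids if fid not in known]
```

In this sketch a figure that matches no claim is simply left unattached, and any figure ID cited in a draft but absent from the retrieved set is reported, which mirrors the idea of keeping figures traceable to a source rather than hallucinated.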
