ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
Zhuofan Shi, Peilun Jia, Baoqin Sun, Haiyang Shen, Sixiong Xie, Yun Ma, and Xiang Jing

TL;DR
ViDR is a multimodal deep research framework that grounds long-form reports in source figures, improving evidence integration and report verifiability.
Contribution
It treats source figures as verifiable evidence objects and introduces MMR Bench+ for evaluating how reports use visual evidence.
Findings
ViDR improves report quality and evidence integration.
It enhances verifiability and reduces hallucinated figures.
In experiments, ViDR outperforms strong baselines.
Abstract
Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with…
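To make the idea of an evidence-indexed outline concrete, here is a minimal Python sketch of linking claims to filtered and reranked figure candidates. It is not the authors' implementation: the names `FigureCandidate`, `Claim`, `keep_candidate`, `overlap_score`, and `build_evidence_index` are hypothetical, the size threshold is arbitrary, and simple token overlap stands in for the paper's context-aware filtering and outline-aware reranking. The VLM-based visual analysis step is left out entirely.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FigureCandidate:
    """A web image plus the text found around it (hypothetical schema)."""
    url: str
    caption: str
    context: str
    width: int
    height: int


@dataclass(frozen=True)
class Claim:
    """One outline claim that may need visual evidence (hypothetical schema)."""
    claim_id: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keep_candidate(fig: FigureCandidate, min_side: int = 200) -> bool:
    """Assumed stand-in for filtering: drop tiny images and images with no text."""
    has_text = bool(tokens(fig.caption + " " + fig.context))
    return min(fig.width, fig.height) >= min_side and has_text


def overlap_score(claim: Claim, fig: FigureCandidate) -> float:
    """Assumed stand-in for reranking: Jaccard overlap of claim and figure text."""
    a = tokens(claim.text)
    b = tokens(fig.caption + " " + fig.context)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_evidence_index(
    claims: list[Claim], figures: list[FigureCandidate], top_k: int = 1
) -> dict[str, list[str]]:
    """Map each claim id to the URLs of its best-matching surviving figures."""
    kept = [f for f in figures if keep_candidate(f)]
    index: dict[str, list[str]] = {}
    for claim in claims:
        ranked = sorted(kept, key=lambda f: overlap_score(claim, f), reverse=True)
        # Claims with no overlapping figure get an empty evidence list.
        index[claim.claim_id] = [
            f.url for f in ranked[:top_k] if overlap_score(claim, f) > 0
        ]
    return index
```

In this sketch a claim with no matching figure maps to an empty list rather than a forced match, which mirrors the goal of avoiding unsupported figures. A real system would replace the lexical score with a learned or VLM-based relevance signal.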
