MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Peizhou Huang; Zixuan Zhong; Zhongwei Wan; Donghao Zhou; Samiul Alam; Xin Wang; Zexin Li; Zhihao Dou; Li Zhu; Jing Xiong; Chaofan Tao; Yan Xu; Dimitrios Dimitriadis; Tuo Zhang; Mi Zhang

arXiv:2601.12346 · cs.CV · January 21, 2026

TL;DR

This paper introduces MMDeepResearch-Bench, a benchmark of 140 expert-crafted tasks across 21 domains that evaluates how well multimodal deep research agents generate citation-rich, evidence-grounded reports connecting images and text.

Contribution

It presents a new end-to-end multimodal benchmark and an interpretable evaluation pipeline for assessing report quality, citation accuracy, and visual-text integrity in deep research agents.

Findings

01. Models show trade-offs between report quality and evidence fidelity.
02. Strong prose does not ensure faithful evidence use.
03. Multimodal integrity is a key bottleneck for research agents.

Abstract

Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded…

Code & Models

Datasets

MMDR-2025/MMdeepresearch
dataset · 1.5k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education