MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Fangda Ye; Yuxin Hu; Pengxiang Zhu; Yibo Li; Ziqi Jin; Yao Xiao; Yibo Wang; Lei Wang; Zhen Zhang; Lu Wang; Yue Deng; Bin Wang; Yifan Zhang; Liangcai Su; Xinyu Wang; He Zhao; Chen Wei; Qiang Ren; Bryan Hooi; An Bo; Shuicheng Yan; Lidong Bing

arXiv:2603.28407 · cs.AI · March 31, 2026

TL;DR

MiroEval is a comprehensive benchmark and evaluation framework for deep research systems, assessing process, factuality, and outcome across real-world, multimodal tasks with periodic updates.

Contribution

MiroEval introduces a periodically refreshed benchmark that evaluates deep research agents along three dimensions (process, factuality, and outcome), addressing gaps left by existing assessments that score only final reports.

Findings

1. Evaluation dimensions capture complementary system capabilities.
2. Process quality predicts overall outcome and reveals hidden weaknesses.
3. Multimodal tasks are significantly more challenging for current systems.

Abstract

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic…

Code & Models

Repositories

miromindai/MiroEval
github

Datasets

miromind-ai/MiroEval-data
dataset · 288 downloads
