EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
Xuan Dong, Huanyang Zheng, Tianhao Niu, Zhe Han, Pengzhan Li, Bofei Liu, Zhengyang Liu, Guancheng Li, Qingfu Zhu, Wanxiang Che

TL;DR
EpiBench is a new benchmark for evaluating multimodal research agents in multi-turn workflows, emphasizing proactive search, evidence integration, and reproducibility.
Contribution
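The paper introduces a comprehensive episodic benchmark and evaluation framework for multi-turn, multimodal research workflows, targeting capabilities that existing benchmarks under-evaluate: proactive search, multi-evidence integration, and sustained evidence use.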
Findings
Leading models achieve only 29.23% accuracy on the hard split.
EpiBench enables fine-grained assessment of research agents' capabilities.
Substantial room for improvement remains in multi-evidence, multi-turn research tasks.
Abstract
Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed in existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration, and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the evidence accumulated in memory to answer objective questions that require cross-paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for…
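The episodic setup described in the abstract can be illustrated with a minimal sketch. This is not EpiBench's actual interface or scoring code: the names `Evidence`, `EpisodeMemory`, `best_paper`, `accuracy`, and `run_toy_episode` are all hypothetical, and the numbers are toy values rather than results from any paper. It only shows the general idea of accumulating evidence over several turns and then answering an objective cross-paper question from memory.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One piece of evidence read from a paper (all fields are hypothetical)."""
    paper_id: str
    source: str    # where it was read from, e.g. "table" or "figure"
    metric: str    # name of the reported quantity
    value: float   # toy number, not taken from any real paper


@dataclass
class EpisodeMemory:
    """Evidence accumulated across the turns of one episode."""
    items: list = field(default_factory=list)

    def add(self, evidence: Evidence) -> None:
        self.items.append(evidence)

    def best_paper(self, metric: str) -> str:
        """Cross-paper comparison: the paper with the highest value for `metric`."""
        candidates = [e for e in self.items if e.metric == metric]
        if not candidates:
            raise ValueError(f"no evidence stored for metric {metric!r}")
        return max(candidates, key=lambda e: e.value).paper_id


def accuracy(predictions: list, gold: list) -> float:
    """Fraction of objective questions answered correctly (exact match)."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def run_toy_episode() -> str:
    """Three turns of evidence gathering, then one question answered from memory."""
    memory = EpisodeMemory()
    turns = [
        Evidence("paper_A", "table", "accuracy", 0.71),
        Evidence("paper_B", "figure", "accuracy", 0.78),
        Evidence("paper_C", "table", "latency_ms", 12.0),
    ]
    for evidence in turns:
        memory.add(evidence)
    # The answer depends on evidence from more than one earlier turn.
    return memory.best_paper("accuracy")
```

In this sketch the final answer can only be produced if evidence from earlier turns is retained, which is the kind of sustained evidence use the abstract says existing benchmarks under-evaluate. The exact-match `accuracy` function is an assumption about how objective questions might be scored, not a description of the paper's metric.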
