What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Wendong Bu; Yang Wu; Qifan Yu; Minghe Gao; Bingchen Miao; Zhenkui Zhang; Kaihang Pan; Yunfei Li; Mengze Li; Wei Ji; Juncheng Li; Siliang Tang; Yueting Zhuang

arXiv:2506.08933 · cs.CV · June 11, 2025

Open Access

TL;DR

OmniBench introduces a scalable, multidimensional benchmark with automated task synthesis for evaluating and advancing virtual agent capabilities across diverse scenarios.

Contribution

We developed OmniBench and OmniEval, enabling controllable, multidimensional evaluation of virtual agents with synthesized tasks and comprehensive metrics.

Findings

01 High human acceptance rate of 91% for synthesized tasks

02 Graph-structured data improves training efficiency

03 Performance variability across models and capabilities

Abstract

As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenarios, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance…
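To make the idea of composing subtasks into a graph-structured task more concrete, here is a minimal Python sketch. It is not the OmniBench pipeline: the `Subtask` class, the `needs`/`gives` resource sets, and the `compose` and `depth` helpers are all hypothetical names chosen for illustration, and using graph depth as a complexity measure is an assumption, not a metric taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    """Hypothetical subtask: consumes some resources and produces others."""
    name: str
    needs: frozenset = frozenset()
    gives: frozenset = frozenset()


def compose(subtasks):
    """Build a task graph: add an edge a -> b when a produces something b needs."""
    edges = {s.name: [] for s in subtasks}
    for a in subtasks:
        for b in subtasks:
            if a is not b and a.gives & b.needs:
                edges[a.name].append(b.name)
    return edges


def depth(edges):
    """Longest dependency chain, counted in subtasks (assumes the graph is acyclic)."""
    memo = {}

    def longest(node):
        if node not in memo:
            memo[node] = 1 + max((longest(m) for m in edges[node]), default=0)
        return memo[node]

    return max((longest(n) for n in edges), default=0)


# Illustrative subtasks only; these are not tasks from the dataset.
subtasks = [
    Subtask("open_browser", gives=frozenset({"browser"})),
    Subtask("download_image", needs=frozenset({"browser"}), gives=frozenset({"image"})),
    Subtask("open_editor", gives=frozenset({"editor"})),
    Subtask("crop_image", needs=frozenset({"image", "editor"}), gives=frozenset({"cropped"})),
]

graph = compose(subtasks)
print(graph)
print("subtasks:", len(graph), "depth:", depth(graph))
```

In this sketch, adding or removing subtasks changes both the number of nodes and the length of the longest dependency chain, which is one simple way to think about "controllable complexity" in a composed task.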

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)