MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Xuehui Wang; Zhenyu Wu; JingJing Xie; Zichen Ding; Bowen Yang; Zehao Li; Zhaoyang Liu; Qingyun Li; Xuan Dong; Zhe Chen; Weiyun Wang; Xiangyu Zhao; Jixuan Chen; Haodong Duan; Tianbao Xie; Chenyu Yang; Shiqian Su; Yue Yu; Yuan Huang; Yiqian Liu; Xiao Zhang; Yanting Zhang; Xiangyu Yue; Weijie Su; Xizhou Zhu; Wei Shen; Jifeng Dai; Wenhai Wang

arXiv:2507.19478·cs.CV·July 28, 2025

TL;DR

This paper presents MMBench-GUI, a comprehensive hierarchical benchmark for evaluating GUI automation agents across multiple platforms, introducing a new efficiency metric and emphasizing the importance of visual grounding, planning, and efficiency in GUI tasks.

Contribution

The paper introduces MMBench-GUI, a multi-level benchmark for cross-platform GUI agent evaluation, and proposes a novel efficiency metric to assess automation performance.

Findings

1. Accurate visual grounding is crucial for task success.
2. Modular grounding modules significantly improve performance.
3. Efficiency issues persist across models, highlighting the need for better strategies.

Abstract

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role.
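
The abstract names the EQA metric without defining it. As a rough illustration of what an area-style efficiency score can look like, the sketch below averages, over evenly spaced step budgets, the fraction of tasks solved within each budget. This is an assumption made for illustration only and is not the paper's Equation (11): the function name `efficiency_quality_area`, its parameters, and the use of 101 sample points (which merely echoes the M=101 that Reviewer 02 mentions) are all hypothetical.

```python
from typing import Sequence


def efficiency_quality_area(
    steps: Sequence[int],
    successes: Sequence[bool],
    max_steps: int,
    num_points: int = 101,
) -> float:
    """Illustrative area-style efficiency score in [0, 1].

    Hypothetical sketch, NOT the paper's definition: for each of
    `num_points` evenly spaced step budgets in [0, max_steps], compute the
    fraction of tasks solved within that budget, then average the fractions.
    """
    if len(steps) != len(successes):
        raise ValueError("steps and successes must have the same length")
    if not steps or max_steps <= 0 or num_points < 2:
        raise ValueError("need at least one task, max_steps > 0, num_points >= 2")

    total = 0.0
    for k in range(num_points):
        # Step budget for this sample point, from 0 up to max_steps.
        budget = max_steps * k / (num_points - 1)
        # A task counts only if it succeeded and finished within the budget.
        solved = sum(1 for s, ok in zip(steps, successes) if ok and s <= budget)
        total += solved / len(steps)
    return total / num_points


# Two agents with the same success rate but different step counts.
fast = efficiency_quality_area([2, 2], [True, True], max_steps=10, num_points=11)
slow = efficiency_quality_area([8, 8], [True, True], max_steps=10, num_points=11)
```

Under this assumed definition, two agents with identical success rates receive different scores when one needs fewer steps, which is the kind of distinction a success-only metric cannot make.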

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper exhaustively covers the different abilities of GUI agents across diverse platforms. The motivation for doing this hierarchically is sound.
2. The authors introduce MacOSArena, which includes 70 tasks for macOS; this is a novel contribution, in addition to the L1 and L2 images they curate and the tasks they create for them.
3. The paper analyzes the efficiency of GUI agents in performing tasks in addition to task success. This type of evaluation is much needed.

Weaknesses

1. My main point of concern is that this benchmark does not give new insights regarding GUI agents. I do agree that unifying different tasks in one framework is useful, but I don't see how doing so allows authors to come up with new insights (please see my comments below to know my problems with the tasks). Overall, I feel several works already point out the issues that the authors learn from this benchmark, which brings an important question as to its practical utility. 1. For L1, a related wo

Reviewer 02 · Rating 6 · Confidence 4

Strengths

S1: The four-level framework naturally decomposes GUI agent capabilities from basic understanding to complex collaboration, enabling fine-grained diagnosis.
S2: The benchmark spans all major platforms (Windows, macOS, Linux, iOS, Android, Web) and has a balanced data distribution, addressing a gap in existing work.
S3: Testing 10+ models with detailed analysis across platforms and difficulty levels provides useful insights for the community.

Weaknesses

W1: L1/L2 rely heavily on automated generation via Claude/GPT models, with limited details on quality control using manual sampling. For manual sampling, what percentage of LLM-generated questions were rejected during manual review? What specific issues were found? More details could be provided.
W2: The continuous-time integral formulation of EQA (Equation (11)) is somewhat opaque. Have the authors compared EQA to other efficiency metrics? How sensitive is it to the choice of M=101 evaluation

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. The benchmark covers six major platforms and four hierarchical capability levels.
2. Introduces an Efficiency-Quality Area (EQA) metric combining accuracy and efficiency.
3. Offers extensive empirical results across a wide range of proprietary and open-source GUI agents.
4. Proposes a transparent and well-documented data construction process.

Weaknesses

1. **No contribution to learning representations or algorithms.** The work is purely an evaluation benchmark with no proposed model, training method, or learning insight, misaligned with ICLR’s core focus on representation learning.
2. **Limited conceptual novelty.** The four-level hierarchy (understanding → grounding → automation → collaboration) closely mirrors existing works such as OSWorld, ScreenSpot-Pro, and UI-TARS, amounting to a reorganization rather than a conceptual advance

Reviewer 04 · Rating 6 · Confidence 3

Strengths

This is one of the first works to unify GUI agent evaluation across platforms and hierarchical capability levels, from low-level perception (understanding, grounding) to high-level reasoning (automation, collaboration). I like the four-level evaluation structure. The proposed EQA metric also sounds good and could address an issue ignored by prior success-only metrics. The experiments on this benchmark, which include major open-source and proprietary VLM/LLM systems, are also well done, which is b

Weaknesses

The paper’s primary contribution is the benchmark design, not new algorithmic or modeling techniques. It is useful to CUA research, but it is hard to say whether its technical contributions are sufficient for an ICLR paper. In addition, prior works such as OSWorld already explore parts of these capabilities, and the paper’s novelty lies mainly in unifying them, not in entirely new data or task types.

Code & Models

Datasets

OpenGVLab/MMBench-GUI
dataset · 128 dl

Taxonomy

Topics: Advanced Software Engineering Methodologies · Human-Automation Interaction and Safety · Social Robot Interaction and HRI