GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Yang Li; Yuchen Liu; Haoyu Lu; Zhiqiang Xia; Hongzhen Wang; Kaiyang Han; Changpeng Yang; Jinyang Wu; Jiaming Xu; Runyu Shi; Ying Huang

arXiv:2603.15039 · cs.CV · March 17, 2026

TL;DR

GUI-CEval is a comprehensive Chinese benchmark for mobile GUI agents. It evaluates perception, planning, reflection, execution, and evaluation across 201 mainstream apps and four device types, to help improve model reliability.

Contribution

This paper introduces GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents. It is built entirely on physical devices and assesses the full capability chain from perception to execution.

Findings

01 Models like Qwen2.5-VL and UI-TARS perform well

02 Most MLLMs struggle with reflective decision-making

03 Weaknesses in self-evaluation limit real-world reliability

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent tasks, and lack a unified, fine-grained framework for assessing the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling