GEBench: Benchmarking Image Generation Models as GUI Environments

Haodong Li; Jingwei Wu; Quan Sun; Guopeng Li; Juanxi Tian; Huanyu Zhang; Yanlin Lai; Ruichuan An; Hongbo Peng; Yuhong Dai; Chenxi Li; Chunmei Qing; Jia Wang; Ziyang Meng; Zheng Ge; Xiangyu Zhang; and Daxin Jiang

arXiv:2602.09007 · cs.AI · February 11, 2026

TL;DR

GEBench introduces a benchmark and a five-dimensional metric, GE-Score, for evaluating image generation models as GUI environments. It emphasizes temporal coherence and interaction logic, and it shows that current models struggle to stay consistent over multi-step sequences.

Contribution

This work presents GEBench, a benchmark paired with a five-dimensional metric (GE-Score) for assessing how well image generation models handle GUI state transitions and temporal coherence, an area that existing benchmarks focused on general-domain visual fidelity leave underexplored.

Findings

01 Models perform well on single-step transitions.

02 Models struggle with temporal coherence in multi-step sequences.

03 Icon interpretation and localization are key bottlenecks.

Abstract

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general-domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and…

Code & Models

Datasets

stepfun-ai/GEBench
dataset · 1.3k downloads

Taxonomy

Topics: Data Visualization and Analytics · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis