Benchmarking and Improving GUI Agents in High-Dynamic Environments
Enqi Liu, Liyuan Pan, Zhi Gao, Yan Yang, Chenrui Shi, Yang Liu, Jingrong Wu, Qing Li

TL;DR
This paper introduces DynamicGUIBench, a comprehensive benchmark for high-dynamic GUI environments, and proposes DynamicUI, an agent that uses screen recordings and multi-component processing to improve decision-making in such settings.
Contribution
The paper presents a new benchmark for dynamic GUI environments and a novel agent architecture that leverages video input and multi-stage processing to enhance GUI interaction performance.
Findings
DynamicUI outperforms existing agents on DynamicGUIBench.
DynamicUI maintains competitive performance on static GUI benchmarks.
The approach effectively handles interface changes and noisy visual input.
Abstract
Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process in which the key GUI state, including information important for choosing actions, is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by significant interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the…
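The partial-observability argument above can be illustrated with a toy simulation. This is a hypothetical sketch, not DynamicGUIBench or DynamicUI: the `TransientElement` class, the timing values, and the union-of-frames aggregation are all illustrative assumptions. It shows how an agent that captures one screenshot after an action can miss a transient element, such as a short-lived verification-code toast, that a sampled screen recording covering the same interval would capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransientElement:
    """Hypothetical UI element visible only during [start, end) seconds after an action."""
    text: str
    start: float
    end: float


def render(elements, t):
    """Return the set of element texts visible at time t (a toy 'frame')."""
    return {e.text for e in elements if e.start <= t < e.end}


def screenshot_observation(elements, capture_time):
    """Single-screenshot agent: sees only the one frame captured at capture_time."""
    return render(elements, capture_time)


def recording_observation(elements, duration, fps):
    """Recording-based agent (toy): union of what is visible across sampled frames.

    Frames are sampled at t = 0, 1/fps, ..., duration. Real video-based agents
    process frames far more richly; the union is only for illustration.
    """
    n_frames = int(duration * fps) + 1
    seen = set()
    for i in range(n_frames):
        seen |= render(elements, i / fps)
    return seen


# Hypothetical interface: a persistent home screen plus a toast that
# disappears before the agent's post-action screenshot at t = 1.0 s.
ui = [
    TransientElement("home screen", 0.0, float("inf")),
    TransientElement("verification code 4821", 0.3, 0.8),
]

print(sorted(screenshot_observation(ui, capture_time=1.0)))
print(sorted(recording_observation(ui, duration=1.0, fps=10)))
```

The first call prints only `['home screen']`, while the second also includes the toast text. In this toy setting, the information needed for the next action exists only between actions, which is the gap that motivates recording-based observation.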
