GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning
Longxi Gao, Li Zhang, Pengzhi Gao, Wei Liu, Jian Luan, Mengwei Xu

TL;DR
GUI-Shift introduces a self-supervised reinforcement learning framework that enhances vision-language models for GUI tasks by predicting GUI transitions, reducing reliance on annotated data, and improving performance across multiple benchmarks.
Contribution
The paper presents GUI-Shift, a novel self-supervised RL method that leverages GUI dynamics prediction to improve VLM-based GUI agents without requiring natural language instructions.
Findings
Up to 11.2% increase in GUI automation accuracy
Effective generalization across multiple GUI benchmarks
Scalable training without annotated datasets
Abstract
Training effective Vision-Language Models (VLMs) for GUI agents typically depends on large-scale annotated datasets, whose collection is both labor-intensive and error-prone. We introduce K-step GUI Transition, a self-supervised inverse dynamics task in which VLMs learn GUI dynamics by predicting the initial action that causes a transition between two GUI states. This approach eliminates the need for natural language instructions and enables scalable dataset construction from existing GUI trajectories or automated exploration. Building on this task, we propose GUI-Shift, a reinforcement learning (RL) framework that combines rule-based optimization with data filtering to improve VLM performance. We conduct extensive experiments using multiple VLM backbones across four benchmarks, spanning GUI task automation (AndroidControl, GUI Odyssey) and GUI grounding (ScreenSpot-v2, ScreenSpot-Pro).…
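The K-step GUI Transition task described in the abstract can be illustrated with a minimal sketch of how training samples might be cut from an existing trajectory. This is not the authors' pipeline: the `TransitionSample` type, the `build_k_step_samples` function, and the dictionary layout of an action are all hypothetical names chosen for illustration. The only idea taken from the abstract is that each sample pairs two GUI states with the initial action that starts the transition between them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TransitionSample:
    """One self-supervised example (hypothetical layout, not from the paper)."""
    state_before: str          # identifier of the screenshot at step t
    state_after: str           # identifier of the screenshot at step t + k
    target_action: Dict        # the action taken at step t (the label to predict)


def build_k_step_samples(
    states: List[str], actions: List[Dict], k: int
) -> List[TransitionSample]:
    """Pair each state S_t with S_(t+k) and label the pair with action a_t.

    Assumes actions[t] moves states[t] to states[t + 1], so a trajectory
    with n actions has n + 1 states. No language instruction is needed.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(states) != len(actions) + 1:
        raise ValueError("expected exactly one more state than actions")
    return [
        TransitionSample(states[t], states[t + k], actions[t])
        for t in range(len(states) - k)
    ]


# Toy trajectory: four screenshots connected by three actions.
toy_states = ["s0.png", "s1.png", "s2.png", "s3.png"]
toy_actions = [
    {"type": "click", "x": 120, "y": 640},
    {"type": "type", "arg": "weather"},
    {"type": "click", "x": 300, "y": 90},
]
toy_samples = build_k_step_samples(toy_states, toy_actions, k=2)
```

With `k=2`, the toy trajectory yields two samples: the pair (`s0.png`, `s2.png`) labeled with the first click, and the pair (`s1.png`, `s3.png`) labeled with the typing action.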
Peer Reviews
Decision: ICLR 2026 Poster
This paper proposes a novel GUI task, the K-step GUI transition task, which enables self-supervised training and eliminates the need for manually annotated datasets.
Despite the method's efficiency, the explanation of its mechanisms and the deeper analysis are somewhat lacking; please see the questions section for details.
GUI-Shift is practical and scalable because it builds supervision directly from real GUI trajectories (no manual labels) and trains with a GRPO sampling-and-ranking loop that better accommodates multiple plausible actions than single-label SFT, while its rule-verified rewards provide precise, automatic checks on action type and arguments. The unified eight-action interface and “keep only informative cases” filtering keep optimization focused and stable, and dropping explicit reasoning traces cut
The method treats the trajectory’s first action as the only gold label. Rule-based rewards allow coordinate variance (e.g., any point inside the bbox) but not type/argument alternatives, so equally valid first moves that still reach S_(t+k) get penalized, biasing against strategy-level equivalence. Training/eval center on AndroidControl and GUI Odyssey (mobile). Grounding sets are included, but not multi-step control. Metrics are mostly single-step TM/EM, no end-to-end success rates, latency/to
S1: The K-step Transition objective is a compact way to leverage unlabeled trajectories at scale. The visual-goal formulation reduces reliance on noisy textual instructions and naturally encourages temporal reasoning.
S2: GRPO avoids penalizing multiple valid clicks (any point in a control) and removes the need for a critic. The binary-format action reward is stable, inexpensive, and reproducible.
S3: On AndroidControl-High, GUI-Shift-Qwen improves EM by +11.2% (to 70.4%), with other base mod
W1: Training relies solely on the AndroidControl split, which undercuts the claim that the method scales to arbitrary “unlabeled trajectories.”
W2: Performance dips on GUI Odyssey for some models are attributed to tablet layouts; ScreenSpot-Pro (desktop, high-res) gains are modest.
W3: The work evaluates offline action accuracy and grounding. Demonstrating end-to-end success in a dynamic environment like AndroidWorld would strengthen claims about task automation robustness.
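The reviews above describe a rule-verified reward that checks action type and arguments, accepts any click inside the target's bounding box, and feeds a critic-free GRPO loop with a "keep only informative cases" filter. The sketch below is one possible reading of that description, not the paper's implementation. The action dictionary layout, the function names, and the interpretation of "informative" as "the sampled group does not all receive the same reward" are assumptions. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO formulation.

```python
from statistics import mean, pstdev
from typing import Dict, List


def action_reward(pred: Dict, gold: Dict) -> float:
    """Binary rule-based reward (hypothetical action layout).

    Returns 1.0 only if the action type matches and either the predicted
    point lies inside the gold bounding box, or the argument matches exactly.
    """
    if pred.get("type") != gold["type"]:
        return 0.0
    if "bbox" in gold:
        x, y = pred.get("x"), pred.get("y")
        if x is None or y is None:
            return 0.0
        left, top, right, bottom = gold["bbox"]
        inside = left <= x <= right and top <= y <= bottom
        return 1.0 if inside else 0.0
    return 1.0 if pred.get("arg") == gold.get("arg") else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_informative(rewards: List[float]) -> bool:
    """Assumed filter: a group with identical rewards carries no learning signal."""
    return len(set(rewards)) > 1


# Toy group: four sampled predictions for one gold click on a button.
gold_action = {"type": "click", "bbox": (100, 600, 200, 680)}
sampled = [
    {"type": "click", "x": 120, "y": 640},   # inside the box
    {"type": "click", "x": 199, "y": 601},   # inside the box, different point
    {"type": "click", "x": 10, "y": 10},     # right type, wrong location
    {"type": "scroll", "arg": "down"},       # wrong action type
]
group_rewards = [action_reward(p, gold_action) for p in sampled]
advantages = group_advantages(group_rewards) if is_informative(group_rewards) else []
```

In the toy group, both in-box clicks earn the same reward even though their coordinates differ, which matches the reviewers' point about coordinate variance. The `scroll` prediction scores zero regardless of whether it could also lead to the target state, which is the limitation the second review raises about type and argument alternatives.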
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Focus · Shrink and Fine-Tune
