GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning

Longxi Gao; Li Zhang; Pengzhi Gao; Wei Liu; Jian Luan; Mengwei Xu

arXiv:2505.12493·cs.AI·October 13, 2025

Open Access · 3 Reviews

TL;DR

GUI-Shift is a self-supervised reinforcement learning framework that improves vision-language models on GUI tasks by training them to predict the action behind a transition between two GUI states. This reduces reliance on annotated data and improves performance across multiple benchmarks.

Contribution

The paper presents GUI-Shift, a novel self-supervised RL method that leverages GUI dynamics prediction to improve VLM-based GUI agents without requiring natural language instructions.

Findings

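01. Up to an 11.2% gain in GUI task automation accuracy (EM on AndroidControl-High)

02. Evaluated across four benchmarks covering GUI task automation (AndroidControl, GUI Odyssey) and GUI grounding (ScreenSpot-v2, ScreenSpot-Pro)

03. Scalable training without natural language instructions or manually annotated datasets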

Abstract

Training effective Vision-Language Models (VLMs) for GUI agents typically depends on large-scale annotated datasets, whose collection is both labor-intensive and error-prone. We introduce K-step GUI Transition, a self-supervised inverse dynamics task in which VLMs learn GUI dynamics by predicting the initial action that causes a transition between two GUI states. This approach eliminates the need for natural language instructions and enables scalable dataset construction from existing GUI trajectories or automated exploration. Building on this task, we propose GUI-Shift, a reinforcement learning (RL) framework that combines rule-based optimization with data filtering to improve VLM performance. We conduct extensive experiments using multiple VLM backbones across four benchmarks, spanning GUI task automation (AndroidControl, GUI Odyssey) and GUI grounding (ScreenSpot-v2, ScreenSpot-Pro).…
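As a rough illustration of how K-step GUI Transition samples might be built from an existing trajectory, the Python sketch below pairs each state S_t with the state reached k steps later and labels the pair with the first action taken from S_t. The names `TransitionSample` and `build_k_step_samples`, and the choice to represent screens as strings and actions as dicts, are assumptions made for illustration; the abstract does not specify the authors' data format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TransitionSample:
    """One self-supervised example (hypothetical structure)."""
    state_before: str          # screenshot ID for S_t
    state_after: str           # screenshot ID for S_{t+k}
    first_action: Dict         # action taken at S_t, used as the label


def build_k_step_samples(
    screens: List[str], actions: List[Dict], k: int
) -> List[TransitionSample]:
    """Pair S_t with S_{t+k} and label the pair with the action at step t.

    Assumes actions[t] moves the GUI from screens[t] to screens[t + 1],
    so a trajectory has exactly one more screen than actions.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(screens) != len(actions) + 1:
        raise ValueError("expected one more screen than actions")

    samples = []
    for t in range(len(screens) - k):
        samples.append(
            TransitionSample(screens[t], screens[t + k], actions[t])
        )
    return samples


# Example: a 3-action trajectory yields two samples for k = 2.
trajectory_screens = ["s0", "s1", "s2", "s3"]
trajectory_actions = [{"type": "click"}, {"type": "scroll"}, {"type": "type"}]
two_step = build_k_step_samples(trajectory_screens, trajectory_actions, k=2)
```

No natural language instruction appears anywhere in a sample: the target state S_{t+k} plays the role of a visual goal.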

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

This paper proposes a novel GUI task, the K-step GUI Transition task, which enables self-supervised training and removes the need to construct an annotated dataset.

Weaknesses

Although the method is efficient, the explanation of its mechanisms and the deeper analysis are somewhat lacking; see the reviewer's questions for details.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

GUI-Shift is practical and scalable because it builds supervision directly from real GUI trajectories (no manual labels) and trains with a GRPO sampling-and-ranking loop that better accommodates multiple plausible actions than single-label SFT, while its rule-verified rewards provide precise, automatic checks on action type and arguments. The unified eight-action interface and “keep only informative cases” filtering keep optimization focused and stable, and dropping explicit reasoning traces cut
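The group-relative scoring commonly used in GRPO can be sketched as follows: several candidate actions are sampled for one input, each receives a reward, and each reward is normalized against the group's mean and standard deviation, so no learned critic is needed. The function names, the `eps` constant, and in particular the `is_informative` filter are assumptions for illustration. The filter is only one plausible reading of "keep only informative cases" and is not confirmed by this page.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled response's reward against its own group.

    A_i = (r_i - mean(r)) / (std(r) + eps), using the population std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def is_informative(rewards: List[float]) -> bool:
    """Hypothetical filter: keep a case only if its rewards differ.

    When every sampled response gets the same reward, all advantages
    are zero and the case contributes no learning signal.
    """
    return len(set(rewards)) > 1


# Four sampled actions for one (S_t, S_{t+k}) pair, two judged correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Four sampled actions that were all judged correct.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In the mixed group the correct samples receive positive advantages and the incorrect ones negative; in the uniform group every advantage is zero, which is why such cases are natural candidates for filtering.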

Weaknesses

The method treats the trajectory’s first action as the only gold label. Rule-based rewards allow coordinate variance (e.g., any point inside the bbox) but not type/argument alternatives, so equally valid first moves that still reach S_(t+k) get penalized, biasing against strategy-level equivalence. Training/eval center on AndroidControl and GUI Odyssey (mobile). Grounding sets are included, but not multi-step control. Metrics are mostly single-step TM/EM, no end-to-end success rates, latency/to
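The rule-based check this review describes can be sketched as a binary reward: the predicted action type must equal the gold type, a click counts as correct anywhere inside the target bounding box, and other arguments must match exactly. The dict layout (`type`, `x`, `y`, `bbox`, `args`) and the function name `action_reward` are hypothetical, chosen only to make the sketch runnable; the paper's actual action schema is not given here.

```python
from typing import Dict


def action_reward(pred: Dict, gold: Dict) -> float:
    """Return 1.0 if the predicted action passes the rule checks, else 0.0."""
    # Rule 1: the action type must match exactly; no alternatives are accepted.
    if pred.get("type") != gold.get("type"):
        return 0.0

    # Rule 2: for clicks, any point inside the gold bounding box is accepted.
    if gold["type"] == "click":
        x, y = pred.get("x"), pred.get("y")
        if x is None or y is None:
            return 0.0
        left, top, right, bottom = gold["bbox"]
        inside = left <= x <= right and top <= y <= bottom
        return 1.0 if inside else 0.0

    # Rule 3: for all other types, arguments must match exactly.
    return 1.0 if pred.get("args") == gold.get("args") else 0.0


gold_click = {"type": "click", "bbox": (100, 200, 300, 260)}
reward_inside = action_reward({"type": "click", "x": 150, "y": 230}, gold_click)
reward_outside = action_reward({"type": "click", "x": 50, "y": 230}, gold_click)
reward_other_type = action_reward({"type": "scroll", "args": "down"}, gold_click)
```

The last call illustrates the reviewer's concern: a different action type scores zero under these rules even if it would also have led to S_(t+k).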

Reviewer 03 · Rating 6 · Confidence 4

Strengths

S1: The K-step Transition objective is a compact way to leverage unlabeled trajectories at scale. The visual-goal formulation reduces reliance on noisy textual instructions and naturally encourages temporal reasoning.

S2: GRPO avoids penalizing multiple valid clicks (any point in a control) and removes the need for a critic. The binary-format action reward is stable, inexpensive, and reproducible.

S3: On AndroidControl-High, GUI-Shift-Qwen improves EM by +11.2% (to 70.4%), with other base mod

Weaknesses

W1: Training relies solely on the AndroidControl split, which undercuts the claim that the method scales to arbitrary “unlabeled trajectories.”

W2: Performance dips on GUI Odyssey for some models are attributed to tablet layouts; ScreenSpot‑Pro (desktop, high‑res) gains are modest.

W3: The work evaluates offline action accuracy and grounding. Demonstrating end‑to‑end success in a dynamic environment like AndroidWorld would strengthen claims about task automation robustness.

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Focus · Shrink and Fine-Tune