VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Jiani Zheng; Lu Wang; Fangkai Yang; Chaoyun Zhang; Lingrui Mei; Wenjie Yin; Qingwei Lin; Dongmei Zhang; Saravan Rajmohan; Qi Zhang

arXiv:2502.18906·cs.LG·February 27, 2025

TL;DR

This paper introduces VEM, an environment-free reinforcement learning framework for GUI agents that uses a pretrained value environment model to estimate long-term action utility from offline data, improving robustness and performance.

Contribution

The paper proposes an environment-free RL approach that uses a pretrained Value Environment Model (VEM) to decouple value estimation from policy optimization, enabling GUI automation training without environment interaction.

Findings

1. VEM achieves state-of-the-art results on Android-in-the-Wild benchmarks.
2. VEM significantly outperforms other environment-free methods.
3. VEM matches the performance of environment-based approaches without interaction costs.

Abstract

Training Vision-Language Models (VLMs) as Graphical User Interface (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., "Does this action advance the user's goal?"). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding…
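
As a rough illustration of the two-stage idea described in the abstract, the toy Python sketch below first fits a value table from offline (state, action, value) triples and then improves a softmax policy against that frozen table, with no environment calls. Everything in it is an assumption made for illustration: the tabular estimate stands in for the VLM-based VEM, and the dataset, the function names (`pretrain_value_model`, `optimize_policy`), and the policy-gradient update are hypothetical, not the paper's actual implementation.

```python
import math
from collections import defaultdict

# Hypothetical offline dataset of (state, action, annotated value) triples.
OFFLINE_DATA = [
    ("home_screen", "open_settings", 1.0),
    ("home_screen", "open_camera", 0.0),
    ("home_screen", "open_settings", 1.0),
    ("settings", "tap_wifi", 1.0),
    ("settings", "press_back", 0.0),
]


def pretrain_value_model(data):
    """Stage 1 (toy): estimate Q(s, a) as the mean annotated value per pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, action, value in data:
        totals[(state, action)] += value
        counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def optimize_policy(q, lr=1.0, steps=100):
    """Stage 2 (toy): gradient ascent on the expected frozen Q-value.

    For a softmax policy, d/d(logit_a) of sum_a pi(a) Q(s, a) equals
    pi(a) * (Q(s, a) - sum_b pi(b) Q(s, b)). No environment is queried.
    """
    actions = defaultdict(list)
    for state, action in q:
        actions[state].append(action)
    logits = {s: [0.0] * len(acts) for s, acts in actions.items()}
    for _ in range(steps):
        for s, acts in actions.items():
            probs = softmax(logits[s])
            qs = [q[(s, a)] for a in acts]
            baseline = sum(p * v for p, v in zip(probs, qs))
            logits[s] = [
                l + lr * p * (v - baseline)
                for l, p, v in zip(logits[s], probs, qs)
            ]
    return {s: dict(zip(acts, softmax(logits[s]))) for s, acts in actions.items()}


q = pretrain_value_model(OFFLINE_DATA)  # frozen after stage 1
policy = optimize_policy(q)             # stage 2 only reads q
best = max(policy["home_screen"], key=policy["home_screen"].get)
print(best, round(policy["home_screen"][best], 3))
```

The point of the sketch is the separation of concerns: the value table is built once from logged data and never updated during policy improvement, so the policy step needs neither a simulator nor next-state predictions.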

Code & Models

Repositories

microsoft/gui-agent-rl
jax

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning