From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

Zezhou Wang; Ziyun Zhang; Xiaoyi Zhang; Zhuzhong Qian; Yan Lu

arXiv:2601.05787·cs.AI·February 11, 2026

TL;DR

This paper introduces BEPA, a novel method that leverages expert trajectories to improve end-to-end GUI agents by aligning static expert data with the learner's policy through a bi-level assimilation process, enhancing performance on benchmark tasks.

Contribution

BEPA presents a new bi-level approach to incorporate expert trajectories into reinforcement learning for GUI agents, addressing distribution mismatch and improving success rates on benchmark environments.

Findings

01. BEPA increases success rates on OSWorld-Verified from 22.87% to 32.13%.

02. BEPA improves performance on MMBench-GUI and Online-Mind2Web benchmarks.

03. The method effectively aligns expert data with the learner's policy, enabling better utilization of limited expert trajectories.
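
To put the first finding in perspective, the short sketch below works out the absolute and relative gain implied by the two OSWorld-Verified success rates quoted above. The helper name `improvement` is hypothetical and used only for illustration; the two input figures are the only values taken from the page.

```python
def improvement(baseline_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_pct - baseline_pct
    relative = 100.0 * absolute / baseline_pct
    return absolute, relative


# Success rates on OSWorld-Verified as reported in the first finding.
abs_gain, rel_gain = improvement(22.87, 32.13)

print(f"absolute gain: {abs_gain:.2f} percentage points")  # 9.26
print(f"relative gain: {rel_gain:.1f}%")                   # 40.5%
```

Read this way, the reported change is a gain of about 9.26 percentage points, or roughly 40% relative to the 22.87% baseline.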

Abstract

Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution…

Code & Models

Models

🤗 LEONW24/BEPA-7B-S2 (model, 4 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning