ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

Rushuai Yang; Hecheng Wang; Chiming Liu; Xiaohan Yan; Yunlong Wang; Xuan Du; Shuoyu Yue; Yongcheng Liu; Chuheng Zhang; Lizhe Qi; Yi Chen; Wei Shan; Maoqing Yao

arXiv:2602.12691·cs.RO·February 24, 2026

TL;DR

ALOE introduces an action-level off-policy evaluation method for vision-language-action models, enabling more effective reinforcement learning in real-world tasks by evaluating individual actions rather than entire trajectories.

Contribution

The paper proposes ALOE, a novel chunking-based temporal-difference bootstrapping framework for off-policy evaluation of individual actions in VLA systems, enhancing learning stability and efficiency.
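To make the idea of chunking-based temporal-difference bootstrapping concrete, here is a minimal sketch of a generic chunk-level TD target. It is an illustration under stated assumptions, not the paper's implementation: the function name `chunked_td_target`, the per-step reward list, the discount `gamma`, and the `next_q` argument are all hypothetical. The off-policy aspect is assumed to enter through `next_q`, which would be the value estimate for the action chunk the current policy selects at the next state, rather than the chunk logged in the data.

```python
from typing import Sequence


def chunked_td_target(
    rewards: Sequence[float],
    next_q: float,
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Generic TD target for one action chunk (illustrative, not ALOE itself).

    rewards: per-step rewards observed while executing the chunk.
    next_q:  value estimate at the state reached after the chunk, evaluated
             for the chunk the *current* policy would choose there
             (assumption: this is what makes the target off-policy).
    done:    True if the episode ended inside the chunk, so no bootstrap.
    """
    # Discounted sum of the rewards collected inside the chunk.
    chunk_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    # Bootstrap from the value estimate after the chunk, discounted by its length.
    bootstrap = 0.0 if done else (gamma ** len(rewards)) * next_q
    return chunk_return + bootstrap


# Two-step chunk with reward 1.0 per step, bootstrapping from a value of 10.0.
print(chunked_td_target([1.0, 1.0], next_q=10.0, gamma=0.5))  # 4.0
```

Treating a whole chunk as one decision shortens the effective horizon the value estimate has to bootstrap across, which is one plausible reason chunking helps stability; the paper should be consulted for the actual formulation.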

Findings

01. Improves learning efficiency across diverse real-world tasks.
02. Supports stable policy improvement without sacrificing execution speed.
03. Effective in high-precision, long-horizon, and multi-object perception tasks.

Abstract

We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based…

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning