GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Xiongbin Wu; Zhihao Luo; Shanzhe Lei; Lechao Zhang; Xuhong Wang; Jie Yang; Zhonglong Zheng; Yuanjie Zheng; Xin Tan; Wei Liu

arXiv:2605.20246·cs.LG·May 22, 2026

TL;DR

GROW is a reinforcement learning framework for open-world vision-language model (VLM) agents. It decomposes collected trajectories into state-action samples, which makes multi-turn RL with GRPO practical, and achieves state-of-the-art results on Minecraft tasks.

Contribution

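The paper proposes GROW, an RL framework that adapts GRPO to multi-turn open-world VLM agents. Instead of treating a full trajectory as a single training sample, it decomposes collected trajectories into state-action samples and computes advantages between those samples, avoiding the excessively long context and noise that full-trajectory training introduces.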

Findings

01 Achieves state-of-the-art performance on over 800 Minecraft tasks.

02 Decomposes collected trajectories into state-action samples and computes advantages between those samples rather than over whole trajectories.

03 Provides a surrogate analysis showing that policy optimization signals are preserved.

Abstract

Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations. Advanced reinforcement learning (RL) algorithms, specifically Group Relative Policy Optimization (GRPO), have not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long context and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate…
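
The decomposition idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the excerpt does not specify how rewards are assigned to individual samples or how groups are formed, so the sketch assumes that each state-action sample inherits its trajectory's outcome reward and that advantages are normalized over the pooled samples of one group using the usual GRPO-style mean and standard deviation. The names `StepSample`, `decompose`, and `group_relative_advantages` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class StepSample:
    state: str     # stand-in for a visual observation
    action: str    # stand-in for the action emitted at that state
    reward: float  # ASSUMPTION: each step inherits its trajectory's outcome reward


def decompose(trajectories):
    """Flatten a group of trajectories into independent state-action samples."""
    samples = []
    for traj in trajectories:
        for state, action in traj["steps"]:
            samples.append(StepSample(state, action, traj["reward"]))
    return samples


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (r - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of two rollouts for the same task (illustrative data only).
group = [
    {"steps": [("frame_0", "move_forward"), ("frame_1", "attack")], "reward": 1.0},
    {"steps": [("frame_0", "turn_left")], "reward": 0.0},
]
samples = decompose(group)
advantages = group_relative_advantages([s.reward for s in samples])
for sample, adv in zip(samples, advantages):
    print(sample.state, sample.action, round(adv, 3))
```

Each resulting sample carries a single state and action, so the training context stays short regardless of trajectory length. Note that under the pooling assumption used here, longer trajectories contribute more samples to the group statistics; the paper may handle this differently.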
