GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
Xiongbin Wu, Zhihao Luo, Shanzhe Lei, Lechao Zhang, Xuhong Wang, Jie Yang, Zhonglong Zheng, Yuanjie Zheng, Xin Tan, Wei Liu

TL;DR
GROW is a reinforcement learning framework for open-world vision-language model agents. It decomposes trajectories into state-action samples, which makes multi-turn RL practical, and achieves state-of-the-art results on Minecraft tasks.
Contribution
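The paper proposes GROW, an RL framework that adapts GRPO to multi-turn open-world VLM agents by decomposing collected trajectories into state-action samples and computing advantages between those samples rather than over whole trajectories, improving learning efficiency and performance.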
Findings
Achieves state-of-the-art performance on over 800 Minecraft tasks.
Decomposes trajectories into state-action samples, avoiding the long, noisy contexts that full-trajectory GRPO training requires.
Provides a surrogate analysis showing that policy optimization signals are preserved.
Abstract
Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations. Advanced reinforcement learning (RL) algorithms, specifically Group Relative Policy Optimization (GRPO), have not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long contexts and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate…
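The decomposition described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not say how rewards are assigned to individual samples or how groups are formed, so the sketch assumes that every state-action sample inherits the return of the trajectory it came from and that all samples collected for one task form a single group. The advantage is the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation). The names `Step`, `Sample`, `decompose`, and `group_relative_advantages` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Step:
    state: str   # stand-in for a visual observation (assumption)
    action: str  # stand-in for the agent's action (assumption)


@dataclass
class Sample:
    state: str
    action: str
    reward: float
    advantage: float = 0.0


def decompose(trajectories, returns):
    """Split each trajectory into independent state-action samples.

    Assumption: each sample inherits the return of its source trajectory.
    """
    samples = []
    for trajectory, ret in zip(trajectories, returns):
        for step in trajectory:
            samples.append(Sample(step.state, step.action, ret))
    return samples


def group_relative_advantages(samples, eps=1e-8):
    """GRPO-style normalization over the group of samples (not trajectories)."""
    rewards = [s.reward for s in samples]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    for s in samples:
        s.advantage = (s.reward - mu) / (sigma + eps)
    return samples


# Two rollouts of the same hypothetical task: one succeeds, one fails.
rollouts = [
    [Step("obs_a0", "move"), Step("obs_a1", "mine")],
    [Step("obs_b0", "move"), Step("obs_b1", "wait")],
]
group = group_relative_advantages(decompose(rollouts, returns=[1.0, 0.0]))
```

Each resulting sample carries only a single state and action, so a policy update on it needs a short context instead of the full multi-turn history, which is the motivation the abstract gives for the decomposition.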
