Reinforced Reasoning for Embodied Planning
Di Wu, Jiaxin Fan, Junzhe Zang, Guanbo Wang, Wei Yin, Wenhao Li, Bo Jin

TL;DR
This paper introduces a reinforcement fine-tuning framework that enhances embodied planning by integrating reasoning capabilities, leading to significant performance improvements on interactive environment benchmarks.
Contribution
It presents a novel reinforcement fine-tuning approach that incorporates reasoning into embodied planning models, improving multi-step decision-making in dynamic environments.
Findings
Outperforms models of similar or larger size on the Embench benchmark
Shows strong generalization to unseen environments
Demonstrates the effectiveness of reinforcement-driven reasoning in embodied AI
Abstract
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle with the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. In this work, we introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning. We first distill a high-quality dataset from a powerful closed-source model and perform supervised fine-tuning (SFT) to equip the model with structured decision-making priors. We then design a rule-based reward function tailored to multi-step action quality and optimize the policy via Group Relative Policy Optimization (GRPO). Our approach is evaluated on Embench, a recent benchmark for…
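To make the reward-plus-GRPO step more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the reward rules, so `prefix_match_reward` is a hypothetical stand-in that scores a predicted plan by the length of its correct prefix against a reference plan. `group_relative_advantages` shows the group-normalized advantage that GRPO (commonly expanded as Group Relative Policy Optimization) computes over several sampled responses to the same prompt, here without the policy-gradient update or KL term.

```python
from statistics import mean, pstdev


def prefix_match_reward(predicted, reference):
    """Hypothetical rule-based reward (an assumption, not from the paper):
    fraction of the reference plan covered by the longest correct prefix."""
    if not reference:
        return 0.0
    matched = 0
    for pred_step, ref_step in zip(predicted, reference):
        if pred_step != ref_step:
            break
        matched += 1
    return matched / len(reference)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative data only: one goal, three sampled plans.
reference_plan = ["find mug", "pick up mug", "put mug in sink"]
sampled_plans = [
    ["find mug", "pick up mug", "put mug in sink"],  # fully correct
    ["find mug", "open fridge", "put mug in sink"],  # diverges at step 2
    ["open fridge", "find mug", "pick up mug"],      # wrong from the start
]

rewards = [prefix_match_reward(p, reference_plan) for p in sampled_plans]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, a plan is reinforced only if it scores better than the other samples for the same prompt; when every sample earns the same reward, all advantages are zero and the group contributes no learning signal.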
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robotic Path Planning Algorithms
