Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning
Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, and Guillaume Sartoretti

TL;DR
This paper introduces BPO, a three-stage data curation framework (bootstrapping, extrapolation, and refinement) that improves reasoning models for long-horizon, sparse-reward planning in interactive environments, achieving state-of-the-art results on ALFWorld, ScienceWorld, and WebShop.
Contribution
The paper proposes a self-improving data flywheel that combines planning quaternions, long-short chain-of-thought fusion, and a complexity-stratified curriculum to build robust agentic reasoning models.
Findings
Achieves state-of-the-art performance on ALFWorld, ScienceWorld, and WebShop.
Significantly improves token efficiency in reasoning models.
Demonstrates effective out-of-distribution generalization.
Abstract
Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum…
