Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, and Guillaume Sartoretti

arXiv:2508.03018·cs.AI·May 19, 2026

TL;DR

This paper introduces BPO, a three-stage data curation framework (bootstrapping, extrapolation, and refinement) that trains reasoning models for long-horizon, sparse-reward planning in interactive environments and reports state-of-the-art results on ALFWorld, ScienceWorld, and WebShop.

Contribution

The paper proposes a self-improving data flywheel that combines planning quaternions, long-short chain-of-thought fusion, and a complexity-stratified curriculum to build robust agentic reasoning models.

Findings

01 Achieves state-of-the-art performance on ALFWorld, ScienceWorld, and WebShop.

02 Significantly improves token efficiency in reasoning models.

03 Demonstrates effective out-of-distribution generalization.

Abstract

Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum…
