Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
Yiliu Sun, Zicheng Zhao, Yang Wei, Yanfang Zhang, Chen Gong

TL;DR
This paper introduces Progressive Prefix-token Policy Optimization (PPPO), a reinforcement learning approach that concentrates training on the prefix tokens of LLM reasoning outputs. By optimizing early reasoning steps with targeted training strategies, it reports an 18.02% accuracy improvement while using only 26.17% of the training tokens.
Contribution
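The paper proposes PPPO, a new RLVR method that emphasizes prefix-token optimization in LLM reasoning. It is inspired by the human thinking theory of Path Dependence, in which early-stage thoughts constrain the subsequent thinking trajectory, and it adds strategies to improve training efficiency and reasoning quality.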
Findings
PPPO outperforms existing RLVR methods in reasoning tasks.
It achieves an 18.02% accuracy improvement while training on only 26.17% of the tokens.
It is effective at improving early reasoning steps and, through them, overall model performance.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an…
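The idea of restricting the training signal to prefix tokens can be illustrated with a simple token mask. The following is a minimal sketch under stated assumptions, not the authors' PPPO implementation: the function names, the fixed `prefix_fraction`, and the REINFORCE-style loss are all hypothetical choices made for illustration, and the abstract as quoted here does not specify how PPPO selects or schedules the prefix.

```python
from typing import List


def prefix_mask(seq_len: int, prefix_fraction: float) -> List[int]:
    """Return a 0/1 mask that keeps only the leading tokens of a sequence.

    `prefix_fraction` is a hypothetical hyperparameter for this sketch;
    it is not a value taken from the paper.
    """
    if not 0.0 < prefix_fraction <= 1.0:
        raise ValueError("prefix_fraction must be in (0, 1]")
    keep = max(1, int(seq_len * prefix_fraction))  # always keep at least one token
    return [1 if i < keep else 0 for i in range(seq_len)]


def masked_policy_loss(
    log_probs: List[float], advantage: float, mask: List[int]
) -> float:
    """REINFORCE-style loss averaged over the tokens selected by `mask`.

    Tokens with mask 0 contribute nothing, so only the prefix is optimized.
    """
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = sum(-advantage * lp * m for lp, m in zip(log_probs, mask))
    return total / kept


# Usage: an 8-token response where only the first quarter is trained on.
mask = prefix_mask(8, 0.25)
loss = masked_policy_loss([-1.0, -2.0, -3.0, -4.0, -1.0, -1.0, -1.0, -1.0], 1.0, mask)
```

With a uniform (all-ones) mask this reduces to ordinary token-level training over the full response, which is the baseline behaviour the abstract contrasts against.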
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
