Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning

Yiliu Sun; Zicheng Zhao; Yang Wei; Yanfang Zhang; Chen Gong

arXiv:2512.15274 · cs.CL · December 18, 2025



TL;DR

This paper introduces PPPO, a reinforcement learning approach that concentrates training on the prefix tokens of LLM reasoning outputs. By optimizing the early reasoning steps with targeted training strategies, it achieves substantial accuracy improvements.

Contribution

The paper proposes PPPO, a new RLVR method that emphasizes prefix-token optimization in LLM reasoning. Inspired by the Path Dependence theory of human thinking, it adds strategies that improve both training efficiency and reasoning quality.

Findings

01. PPPO outperforms existing RLVR methods on reasoning tasks.

02. It achieves an 18.02% accuracy improvement while using only 26.17% of the training tokens.

03. It improves both the early reasoning steps and overall model performance.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an…
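The abstract describes the core idea only in prose, and this excerpt is cut off before PPPO's actual objective is given. As a rough illustration of what restricting RLVR updates to the prefix segment could look like, here is a minimal sketch. It assumes a GRPO-style group-normalized advantage (the base algorithm is not stated in this excerpt) and assumes that "prefix" means the first `prefix_len` tokens of each sampled response. The names `prefix_mask`, `progressive_prefix_len`, `group_normalized_advantages` and `prefix_masked_pg_loss`, as well as the linear schedule for the prefix length, are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def prefix_mask(response_len: int, prefix_len: int) -> List[float]:
    """Hypothetical helper: 1.0 for the first `prefix_len` tokens, 0.0 after."""
    k = max(0, min(prefix_len, response_len))
    return [1.0] * k + [0.0] * (response_len - k)


def progressive_prefix_len(step: int, total_steps: int, start: int, end: int) -> int:
    """Hypothetical linear schedule moving the trained prefix length
    from `start` to `end` tokens over training (an assumption, not PPPO's rule)."""
    if total_steps <= 0:
        return end
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(start + frac * (end - start)))


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages (assumed): (r - mean) / (std + eps) within one group
    of responses sampled for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def prefix_masked_pg_loss(
    logprobs: Sequence[Sequence[float]],
    advantages: Sequence[float],
    prefix_len: int,
) -> float:
    """REINFORCE-style token loss averaged over prefix tokens only.

    logprobs:   per-response lists of token log-probabilities under the policy
    advantages: one scalar advantage per response
    Tokens beyond the prefix get mask 0.0 and contribute nothing.
    """
    total, count = 0.0, 0
    for lp, adv in zip(logprobs, advantages):
        for token_lp, m in zip(lp, prefix_mask(len(lp), prefix_len)):
            total += -adv * token_lp * m  # minimizing this raises log-prob when adv > 0
            count += int(m)
    return total / count if count else 0.0


if __name__ == "__main__":
    k = progressive_prefix_len(step=50, total_steps=100, start=32, end=512)
    adv = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])
    print(k, [round(a, 3) for a in adv])
    print(prefix_masked_pg_loss([[-0.5, -1.0, -2.0], [-0.2, -0.4]], [1.0, -1.0], prefix_len=2))
```

In a real trainer the same 0/1 mask would multiply per-token losses inside an autodiff framework; this plain-Python version only computes the scalar loss value to show which tokens contribute and which are skipped.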


Videos

Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning (Underline)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification