RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization

Hongzhu Yi; Xinming Wang; Zhenghao Zhang; Tianyu Zong; Yuanxiang Wang; Jun Xie; Tao Yu; Haopeng Jin; Kaixin Xu; Feng Chen; Jiahuan Chen; Yujia Yang; Zhenyu Guan; Bingkang Shi; Jungang Xu

arXiv:2601.19404 · cs.AI · February 2, 2026

Open Access

TL;DR

RPO is a reinforcement fine-tuning method that generates only partial reasoning paths during training, cutting training time by 90% for a 1.5B model and by 72% for a 7B model while keeping performance comparable to traditional reinforcement fine-tuning algorithms.

Contribution

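The paper introduces RPO, a plug-and-play algorithm that trains the model by generating only suffixes of the reasoning path, using an experience cache, rather than full reasoning paths. This reduces token generation in the rollout phase by approximately 95% and shortens training time compared with traditional reinforcement fine-tuning.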

Findings

01. Reduces token generation during the rollout phase of training by approximately 95%.

02. Decreases training time by 90% for the 1.5B model and by 72% for the 7B model.

03. Maintains performance comparable to traditional reinforcement fine-tuning algorithms.
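
To put the percentages above on a common scale, a fractional reduction r in cost corresponds to a factor of 1 / (1 - r). The short sketch below applies that identity to the reported figures. The helper name `factor_from_reduction` is hypothetical, and the resulting factors are derived arithmetic, not numbers reported by the paper.

```python
def factor_from_reduction(reduction: float) -> float:
    """Convert a fractional cost reduction (e.g. 0.90 for 90%) into a
    'times cheaper' factor, using factor = 1 / (1 - reduction)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Figures taken from the findings above; the factors are simple arithmetic.
token_factor = factor_from_reduction(0.95)      # rollout tokens generated
time_factor_1_5b = factor_from_reduction(0.90)  # training time, 1.5B model
time_factor_7b = factor_from_reduction(0.72)    # training time, 7B model

print(f"tokens: ~{token_factor:.1f}x fewer")
print(f"1.5B training time: ~{time_factor_1_5b:.1f}x shorter")
print(f"7B training time: ~{time_factor_7b:.1f}x shorter")
```

Read this way, a roughly 20-fold drop in generated rollout tokens goes together with a roughly 10-fold and 3.6-fold drop in training time for the two model sizes, which is consistent with rollout generation being only one part of the total training cost.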

Abstract

Within the domain of large language models, reinforcement fine-tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using an experience cache. RPO reduces token generation in the rollout phase of training by approximately 95%, greatly lowering the theoretical time overhead. Compared with…
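
The idea of generating only a suffix of the reasoning path can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `ExperienceCache` class, the `partial_rollout` function, the `suffix_fraction` parameter, the fixed path length, and the stand-in `toy_generate` function are hypothetical and are not taken from the paper. The excerpt does not describe how the cache is filled or updated, so this sketch fills it with one full rollout and then leaves it unchanged.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExperienceCache:
    """Hypothetical cache: maps a query to one stored reasoning path (token list)."""
    paths: dict = field(default_factory=dict)

    def get(self, query):
        return self.paths.get(query)

    def put(self, query, tokens):
        self.paths[query] = list(tokens)


def toy_generate(query, prefix, max_new_tokens, rng):
    """Stand-in for a policy model: emits random placeholder tokens."""
    return [f"tok{rng.randint(0, 9)}" for _ in range(max_new_tokens)]


def partial_rollout(query, cache, generate, rng,
                    suffix_fraction=0.05, full_length=200):
    """Return (reasoning_path, number_of_newly_generated_tokens).

    Assumption: on a cache miss, do one full rollout and store it.
    On a cache hit, reuse the cached prefix and generate only the suffix.
    """
    cached = cache.get(query)
    if cached is None:
        path = generate(query, [], full_length, rng)
        cache.put(query, path)
        return path, len(path)

    n_suffix = max(1, round(len(cached) * suffix_fraction))
    prefix = cached[:len(cached) - n_suffix]          # reused, not regenerated
    suffix = generate(query, prefix, n_suffix, rng)   # only this part is sampled
    return prefix + suffix, len(suffix)


rng = random.Random(0)
cache = ExperienceCache()
first_path, first_generated = partial_rollout("q1", cache, toy_generate, rng)
second_path, second_generated = partial_rollout("q1", cache, toy_generate, rng)

savings = 1 - second_generated / first_generated
print(f"generated {first_generated} tokens, then {second_generated} "
      f"({savings:.0%} fewer)")
```

With `suffix_fraction=0.05`, the second rollout samples 5% of the tokens that the full rollout did, which mirrors the approximately 95% reduction quoted in the abstract. A real implementation would also need a reward computed on the completed path and a policy update, both omitted here.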

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling