SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin
Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu

TL;DR
This paper introduces SPPD, a novel self-training framework that uses process preference learning with dynamic value margins, improving reasoning capabilities of large language models without relying on distillation or human annotations.
Contribution
The paper proposes SPPD, a self-training method utilizing process-based MDP and dynamic value margins, theoretically linking it to policy gradient methods and demonstrating superior performance.
Findings
Outperforms existing methods on mathematical benchmarks
Eliminates need for distillation or human annotations
Proves theoretical equivalence to policy gradient methods
Abstract
Recently, enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain of Thought) rely on prompt selection and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on distillation from stronger models or on human annotations; and Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these, we propose a Self-training framework integrating Process Preference learning using Dynamic value margin (SPPD). SPPD leverages a process-based Markov Decision Process (MDP) and the Bellman optimality equation to derive a dynamic value margin on step-level…
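The abstract is cut off before it states the training objective, so SPPD's actual loss is not shown here. As a generic illustration only of what a step-level preference loss with a margin can look like, the sketch below uses a DPO-style logistic loss with an additive margin. The function name, the `beta` parameter, and the choice of a margin taken as a difference of two value estimates are assumptions made for this example, not details from the paper.

```python
import math


def margin_preference_loss(logratio_chosen, logratio_rejected, margin, beta=0.1):
    """DPO-style logistic loss with an additive margin (illustrative only).

    logratio_* is assumed to be log pi_theta(step | prefix) - log pi_ref(step | prefix)
    for the preferred and dispreferred reasoning step. `margin` is a hypothetical
    per-pair value; here it is not claimed to match the paper's dynamic value margin.
    """
    z = beta * (logratio_chosen - logratio_rejected) - margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical value estimates for the two steps; their gap is used as the margin.
value_chosen, value_rejected = 0.8, 0.3
loss = margin_preference_loss(1.2, -0.4, margin=value_chosen - value_rejected)
print(loss)
```

In this form a larger margin demands a larger log-ratio gap between the preferred and dispreferred step before the loss becomes small, which is the general effect a margin term has in a preference objective.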
Taxonomy
Topics: Business Process Modeling and Analysis
