SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

Hao Yi; Qingyang Li; Yulan Hu; Fuzheng Zhang; Di Zhang; Yong Liu

arXiv:2502.13516 · cs.AI · February 20, 2025

Open Access

TL;DR

This paper introduces SPPD, a novel self-training framework that uses process preference learning with dynamic value margins, improving reasoning capabilities of large language models without relying on distillation or human annotations.

Contribution

The paper proposes SPPD, a self-training method built on a process-based MDP and dynamic value margins. It links the method theoretically to policy gradient methods and reports improved performance on mathematical benchmarks.

Findings

01 Outperforms existing methods on mathematical benchmarks

02 Eliminates need for distillation or human annotations

03 Proves theoretical equivalence to policy gradient methods

Abstract

Recently, enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain of Thought) rely on prompt selection and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on distillation from stronger models or human annotations; and Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these, we propose a Self-training framework integrating Process Preference learning using Dynamic value margin (SPPD). SPPD leverages a process-based Markov Decision Process (MDP) and the Bellman optimality equation to derive a dynamic value margin on step-level…
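The abstract is cut off before it states how the dynamic value margin enters the objective, so the exact SPPD loss is not visible here. As a rough illustration only, the sketch below shows a generic step-level DPO-style loss with an additive margin. The function name `step_dpo_loss_with_margin`, the choice of margin as a difference of step value estimates, and the default `beta` are all assumptions made for this example and are not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def step_dpo_loss_with_margin(
    logp_w: float,      # policy log-prob of the preferred step
    ref_logp_w: float,  # reference-model log-prob of the preferred step
    logp_l: float,      # policy log-prob of the dispreferred step
    ref_logp_l: float,  # reference-model log-prob of the dispreferred step
    value_w: float,     # hypothetical value estimate after the preferred step
    value_l: float,     # hypothetical value estimate after the dispreferred step
    beta: float = 0.1,  # assumed temperature, as in standard DPO
) -> float:
    # Implicit reward difference between the two candidate steps (standard DPO form).
    reward_diff = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # ASSUMPTION: the margin is the gap between the two value estimates.
    # The paper's actual margin may be defined differently.
    margin = value_w - value_l
    # A larger margin demands a larger reward gap before the loss becomes small.
    return -log_sigmoid(reward_diff - margin)
```

With identical log-ratios and identical values the loss reduces to log 2, the usual DPO value at initialization. Under this assumed form, a step pair with a larger value gap is penalized more until the policy separates the two steps accordingly.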

Taxonomy

Topics: Business Process Modeling and Analysis