Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning

Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, and Anh Tuan Luu

arXiv:2502.14356 · cs.CL · February 21, 2025

TL;DR

Full-Step-DPO is a DPO framework for mathematical reasoning. It trains a self-supervised process reward model to score every step of a reasoning chain, then uses those step-wise rewards in a loss that dynamically adjusts gradient updates, rather than optimizing only the first erroneous step.

Contribution

The paper proposes a self-supervised process reward model and a step-wise DPO loss that together improve language models' mathematical reasoning.

Findings

01 Outperforms state-of-the-art baselines on multiple benchmarks

02 Effective in both in-domain and out-of-domain scenarios

03 Improves reasoning accuracy across various base models

Abstract

Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards. This endows stronger reasoning capabilities to language…
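
To make the idea of a reward-weighted, step-wise preference loss concrete, here is a minimal sketch in plain Python. It is not the paper's loss: the abstract does not give the formula, so the weighting scheme (softmax over step rewards for the preferred chain, softmax over negated rewards for the rejected chain), the function names, and the default `beta` are all assumptions made for illustration. The inputs are per-step log-probability ratios log π(step) − log π_ref(step), which a real implementation would compute from the policy and a frozen reference model.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def softmax(values):
    # Standard softmax with max-subtraction for stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_dpo_loss(chosen_logratios, rejected_logratios, beta=0.1):
    # Standard DPO on whole chains: every step contributes equally.
    margin = beta * (sum(chosen_logratios) - sum(rejected_logratios))
    return -log_sigmoid(margin)


def stepwise_dpo_loss(chosen_logratios, rejected_logratios,
                      chosen_rewards, rejected_rewards, beta=0.1):
    # ASSUMPTION (hypothetical weighting, not taken from the paper):
    # high-reward steps of the preferred chain and low-reward steps of the
    # rejected chain receive larger weights, and therefore larger gradients.
    chosen_w = softmax(chosen_rewards)
    rejected_w = softmax([-r for r in rejected_rewards])
    # Scale by the step count so uniform rewards recover vanilla DPO.
    chosen_term = len(chosen_logratios) * sum(
        w * lr for w, lr in zip(chosen_w, chosen_logratios))
    rejected_term = len(rejected_logratios) * sum(
        w * lr for w, lr in zip(rejected_w, rejected_logratios))
    return -log_sigmoid(beta * (chosen_term - rejected_term))


# Example call with made-up numbers: three steps in the preferred chain,
# two in the rejected chain, rewards as a process reward model might score them.
loss = stepwise_dpo_loss(
    chosen_logratios=[0.2, -0.1, 0.3],
    rejected_logratios=[-0.4, 0.1],
    chosen_rewards=[0.9, 0.8, 0.95],
    rejected_rewards=[0.7, 0.1],
)
```

When all step rewards within a chain are equal, the weights become uniform and this sketch reduces to the standard chain-level DPO loss, which makes it easy to see what the step-wise rewards change: they only redistribute how much each step contributes to the preference margin.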

Taxonomy

Topics: Data Management and Algorithms · Constraint Satisfaction and Optimization · Fuzzy Logic and Control Systems

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Label Smoothing · Multi-Head Attention · Direct Preference Optimization