Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Tianduo Wang; Shichen Li; Wei Lu

arXiv:2407.18248·cs.CL·July 26, 2024

TL;DR

This paper introduces a self-training method enhanced with Direct Preference Optimization to improve the reasoning capabilities of small language models, making them more accurate and scalable for mathematical reasoning tasks.

Contribution

The work demonstrates that combining self-training with DPO significantly enhances small LMs' reasoning abilities, offering a cost-effective alternative to large proprietary models.

Findings

01. Improved reasoning accuracy across multiple tasks
02. Enhanced diversity and correctness of generated reasoning
03. More scalable and cost-effective training approach

Abstract

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process where models learn from their own outputs. We also show that the conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We…
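As a rough illustration of how preference data could enter self-training, the sketch below pairs a model's own sampled rationales and scores one pair with the standard DPO objective, -log sigmoid(beta * ((log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x)))). The abstract does not say how pairs are built, so the answer-matching rule is an assumption, and the names `build_preference_pairs`, `dpo_loss`, and the default `beta=0.1` are hypothetical and not taken from the authors' repository.

```python
import math


def build_preference_pairs(samples, gold_answer):
    """Pair self-generated rationales as (chosen, rejected).

    Assumption: a rationale is "chosen" when its final answer matches the
    gold answer and "rejected" otherwise. Each sample is a dict with the
    hypothetical keys "rationale" and "answer".
    """
    correct = [s["rationale"] for s in samples if s["answer"] == gold_answer]
    wrong = [s["rationale"] for s in samples if s["answer"] != gold_answer]
    return [(c, w) for c in correct for w in wrong]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    samples = [
        {"rationale": "2 + 2 = 4", "answer": "4"},
        {"rationale": "2 + 2 = 5", "answer": "5"},
    ]
    print(build_preference_pairs(samples, gold_answer="4"))
    # Policy that already favors the chosen rationale more than the reference.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss drops below that as the policy shifts probability toward the chosen rationale relative to the reference.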

Code & Models

Repositories

tianduowang/dpo-st (PyTorch, official)

Taxonomy

Topics: Cognitive Science and Mapping

Methods: Attention Is All You Need · Direct Preference Optimization · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings