Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
Tianduo Wang, Shichen Li, Wei Lu

TL;DR
This paper introduces a self-training method augmented with Direct Preference Optimization (DPO) that improves the mathematical reasoning of small language models without relying on data distilled from larger proprietary models.
Contribution
The work demonstrates that combining self-training with DPO significantly enhances small LMs' reasoning abilities, offering a cost-effective alternative to large proprietary models.
Findings
Improved reasoning accuracy across multiple tasks
Enhanced diversity and correctness of generated reasoning
More scalable and cost-effective training approach
Abstract
Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process where models learn from their own outputs. We also show that conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We…
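As a rough illustration of the idea in the abstract, the sketch below shows how preference data could be built from a model's own samples and scored with the standard DPO objective, -log sigmoid(beta * ((log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x)))). It is not the authors' implementation. The pairing rule (rationales with a correct final answer are "chosen", those with an incorrect one are "rejected"), the function names build_preference_pairs and dpo_loss, the sample dictionary layout, and the default beta=0.1 are all assumptions made for this example.

```python
import math
from itertools import product


def build_preference_pairs(samples, gold_answer):
    """Pair every correct self-generated rationale with every incorrect one.

    `samples` is a hypothetical list of dicts with "rationale" and "answer"
    keys. Labelling by final-answer correctness is an assumption of this sketch.
    """
    correct = [s for s in samples if s["answer"] == gold_answer]
    wrong = [s for s in samples if s["answer"] != gold_answer]
    return [(c["rationale"], w["rationale"]) for c, w in product(correct, wrong)]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained and
    under a frozen reference model. beta=0.1 is an illustrative default.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    samples = [
        {"rationale": "2 + 2 = 4", "answer": "4"},
        {"rationale": "2 + 2 = 5", "answer": "5"},
        {"rationale": "2 * 2 = 4", "answer": "4"},
    ]
    for chosen, rejected in build_preference_pairs(samples, gold_answer="4"):
        print(f"chosen: {chosen!r}  rejected: {rejected!r}")
    # The loss falls as the policy prefers the chosen rationale more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-2.0, -2.0, -2.0, -2.0))
```

In a real training loop the log-probabilities would come from the language model and a frozen copy of it, and the loss would be averaged over a batch and backpropagated. Plain floats are used here only to keep the example self-contained.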
Taxonomy
Topics: Cognitive Science and Mapping
Methods: Attention Is All You Need · Direct Preference Optimization · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings
