Iterative Reasoning Preference Optimization
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston

TL;DR
This paper introduces an iterative preference optimization method that improves the reasoning of language models by training on pairs of generated Chain-of-Thought candidates, preferring reasoning that reaches the correct answer over reasoning that does not. Accuracy on reasoning benchmarks increases across iterations.
Contribution
It develops an iterative approach that trains with a modified DPO loss, augmented with a negative log-likelihood term, to improve reasoning performance using only the examples in the training set.
Findings
Accuracy increases on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat.
Outperforms other Llama-2-based models that do not rely on additionally sourced datasets.
Achieves 81.6% accuracy on GSM8K with iterative preference optimization.
Abstract
Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For…
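The training objective described in the abstract, a DPO loss with an added negative log-likelihood term on the winning sequence, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the weights `alpha` and `beta`, the length normalization of the NLL term, and the pairing of every correct candidate with every incorrect one are all assumptions made for the example. Inputs are summed token log-probabilities of each full sequence (reasoning plus answer) under the current model and under a frozen reference model.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_nll_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, len_w,
                 beta=0.1, alpha=1.0):
    """DPO loss plus an NLL term on the winning sequence (sketch).

    logp_w, logp_l: log-probabilities of the winning and losing
        sequences under the model being trained.
    ref_logp_w, ref_logp_l: the same quantities under the reference model.
    len_w: token count of the winning sequence (assumed normalizer).
    beta, alpha: assumed hyperparameters; values here are placeholders.
    """
    # Standard DPO margin: how much more the policy favors the winner
    # over the loser, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo_term = -log_sigmoid(margin)
    # Additional term: length-normalized NLL of the winning sequence.
    nll_term = -logp_w / len_w
    return alpha * dpo_term + nll_term


def build_pairs(candidates, gold_answer):
    """Pair candidates whose final answer is correct (winners) with
    candidates whose final answer is wrong (losers).

    Each candidate is a dict with hypothetical keys "cot" and "answer".
    """
    winners = [c for c in candidates if c["answer"] == gold_answer]
    losers = [c for c in candidates if c["answer"] != gold_answer]
    return [(w, l) for w in winners for l in losers]
```

When the policy equals the reference model, the margin is zero and the DPO term equals log 2, so only the NLL term distinguishes winning sequences of different likelihood. In an iterative setup, the model trained in one round would generate the candidates and serve as the reference for the next round.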
Taxonomy
Topics: Multi-Criteria Decision Making · Data Management and Algorithms · Constraint Satisfaction and Optimization
Methods: Direct Preference Optimization
