Iterative Reasoning Preference Optimization

Richard Yuanzhe Pang; Weizhe Yuan; Kyunghyun Cho; He He; Sainbayar Sukhbaatar; Jason Weston

arXiv:2404.19733 · cs.CL · June 27, 2024 · 2 cites

TL;DR

This paper introduces an iterative preference optimization method that enhances reasoning capabilities in language models by focusing on winning reasoning steps, leading to significant accuracy improvements on reasoning benchmarks.

Contribution

It develops a novel iterative approach using a modified DPO loss to improve reasoning performance without additional external data.
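As a rough illustration of how preference data can be built from the training set alone, the sketch below splits sampled Chain-of-Thought candidates into winners and losers by comparing each final answer with the gold label, then pairs them. The function name `build_preference_pairs`, the `(cot, answer)` tuple layout, the exact-match comparison, and the `max_pairs` cap are all assumptions made for illustration, not details confirmed by this page.

```python
import random
from typing import List, Tuple


def build_preference_pairs(
    candidates: List[Tuple[str, str]],  # hypothetical layout: (chain_of_thought, final_answer)
    gold_answer: str,
    max_pairs: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (winning_cot, losing_cot) pairs for one training prompt."""
    # Winners reach the gold answer; losers do not (exact match is an assumption).
    winners = [cot for cot, ans in candidates if ans == gold_answer]
    losers = [cot for cot, ans in candidates if ans != gold_answer]
    if not winners or not losers:
        # No contrast available for this prompt, so it yields no pairs.
        return []
    rng = random.Random(seed)
    all_pairs = [(w, l) for w in winners for l in losers]
    rng.shuffle(all_pairs)
    return all_pairs[:max_pairs]
```

In an iterative scheme, a step like this would be rerun after each round of training, with candidates sampled from the latest model.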

Findings

1. Significant accuracy improvements on GSM8K, MATH, and ARC-Challenge datasets.

2. Outperforms other Llama-2-based models that do not rely on additionally sourced datasets.

3. Achieves 81.6% accuracy on GSM8K with iterative preference optimization.

Abstract

Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For…
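A minimal sketch of a DPO loss with an added negative log-likelihood term on the winning sequence, for a single preference pair. It takes summed sequence log-probabilities as plain floats. The length normalization of the NLL term, the `nll_weight` coefficient, the default `beta`, and all function and parameter names are assumptions for illustration; the abstract does not specify how the two terms are weighted.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_nll_loss(
    policy_logp_w: float,  # log-prob of the winning sequence under the trained model
    policy_logp_l: float,  # log-prob of the losing sequence under the trained model
    ref_logp_w: float,     # log-prob of the winning sequence under the reference model
    ref_logp_l: float,     # log-prob of the losing sequence under the reference model
    len_w: int,            # token length of the winning sequence
    beta: float = 0.1,        # assumed default, not taken from the paper
    nll_weight: float = 1.0,  # assumed weighting between the two terms
) -> float:
    # Standard DPO term: prefer the winner's log-ratio over the loser's.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    dpo_term = -log_sigmoid(margin)
    # Extra NLL term on the winner, normalized by length (an assumption here).
    nll_term = -policy_logp_w / len_w
    return dpo_term + nll_weight * nll_term
```

The DPO term depends only on the relative margin between winner and loser, so it can be reduced even while the winner's own likelihood falls; the NLL term directly rewards raising the likelihood of the winning sequence.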

Taxonomy

Topics: Multi-Criteria Decision Making · Data Management and Algorithms · Constraint Satisfaction and Optimization

Methods: Direct Preference Optimization