Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization

Kaixuan Ji; Guanlin Liu; Ning Dai; Qingping Yang; Renjie Zheng; Zheng Wu; Chen Dun; Quanquan Gu; Lin Yan

arXiv:2410.09302 · cs.LG · February 12, 2025 · 2 cites

TL;DR

This paper introduces Direct Q-function Optimization (DQO), a novel offline reinforcement learning method that enhances multi-step reasoning in language models by formulating response generation as an MDP and optimizing a Q-function directly.

Contribution

The paper proposes DQO, an offline RL approach that combines an MDP formulation of response generation with the soft actor-critic (SAC) framework, with the aim of improving multi-step reasoning in language models over prior bandit-based methods.

Findings

01. DQO outperforms previous methods on the GSM8K and MATH datasets.

02. DQO effectively handles complex multi-step reasoning tasks.

03. The approach demonstrates promising results for offline RL in language model alignment.

Abstract

Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources due to the use of multiple models and extensive online sampling for training (e.g., PPO) or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks, such as math problem solving and complex reasoning that involve long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response generation process as a Markov Decision Process (MDP) and utilizes the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the language model. The MDP formulation of DQO offers structural advantages over bandit-based methods,…
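The MDP view of response generation mentioned in the abstract can be illustrated with a small sketch. It assumes a token-level MDP (the abstract does not say whether a step is a token or a whole reasoning step), a sparse reward that arrives only when the response ends, and the standard maximum-entropy relations that underlie SAC. The names `State`, `step`, `rollout`, `soft_value`, and `soft_policy` are hypothetical and do not come from the paper or its code; this is not the DQO loss itself.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-response token


@dataclass(frozen=True)
class State:
    """MDP state: the prompt plus every token generated so far."""
    prompt: Tuple[str, ...]
    generated: Tuple[str, ...]


def step(state: State, token: str,
         reward_fn: Callable[[State], float]) -> Tuple[State, float, bool]:
    """Deterministic transition: the action (a token) is appended to the state.

    Assumption: reward is sparse and only assigned when the response terminates.
    """
    next_state = State(state.prompt, state.generated + (token,))
    done = token == EOS
    reward = reward_fn(next_state) if done else 0.0
    return next_state, reward, done


def rollout(prompt: Sequence[str], tokens: Sequence[str],
            reward_fn: Callable[[State], float]) -> List[float]:
    """Replay a fixed (offline) response and collect the per-step rewards."""
    state = State(tuple(prompt), ())
    rewards = []
    for token in tokens:
        state, reward, done = step(state, token, reward_fn)
        rewards.append(reward)
        if done:
            break
    return rewards


def soft_value(q_values: Sequence[float], beta: float = 1.0) -> float:
    """Max-entropy state value: V(s) = beta * log sum_a exp(Q(s, a) / beta)."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in q_values))


def soft_policy(q_values: Sequence[float], beta: float = 1.0) -> List[float]:
    """Max-entropy policy: pi(a | s) = exp((Q(s, a) - V(s)) / beta)."""
    v = soft_value(q_values, beta)
    return [math.exp((q - v) / beta) for q in q_values]
```

In a bandit formulation the whole response would count as a single action with one reward. Here every prefix is its own state, so a Q-function can in principle assign value to intermediate steps of a long chain of thought even though the reward is only observed at the end.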

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Direct Preference Optimization