Direct Multi-Turn Preference Optimization for Language Agents

Wentao Shi; Mengqi Yuan; Junkang Wu; Qifan Wang; Fuli Feng

arXiv:2406.14868 · cs.CL · February 25, 2025

TL;DR

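This paper introduces DMPO, a loss function for multi-turn language agents. It extends direct preference optimization to multi-turn tasks by making the partition function independent of the current state and by correcting for length disparities between preferred and dis-preferred trajectories, and it is evaluated on three multi-turn agent datasets.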

Contribution

It proposes DMPO, a loss function for multi-turn preference optimization, obtained by replacing the policy constraint in the RL objective with a state-action occupancy measure constraint and adding length normalization to the Bradley-Terry model.
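
As a rough illustration of what length normalization in a Bradley-Terry-style preference loss can look like, the sketch below scores each trajectory by the mean per-turn log-ratio between the trained policy and a reference policy, then applies the usual negative log-sigmoid to the difference. This is not the paper's exact DMPO loss: the per-trajectory mean is an assumed, simplified normalization, and the function names, argument layout, and the default `beta` are hypothetical.

```python
import math


def trajectory_score(policy_logps, ref_logps, beta):
    """Length-normalized implicit reward for one trajectory.

    Each list holds one log-probability per agent turn, log pi(a_t | s_t),
    under the trained policy and the reference policy respectively.
    Dividing by the number of turns is an ASSUMED normalization, chosen
    only to show the idea; it is not taken from the paper.
    """
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("need two non-empty lists of equal length")
    log_ratios = [p - r for p, r in zip(policy_logps, ref_logps)]
    return beta * sum(log_ratios) / len(log_ratios)


def preference_loss(win_policy, win_ref, lose_policy, lose_ref, beta=0.1):
    """Bradley-Terry negative log-likelihood that the preferred
    trajectory beats the dis-preferred one: -log(sigmoid(margin))."""
    margin = (trajectory_score(win_policy, win_ref, beta)
              - trajectory_score(lose_policy, lose_ref, beta))
    # -log(sigmoid(m)) == softplus(-m); computed in a numerically stable way.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical per-turn log-probabilities for a 3-turn preferred
# trajectory and a 5-turn dis-preferred trajectory.
loss = preference_loss(
    win_policy=[-0.9, -1.1, -0.8], win_ref=[-1.2, -1.3, -1.0],
    lose_policy=[-1.5, -1.4, -1.6, -1.3, -1.7],
    lose_ref=[-1.2, -1.1, -1.3, -1.0, -1.4],
)
```

Without the division by trajectory length, the longer trajectory would contribute more summed terms to the margin simply because it has more turns, which is the kind of length disparity the normalization is meant to offset.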

Findings

01 DMPO outperforms existing methods on three multi-turn agent datasets.

02 Theoretical analysis supports the effectiveness of the proposed loss.

03 Extensive experiments demonstrate DMPO's superiority in preference optimization.

Abstract

Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on…
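
The cancellation the abstract refers to can be seen in the standard single-turn DPO derivation, sketched here in its usual notation. The symbols $x$ (prompt), $y_w$ and $y_l$ (preferred and dis-preferred responses), $\beta$, and $Z$ are not defined in this summary and follow the common DPO convention; the final remark about $Z(s_t)$ is an informal reading of the abstract rather than the paper's own derivation.

```latex
% KL-regularized objective and its closed-form optimum
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\;\;\Longrightarrow\;\;
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Solving for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry model: both responses share the same x, so \beta \log Z(x) cancels
p(y_w \succ y_l \mid x)
  = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
```

In a multi-turn task the analogous per-step term would be $Z(s_t)$, and the preferred and dis-preferred trajectories generally pass through different states $s_t$, so these terms no longer cancel in the difference. That is why the abstract asks for a partition function that does not depend on the current state.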

Code & Models

Repositories

swt-user/dmpo (PyTorch, official)

Videos

Direct Multi-Turn Preference Optimization for Language Agents · underline

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems

Methods: Direct Preference Optimization