Direct Multi-Turn Preference Optimization for Language Agents
Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, Fuli Feng

TL;DR
This paper introduces DMPO, a loss function for multi-turn language agents. It extends direct preference optimization by making the partition function independent of the current state and by normalizing for length disparities between preferred and dis-preferred trajectories, and it is evaluated on three multi-turn agent datasets.
Contribution
It proposes DMPO, a loss function for multi-turn preference optimization that replaces the policy constraint in the RL objective with a state-action occupancy measure constraint and adds length normalization to the Bradley-Terry model.
Findings
DMPO outperforms existing methods on three multi-turn agent datasets.
Theoretical analysis supports the effectiveness of the proposed loss.
Extensive experiments demonstrate DMPO's superiority in preference optimization.
Abstract
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation: it alleviates compounding errors and offers a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks is challenging because the partition function can no longer be cancelled. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on…
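To make the idea of length normalization in a Bradley-Terry style preference loss concrete, here is a minimal sketch. It is not the paper's DMPO loss: the abstract does not give the exact weighting, so this example assumes the simplest possible normalization (averaging per-turn log-ratios over the trajectory length). The function name, the `beta` parameter, and the input format are all hypothetical choices made for illustration.

```python
import math


def length_normalized_preference_loss(win_log_ratios, lose_log_ratios, beta=0.1):
    """Illustrative length-normalized Bradley-Terry loss (NOT the exact DMPO loss).

    Each argument is a list with one entry per turn, holding
    log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t) for that turn.
    Assumption: normalization is a plain average over turns.
    """
    if not win_log_ratios or not lose_log_ratios:
        raise ValueError("both trajectories need at least one turn")

    # Average instead of sum, so a longer trajectory does not dominate
    # the comparison merely because it has more turns.
    win_score = sum(win_log_ratios) / len(win_log_ratios)
    lose_score = sum(lose_log_ratios) / len(lose_log_ratios)

    margin = beta * (win_score - lose_score)

    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical trajectories: preferred has 2 turns, dis-preferred has 5.
preferred = [0.4, 0.6]
dispreferred = [-0.2, -0.2, -0.2, -0.2, -0.2]
loss = length_normalized_preference_loss(preferred, dispreferred)
```

With summed rather than averaged log-ratios, the five-turn trajectory would contribute more to the margin purely through its length; averaging removes that effect, which is the kind of length disparity the abstract refers to.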
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
Methods: Direct Preference Optimization
