SDPO: Segment-Level Direct Preference Optimization for Social Agents

Aobo Kong; Wentao Ma; Shiwan Zhao; Yongbin Li; Yuchuan Wu; Ke Wang; Xiaoqian Liu; Qicheng Li; Yong Qin; Fei Huang

arXiv:2501.01821·cs.AI·February 28, 2025

TL;DR

SDPO introduces a segment-level optimization method for social agents that enhances multi-turn dialogue alignment, reducing training noise and improving social behavior compared to existing methods.

Contribution

It proposes a novel segment-level DPO approach with a theoretical foundation, outperforming prior session-level methods in multi-turn social dialogue tasks.

Findings

01  SDPO-tuned agents outperform existing DPO methods.

02  SDPO reduces training noise in multi-turn alignment.

03  SDPO achieves superior performance on the SOTOPIA benchmark.

Abstract

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes…
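The abstract is cut off before it states the objective, so the following is only an illustrative sketch of what a segment-level preference loss could look like. It assumes the standard DPO loss (negative log-sigmoid of a scaled log-ratio margin) applied to log-probabilities summed over the agent's turns inside one selected segment. The function names, the per-turn log-probability lists, the `segment` index pair, and `beta=0.1` are all hypothetical and are not taken from the paper or its repository.

```python
import math


def segment_logprob(turn_logprobs, start, end):
    """Sum per-turn log-probabilities over the turns in [start, end)."""
    return sum(turn_logprobs[start:end])


def segment_dpo_loss(policy_chosen, policy_rejected,
                     ref_chosen, ref_rejected,
                     segment, beta=0.1):
    """Standard DPO loss restricted to one segment of a dialogue.

    Each *_chosen / *_rejected argument is a list with one log-probability
    per agent turn (hypothetical input format). `segment` is a
    (start, end) pair of turn indices; turns outside it are ignored.
    """
    start, end = segment
    # Log-ratio of policy to reference on the preferred segment.
    chosen_ratio = (segment_logprob(policy_chosen, start, end)
                    - segment_logprob(ref_chosen, start, end))
    # Log-ratio of policy to reference on the dispreferred segment.
    rejected_ratio = (segment_logprob(policy_rejected, start, end)
                      - segment_logprob(ref_rejected, start, end))
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy example: three agent turns, segment covers turns 1 and 2.
loss = segment_dpo_loss(
    policy_chosen=[-1.0, -0.5, -0.5],
    policy_rejected=[-1.0, -4.0, -4.0],
    ref_chosen=[-1.0, -2.0, -3.0],
    ref_rejected=[-1.0, -2.0, -3.0],
    segment=(1, 3),
)
```

In this sketch, restricting the sums to the segment is what keeps turns outside it from contributing to the gradient, which is one plausible reading of how a segment-level method could reduce the noise the abstract attributes to session-level data.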

Code & Models

Repositories

alibabaresearch/damo-convai
PyTorch · Official

Datasets

Tongyi-ConvAI/SDPO
dataset · 28 dl

Videos

SDPO: Segment-Level Direct Preference Optimization for Social Agents

Taxonomy

Topics: Reinforcement Learning in Robotics · Data Management and Algorithms · Multi-Agent Systems and Negotiation