SDPO: Segment-Level Direct Preference Optimization for Social Agents
Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang

TL;DR
SDPO introduces a segment-level optimization method for social agents that enhances multi-turn dialogue alignment, reducing training noise and improving social behavior compared to existing methods.
Contribution
It proposes a novel segment-level DPO approach with a theoretical foundation, outperforming prior session-level methods in multi-turn social dialogue tasks.
Findings
SDPO-tuned agents outperform existing DPO methods.
SDPO reduces training noise in multi-turn alignment.
SDPO achieves superior performance on the SOTOPIA benchmark.
Abstract
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes…
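The abstract is cut off before it states SDPO's objective, so the following is only a minimal sketch of the general idea: the standard DPO loss applied to log-probabilities summed over the agent turns inside a selected segment, rather than over a single turn or a whole session. The function names, the value of `beta`, the per-turn log-probabilities, and the half-open segment indexing are all illustrative assumptions, not the paper's implementation.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a summed log-probability of a response under the
    policy (pol_*) or the frozen reference model (ref_*).
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def segment_logprob(turn_logprobs, start, end):
    """Sum per-turn log-probabilities over turns start..end-1 (assumed indexing)."""
    return sum(turn_logprobs[start:end])


# Hypothetical per-turn log-probabilities of the agent's turns in two sessions.
pol_pos, ref_pos = [-1.0, -2.0, -0.5], [-1.0, -2.5, -1.0]   # preferred session
pol_neg, ref_neg = [-1.0, -3.0, -2.0], [-1.0, -2.5, -1.5]   # dispreferred session

# Assume turns 1 and 2 form the selected segment; turn 0 is shared context.
start, end = 1, 3
loss = dpo_loss(
    segment_logprob(pol_pos, start, end),
    segment_logprob(pol_neg, start, end),
    segment_logprob(ref_pos, start, end),
    segment_logprob(ref_neg, start, end),
)
```

Restricting the sums to the segment means turns outside it contribute nothing to the loss, which is one way to read the abstract's claim that segment-level data avoids the noise of session-level comparisons.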
Taxonomy
Topics: Reinforcement Learning in Robotics · Data Management and Algorithms · Multi-Agent Systems and Negotiation
