Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee

TL;DR
This paper introduces DuET-PD, a framework for evaluating LLMs' ability to handle persuasion in dialogues, and proposes Holistic DPO training to improve models' robustness and adaptability in knowledge and safety contexts.
Contribution
The paper presents DuET-PD for assessing persuasion dynamics and introduces Holistic DPO training, significantly enhancing LLMs' resistance to misinformation and receptiveness to corrections.
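As a rough illustration of the idea behind Holistic DPO, the sketch below builds preference pairs from both kinds of persuasion and scores each pair with the standard DPO objective. It rests on assumptions this page does not confirm. Under misleading persuasion, the preferred reply is taken to be the one that keeps the correct answer. Under corrective persuasion, the preferred reply is taken to be the one that adopts it. "Balancing" is read as sampling equal numbers of each type. The names `PersuasionTurn`, `PreferencePair`, `build_holistic_pairs` and `dpo_loss`, and the default `beta=0.1`, are hypothetical. This is not the authors' implementation.

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class PersuasionTurn:
    # Hypothetical record for one persuasive turn in a dialogue.
    prompt: str           # dialogue history ending with the persuader's message
    persuasion: str       # "misleading" or "corrective"
    correct_reply: str    # reply that ends on the correct answer
    incorrect_reply: str  # reply that ends on the wrong answer


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_holistic_pairs(turns: List[PersuasionTurn], seed: int = 0) -> List[PreferencePair]:
    """Sample equal numbers of misleading and corrective turns, preferring the correct reply.

    Under misleading persuasion this rewards resisting; under corrective
    persuasion it rewards accepting the correction (assumed pairing scheme).
    """
    misleading = [t for t in turns if t.persuasion == "misleading"]
    corrective = [t for t in turns if t.persuasion == "corrective"]
    n = min(len(misleading), len(corrective))
    rng = random.Random(seed)
    selected = rng.sample(misleading, n) + rng.sample(corrective, n)
    return [PreferencePair(t.prompt, t.correct_reply, t.incorrect_reply) for t in selected]


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from summed log-probabilities of each reply."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy turns (not from the paper): two misleading, one corrective.
turns = [
    PersuasionTurn("ctx-1", "misleading", "I'll keep answer B.", "You're right, it's C."),
    PersuasionTurn("ctx-2", "misleading", "I'll keep answer A.", "Fine, it's D."),
    PersuasionTurn("ctx-3", "corrective", "Good point, it's B.", "No, I'll keep C."),
]
pairs = build_holistic_pairs(turns)
print(len(pairs))                               # 2: one misleading pair, one corrective pair
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))  # 0.6931, i.e. log 2 at zero margin
```

Including corrective pairs is what separates this from a resist-only dataset. Without them, the objective would only ever reward holding the original answer.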
Findings
GPT-4o achieves only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion.
Newer open-source models show increasing sycophancy.
Holistic DPO improves safety context accuracy from 4.21% to 76.54%.
Abstract
Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework that evaluates multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy on MMLU-Pro under sustained misleading persuasion. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation…
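To make the dual evaluation concrete, the sketch below computes per-turn accuracy for each (domain, persuasion type) cell. This is one plausible way to trace stance changes across a multi-turn dialogue. The `Dialogue` record is hypothetical, as are the turn layout (an initial answer followed by three persuasion turns) and the toy data. Nothing here reproduces the paper's data or its exact metrics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Dialogue:
    # Hypothetical record of one evaluated dialogue.
    domain: str          # "knowledge" (e.g. MMLU-Pro) or "safety" (e.g. SALAD-Bench)
    persuasion: str      # "corrective" or "misleading"
    correct: List[bool]  # correctness at the initial answer, then after each persuasion turn


def accuracy_by_turn(dialogues: List[Dialogue]) -> Dict[Tuple[str, str], List[float]]:
    """Per-turn accuracy for each (domain, persuasion type) cell of the dual grid."""
    groups: Dict[Tuple[str, str], List[List[bool]]] = defaultdict(list)
    for d in dialogues:
        groups[(d.domain, d.persuasion)].append(d.correct)

    result: Dict[Tuple[str, str], List[float]] = {}
    for cell, runs in groups.items():
        n_turns = len(runs[0])
        if any(len(r) != n_turns for r in runs):
            raise ValueError(f"inconsistent number of turns in cell {cell}")
        # Fraction of dialogues in this cell that are correct at turn t.
        result[cell] = [sum(r[t] for r in runs) / len(runs) for t in range(n_turns)]
    return result


# Toy data (not from the paper): initial answer plus three persuasion turns.
data = [
    Dialogue("knowledge", "misleading", [True, True, False, False]),
    Dialogue("knowledge", "misleading", [True, False, False, False]),
    Dialogue("knowledge", "corrective", [False, True, True, True]),
    Dialogue("safety", "misleading", [True, False, False, False]),
]
acc = accuracy_by_turn(data)
print(acc[("knowledge", "misleading")])  # [1.0, 0.5, 0.0, 0.0]
print(acc[("knowledge", "corrective")])  # [0.0, 1.0, 1.0, 1.0]
```

In this toy view, a falling curve in a misleading cell signals gullibility. A curve that stays low in a corrective cell signals stubbornness. The paper's own metrics may be defined differently.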
