Teaching Models to Balance Resisting and Accepting Persuasion
Elias Stengel-Eskin, Peter Hase, Mohit Bansal

TL;DR
This paper introduces Persuasion-Balanced Training (PBT), a method that trains large language models to resist negative persuasion and accept positive persuasion, improving their robustness and collaborative performance.
Contribution
The paper presents PBT, a novel training approach using multi-agent dialogue trees to balance resistance and acceptance of persuasion in large language models.
Findings
PBT improves resistance to misinformation and adversarial persuasion.
PBT enhances stability and teamwork in multi-agent debates.
Models trained with PBT outperform baseline models on holistic persuasion data.
Abstract
Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models. Moreover, PBT consistently…
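The abstract does not spell out how the dialogue trees become preference data, so the following is only a toy sketch of one plausible reading, not the authors' implementation. It assumes a tree whose sibling nodes are alternative replies to the same history, scores each branch by whether its leaves end on a known gold answer, and emits (history, chosen, rejected) triples for siblings whose scores differ. The names `Turn`, `score`, and `preference_pairs`, and the scoring rule itself, are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Turn:
    """One node in a toy dialogue tree (hypothetical structure)."""
    text: str        # what was said at this turn
    answer: str      # answer held by the target agent after this turn
    children: list["Turn"] = field(default_factory=list)


def score(node: Turn, gold: str) -> float:
    """Assumed rule: a leaf scores 1.0 if its answer matches gold,
    an internal node scores the mean of its children."""
    if not node.children:
        return 1.0 if node.answer == gold else 0.0
    return sum(score(c, gold) for c in node.children) / len(node.children)


def preference_pairs(node: Turn, gold: str, context: tuple[str, ...] = ()):
    """Yield (history, chosen, rejected) for sibling replies with different scores."""
    history = context + (node.text,)
    for a, b in combinations(node.children, 2):
        sa, sb = score(a, gold), score(b, gold)
        if sa != sb:
            chosen, rejected = (a, b) if sa > sb else (b, a)
            yield history, chosen.text, rejected.text
    for child in node.children:
        yield from preference_pairs(child, gold, history)


# Toy tree: one branch tests resisting a wrong challenge (negative persuasion),
# the other tests accepting a correct one (positive persuasion).
root = Turn("Q: What is 6 * 7?", "", [
    Turn("A: It is 42.", "42", [
        Turn("B: No, it is 48.", "42", [
            Turn("A: I checked again; it is 42.", "42"),
            Turn("A: You are right, it is 48.", "48"),
        ]),
    ]),
    Turn("A: It is 48.", "48", [
        Turn("B: No, it is 42.", "48", [
            Turn("A: You are right, it is 42.", "42"),
            Turn("A: I am sure it is 48.", "48"),
        ]),
    ]),
])

pairs = list(preference_pairs(root, gold="42"))
for history, chosen, rejected in pairs:
    print(history[-1], "| chosen:", chosen, "| rejected:", rejected)
```

In this toy setup the same procedure yields one pair that rewards holding a correct answer under pressure and one that rewards switching to a correct answer, which is the balance the abstract argues for. Triples of this shape could then be passed to a preference-optimization trainer.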
Taxonomy
Topics: Social Media and Politics · Communication in Education and Healthcare
