Emergent Persuasion: Will LLMs Persuade Without Being Prompted?
Vincent Chang, Thee Ho, Sunishchal Dev, Kevin Zhu, Shi Feng, Kellin Pelrine, Matthew Kowal

TL;DR
This paper investigates the potential for large language models to persuade without explicit prompts, especially when fine-tuned on persuasion data, highlighting emergent risks of unprompted harmful influence.
Contribution
It introduces the concept of unprompted persuasion in LLMs and demonstrates that supervised fine-tuning can increase models' tendency to persuade on harmful topics.
Findings
Fine-tuning on persuasion datasets increases unprompted persuasion.
Steering models by persona traits does not reliably induce persuasion.
Emergent harmful persuasion can arise without explicit prompts.
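The steering finding above refers to activation steering, which this page describes only as internal activation steering along persona traits. As a rough illustration of the general idea, and not the authors' implementation, the sketch below assumes one common variant: a trait direction computed as the difference of mean activations between examples with and without the trait, then added to a hidden state with a scaling factor. All names (`trait_direction`, `steer`, `alpha`) and the toy activations are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trait_direction(with_trait: List[Vector], without_trait: List[Vector]) -> Vector:
    """Assumed recipe: difference of mean activations (trait minus no trait)."""
    mu_pos = mean_vector(with_trait)
    mu_neg = mean_vector(without_trait)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the trait direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations; real models use thousands of dimensions.
with_trait = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_trait = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = trait_direction(with_trait, without_trait)
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
```

In a real model the shift would be applied to a chosen layer's activations during generation; which layer and which `alpha` to use are tuning choices that this page does not specify.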
Abstract
With the wide-scale adoption of conversational AI systems, AI is now able to exert unprecedented influence on human opinions and beliefs. Recent work has shown that many Large Language Models (LLMs) comply with requests to persuade users into harmful beliefs or actions when prompted, and that model persuasiveness increases with model scale. However, this prior work looked at persuasion from the threat model of misuse (i.e., a bad actor asking an LLM to persuade). In this paper, we instead aim to answer the following question: under what circumstances would models persuade without being prompted? The answer shapes how concerned we should be about such emergent persuasion risks. To achieve this, we study unprompted persuasion under two scenarios: (i) when the model is steered (through internal activation steering) along persona traits, and (ii) when the…
Taxonomy
Topics: AI in Service Interactions · Explainable Artificial Intelligence (XAI) · Misinformation and Its Impacts
