How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi

TL;DR
This paper rethinks AI safety by humanizing LLMs and demonstrates that persuasion techniques can effectively jailbreak models, highlighting the need for better defenses against human-like adversarial interactions.
Contribution
It introduces a persuasion taxonomy derived from social science research, uses it to automatically generate interpretable persuasive adversarial prompts (PAP), and shows that these prompts bypass safety measures at a high success rate, revealing vulnerabilities in current defenses.
Findings
Persuasion significantly increases jailbreak success rates across all risk categories.
PAP achieves an attack success rate of over 92% across multiple LLMs.
Existing defenses are insufficient against persuasive adversarial prompts.
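The findings above are stated in terms of an attack success rate. As a rough illustration of how such a metric could be computed, here is a minimal sketch in Python. It assumes a query counts as jailbroken if at least one of its trials is judged successful; that counting rule, the function name `attack_success_rate`, and the example data are all hypothetical and not taken from the paper.

```python
from typing import Dict, List


def attack_success_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Return the fraction of queries with at least one successful trial.

    Assumption (not from the paper): a query is counted as jailbroken
    if any of its trials was judged a success.
    """
    if not outcomes:
        raise ValueError("no queries to score")
    jailbroken = sum(1 for trials in outcomes.values() if any(trials))
    return jailbroken / len(outcomes)


# Hypothetical judged outcomes: query id -> one boolean per trial.
example = {
    "q1": [False, True, False],
    "q2": [False, False, False],
    "q3": [True, True, False],
    "q4": [False, False, True],
}

print(f"ASR = {attack_success_rate(example):.0%}")  # prints "ASR = 75%"
```

Under this counting rule, more trials per query can only raise the rate, so any reported figure should be read together with the number of trials used.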
Abstract
Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate…
Taxonomy
Topics: Ethics and Social Impacts of AI · Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Linear Layer · Cosine Annealing · Dense Connections · Linear Warmup With Cosine Annealing · Position-Wise Feed-Forward Layer
