How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

Yi Zeng; Hongpeng Lin; Jingwen Zhang; Diyi Yang; Ruoxi Jia; Weiyan Shi

arXiv:2401.06373·cs.CL·January 25, 2024·5 cites

TL;DR

This paper rethinks AI safety by humanizing LLMs and demonstrates that persuasion techniques can effectively jailbreak models, highlighting the need for better defenses against human-like adversarial interactions.

Contribution

The paper introduces a persuasion taxonomy derived from social science research, uses it to automatically generate interpretable persuasive adversarial prompts (PAP), and shows that these prompts bypass LLM safety measures at a high success rate, exposing gaps in current defenses.

Findings

01 Persuasion significantly increases jailbreak success rates.

02 PAP achieves an attack success rate of over 92% across multiple LLMs.

03 Existing defenses are insufficient against persuasive adversarial prompts.

Abstract

Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate…

Code & Models

Datasets

CHATS-Lab/Persuasive-Jailbreaker-Data
dataset · 15 dl

Taxonomy

Topics: Ethics and Social Impacts of AI · Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Linear Layer · Cosine Annealing · Dense Connections · Linear Warmup With Cosine Annealing · Position-Wise Feed-Forward Layer