Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks
Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha

TL;DR
This paper investigates how persuasion techniques influence LLM jailbreak attacks, revealing that persuasion-aware prompts can effectively bypass safeguards and highlighting the need for interdisciplinary approaches to improve LLM safety.
Contribution
The paper introduces the concept of persuasive fingerprints in LLM jailbreaks and demonstrates the effectiveness of persuasion-aware prompts across multiple models, combining social science theories with AI safety research.
Findings
Persuasion-aware prompts can effectively bypass safeguards
LLMs exhibit distinct persuasive response patterns
Cross-disciplinary insights enhance understanding of LLM vulnerabilities
Abstract
Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct…
