Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

Havva Alizadeh Noughabi; Julien Serbanescu; Fattane Zarrinkalam; Ali Dehghantanha

arXiv:2510.21983·cs.CL·October 28, 2025

TL;DR

This paper investigates how persuasion techniques influence LLM jailbreak attacks, revealing that persuasion-aware prompts can effectively bypass safeguards and highlighting the need for interdisciplinary approaches to improve LLM safety.

Contribution

The paper introduces the concept of persuasive fingerprints in LLM jailbreaks and demonstrates their effectiveness across multiple models, combining social-science theories of persuasion with AI safety research.

Findings

01

Persuasion-aware prompts can effectively bypass safeguards

02

LLMs exhibit distinct persuasive response patterns

03

Cross-disciplinary insights enhance understanding of LLM vulnerabilities

Abstract

Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct…
