Dagger Behind Smile: Fool LLMs with a Happy Ending Story
Xurui Song, Zhixin Xie, Shuo Huai, Jiayi Kong, Jun Luo

TL;DR
This paper introduces the Happy Ending Attack (HEA), a novel prompt-based jailbreak method that exploits LLMs' responsiveness to positive prompts, achieving high success rates with minimal interactions.
Contribution
The paper proposes HEA, a new efficient and effective jailbreak technique using positive prompts, demonstrating its success across multiple state-of-the-art LLMs.
Findings
HEA achieves an 88.79% success rate on average.
HEA requires only up to two turns to succeed.
HEA is effective against models like GPT-4o, Llama3-70b, and Gemini-pro.
Abstract
The wide adoption of Large Language Models (LLMs) has attracted significant attention from jailbreak attacks, where adversarial prompts crafted through optimization or manual design exploit LLMs to generate malicious content. However, optimization-based attacks have limited efficiency and transferability, while existing manual designs are either easily detectable or demand intricate interactions with LLMs. In this paper, we first point out a novel perspective for jailbreak attacks: LLMs are more responsive to positive prompts. Based on this, we deploy Happy Ending Attack (HEA) to wrap up a malicious request in a scenario template involving a positive prompt formed mainly via a happy ending; it thus fools LLMs into jailbreaking either immediately or at a follow-up malicious request. This has made HEA both efficient and effective, as it requires only up…
Taxonomy
Topics: Artificial Intelligence in Games
Methods: Softmax · Attention Is All You Need
