Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars
Yu Yan, Sheng Sun, Junqi Tong, Min Liu, and Qi Li

TL;DR
This paper introduces AVATAR, a novel metaphor-based attack framework that exploits LLMs' imaginative abilities to bypass safety measures, revealing a significant security vulnerability and achieving high success rates across multiple models.
Contribution
The study presents AVATAR, the first framework leveraging metaphorical avatars to effectively jailbreak LLMs, exposing their vulnerability to adversarial metaphors and highlighting the need for improved defenses.
Findings
AVATAR achieves state-of-the-art attack success rates.
It demonstrates the transferability of metaphor-based jailbreaks.
The study reveals inherent vulnerabilities in LLMs' imaginative capabilities.
Abstract
Metaphor is an implicit way to convey information that also enables a generalized comprehension of complex subjects. However, metaphor can potentially be exploited to bypass the safety alignment mechanisms of Large Language Models (LLMs), leading to the theft of harmful knowledge. In this study, we introduce Jailbreak Via Adversarial MeTAphoR (AVATAR), a novel attack framework that exploits the imaginative capacity of LLMs to achieve jailbreaking. Specifically, to elicit the harmful response, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities based on the LLM's imagination. Then, guided by these metaphors, the harmful target is adaptively nested within a human-like interaction to achieve the jailbreak.…
Taxonomy
Topics: Digital and Cyber Forensics · Artificial Intelligence in Law · Artificial Intelligence in Games
