Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars

Yu Yan, Sheng Sun, Junqi Tong, Min Liu, and Qi Li

arXiv:2412.12145 · cs.CL · February 25, 2025

TL;DR

This paper introduces AVATAR, a novel metaphor-based attack framework that exploits LLMs' imaginative abilities to bypass safety measures, revealing a significant security vulnerability and achieving high success rates across multiple models.

Contribution

The study presents AVATAR, the first framework leveraging metaphorical avatars to effectively jailbreak LLMs, exposing their vulnerability to adversarial metaphors and highlighting the need for improved defenses.

Findings

01. AVATAR achieves state-of-the-art attack success rates.

02. It demonstrates the transferability of metaphor-based jailbreaks.

03. The study reveals inherent vulnerabilities in LLMs' imaginative capabilities.

Abstract

Metaphor serves as an implicit approach to conveying information, while enabling the generalized comprehension of complex subjects. However, metaphor can potentially be exploited to bypass the safety alignment mechanisms of Large Language Models (LLMs), leading to the theft of harmful knowledge. In our study, we introduce a novel attack framework that exploits the imaginative capacity of LLMs to achieve jailbreaking: JAilbreak Via Adversarial MeTA-phoR (AVATAR). Specifically, to elicit the harmful response, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities based on the LLM's imagination. Then, according to these metaphors, the harmful target is nested within human-like interaction for jailbreaking adaptively.…

Taxonomy

Topics: Digital and Cyber Forensics · Artificial Intelligence in Law · Artificial Intelligence in Games