Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection

Zhipeng Wei; Yuqi Liu; N. Benjamin Erichson

arXiv:2411.01077 · cs.CL · August 19, 2025

TL;DR

This paper introduces Emoji Attack, a novel method that exploits token segmentation bias and semantic ambiguity by inserting emojis into prompts, significantly reducing the effectiveness of Judge LLMs in detecting harmful content.

Contribution

The paper reveals the vulnerability of Judge LLMs to token segmentation bias and proposes Emoji Attack as an effective strategy to bypass safety measures.

Findings

01 Emoji Attack significantly lowers detection accuracy.

02 It exploits token segmentation bias and semantic ambiguity.

03 It bypasses existing safety safeguards effectively.

Abstract

Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting…
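
The token segmentation effect described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the vocabulary, the greedy longest-match tokenizer, and the fixed insertion position are all hypothetical stand-ins (the paper uses in-context learning to choose where emojis go, and real Judge LLMs use learned subword vocabularies). It only shows how an inserted delimiter can break a word that was a single token into several smaller sub-tokens.

```python
# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
TOY_VOCAB = {"harmful", "harm", "ful", "content", "con", "tent", "ha", "rm"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation of whitespace-separated words.

    Characters not covered by any vocabulary entry become
    single-character tokens.
    """
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest vocabulary entry starting at position i.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No entry matched: emit one character and move on.
                tokens.append(word[i])
                i += 1
    return tokens


def insert_delimiter(word, position, delimiter="😊"):
    """Insert a delimiter inside a word at a fixed (assumed) position."""
    return word[:position] + delimiter + word[position:]


if __name__ == "__main__":
    clean = "harmful content"
    perturbed = " ".join(insert_delimiter(w, 2) for w in clean.split())
    print(toy_tokenize(clean))      # two whole-word tokens
    print(toy_tokenize(perturbed))  # many smaller sub-tokens
```

In this toy setting the clean text maps to two tokens, while the perturbed text fragments into nine, so any model that consumes the token sequence sees a very different input even though a human reader still recognizes the words.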

Code & Models

Repositories

zhipeng-wei/EmojiAttack (PyTorch, official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Information and Cyber Security · Deception detection and forensic psychology

Methods: LLaMA