Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson

TL;DR
This paper introduces Emoji Attack, a novel method that exploits token segmentation bias and semantic ambiguity by inserting emojis into prompts, significantly reducing the effectiveness of Judge LLMs in detecting harmful content.
Contribution
The paper reveals the vulnerability of Judge LLMs to token segmentation bias and proposes Emoji Attack as an effective strategy to bypass safety measures.
Findings
Emoji Attack significantly lowers the accuracy with which Judge LLMs detect harmful content.
It exploits token segmentation bias and semantic ambiguity.
It bypasses existing safety safeguards effectively.
Abstract
Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting…
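The token segmentation bias described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a hypothetical six-entry vocabulary and a simple greedy longest-match tokenizer (the names `VOCAB`, `tokenize`, and `insert_delimiter` are illustrative assumptions), and it inserts the emoji at a fixed position rather than choosing positions through in-context learning as the paper does. It only shows how a delimiter placed inside a word can force that word to be split into smaller sub-tokens.

```python
# Hypothetical toy vocabulary; real Judge LLMs use learned subword vocabularies.
VOCAB = {"harmful", "harm", "ful", "content", "con", "tent"}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match segmentation of whitespace-separated words.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest vocabulary entry that starts at position i.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No entry matched: emit the single character as its own token.
                tokens.append(word[i])
                i += 1
    return tokens


def insert_delimiter(word, position, delimiter="😊"):
    """Insert a delimiter (here an emoji) inside a word at a character offset."""
    return word[:position] + delimiter + word[position:]


clean = tokenize("harmful content")
attacked = tokenize(insert_delimiter("harmful", 4) + " content")

print(clean)     # ['harmful', 'content']
print(attacked)  # ['harm', '😊', 'ful', 'content']
```

In this toy setting the word "harmful" is a single token before the insertion and three tokens after it. A real model would embed the two sequences differently, which is the effect the abstract attributes the drop in detection accuracy to.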
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Information and Cyber Security · Deception detection and forensic psychology
Methods: LLaMA
