Emoji-Based Jailbreaking of Large Language Models

M P V S Gopinadh; S Mahaboob Hussain

arXiv:2601.00936·cs.CR·January 6, 2026

Open Access

TL;DR

This paper investigates how emoji sequences embedded in prompts can bypass the safety mechanisms of large language models. It finds that vulnerability varies by model and argues that current safety measures need to improve.

Contribution

It provides an empirical analysis of emoji-based jailbreaks across multiple open-source LLMs, highlighting model-specific vulnerabilities and the limitations of current safety mechanisms.

Findings

01

Gemma 2 9B and Mistral 7B each show a 10% emoji-jailbreak success rate

02

Qwen 2 7B maintains full safety alignment (0% jailbreak success rate)

03

Significant differences in vulnerability levels among models (chi^2 = 32.94, p < 0.001)

Abstract

Large Language Models (LLMs) are integral to modern AI applications, but their safety alignment mechanisms can be bypassed through adversarial prompt engineering. This study investigates emoji-based jailbreaking, where emoji sequences are embedded in textual prompts to trigger harmful and unethical outputs from LLMs. We evaluated 50 emoji-based prompts on four open-source LLMs: Mistral 7B, Qwen 2 7B, Gemma 2 9B, and Llama 3 8B. Metrics included jailbreak success rate, safety alignment adherence, and latency, with responses categorized as successful, partial, or failed. Results revealed model-specific vulnerabilities: Gemma 2 9B and Mistral 7B exhibited 10% success rates, while Qwen 2 7B achieved full alignment (0% success). A chi-square test (chi^2 = 32.94, p < 0.001) confirmed significant inter-model differences. While prior works focused on emoji attacks targeting safety judges or…
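The inter-model comparison described in the abstract can be illustrated with a small sketch of a Pearson chi-square test of independence over a models-by-outcome contingency table. This is not the authors' code. The function names (`success_rate`, `chi_square_statistic`) are hypothetical, and the counts are placeholders: only the 5-of-50 and 0-of-50 "successful" entries follow from the stated 10% and 0% rates, while the partial/failed split and all Llama 3 8B counts are invented for illustration. The sketch is therefore not expected to reproduce chi^2 = 32.94.

```python
# Minimal sketch of a chi-square test of independence on jailbreak outcomes.
# ASSUMPTION: all counts below are illustrative placeholders, not the paper's data.

CATEGORIES = ("successful", "partial", "failed")

# Rows: one model each; columns follow CATEGORIES; each row sums to 50 prompts.
PLACEHOLDER_COUNTS = {
    "Mistral 7B": [5, 10, 35],  # 5/50 = 10% success (stated); split is hypothetical
    "Qwen 2 7B": [0, 0, 50],    # 0% success (stated); split is hypothetical
    "Gemma 2 9B": [5, 5, 40],   # 5/50 = 10% success (stated); split is hypothetical
    "Llama 3 8B": [0, 5, 45],   # entirely hypothetical
}


def success_rate(row):
    """Fraction of prompts in the 'successful' category (first column)."""
    return row[0] / sum(row)


def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of model and outcome.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


table = list(PLACEHOLDER_COUNTS.values())
stat, dof = chi_square_statistic(table)
for model, row in PLACEHOLDER_COUNTS.items():
    print(f"{model}: success rate {success_rate(row):.0%}")
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

With four models and three outcome categories the test has (4 - 1) x (3 - 1) = 6 degrees of freedom. A p-value can then be read from the chi-square distribution, for example with `scipy.stats.chi2.sf(stat, dof)` if SciPy is available.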

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Explainable Artificial Intelligence (XAI)