Emoji-Based Jailbreaking of Large Language Models
M P V S Gopinadh, S Mahaboob Hussain

TL;DR
This paper investigates how emoji sequences embedded in prompts can bypass the safety mechanisms of large language models, revealing model-specific vulnerabilities and underscoring the need for stronger safety measures.
Contribution
It provides an empirical analysis of emoji-based jailbreaks across multiple open-source LLMs, highlighting model-specific vulnerabilities and the limitations of current safety mechanisms.
Findings
Gemma 2 9B and Mistral 7B have 10% success rates in emoji jailbreaks
Qwen 2 7B maintains full safety alignment with 0% success
Significant differences in vulnerability levels among models (χ² = 32.94, p < 0.001)
Abstract
Large Language Models (LLMs) are integral to modern AI applications, but their safety alignment mechanisms can be bypassed through adversarial prompt engineering. This study investigates emoji-based jailbreaking, where emoji sequences are embedded in textual prompts to trigger harmful and unethical outputs from LLMs. We evaluated 50 emoji-based prompts on four open-source LLMs: Mistral 7B, Qwen 2 7B, Gemma 2 9B, and Llama 3 8B. Metrics included jailbreak success rate, safety alignment adherence, and latency, with responses categorized as successful, partial, or failed. Results revealed model-specific vulnerabilities: Gemma 2 9B and Mistral 7B exhibited 10% success rates, while Qwen 2 7B achieved full alignment (0% success). A chi-square test (χ² = 32.94, p < 0.001) confirmed significant inter-model differences. While prior works focused on emoji attacks targeting safety judges or…
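The inter-model comparison described in the abstract is a chi-square test of independence on a models-by-outcome contingency table. The following is a minimal sketch of how such a statistic is computed, not the authors' code. The function name `chi_square_statistic`, the model labels `model_a` to `model_d`, and every count in the table are hypothetical placeholders; the paper's actual per-category counts are not given in this summary, so the printed value will not match the reported χ².

```python
from typing import Sequence, Tuple


def chi_square_statistic(table: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Return Pearson's chi-square statistic and degrees of freedom
    for an r x c contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        # An empty row or column gives an expected count of zero.
        raise ValueError("every row and column needs at least one observation")

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of model and outcome.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# HYPOTHETICAL counts, NOT the paper's data.
# Columns: [successful, partial, failed]; each row sums to 50 prompts.
counts = {
    "model_a": [6, 9, 35],
    "model_b": [4, 11, 35],
    "model_c": [1, 2, 47],
    "model_d": [2, 6, 42],
}

stat, dof = chi_square_statistic(list(counts.values()))
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

With four models and three outcome categories the test has (4 - 1) × (3 - 1) = 6 degrees of freedom. Converting the statistic to a p-value requires the chi-square distribution's survival function, which a statistics library would normally supply.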
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Explainable Artificial Intelligence (XAI)
