Special-Character Adversarial Attacks on Open-Source Language Model
Ephraiem Sarabamoun

TL;DR
This paper investigates the security vulnerabilities of open-source large language models to special-character adversarial attacks, revealing critical weaknesses that can bypass safety measures and cause undesirable outputs.
Contribution
It systematically evaluates four families of special-character attacks (Unicode, homoglyph, structural, and textual encoding) on seven open-source LLMs ranging from 3.8B to 32B parameters, highlighting their susceptibility and exposing failure modes.
Findings
All models are vulnerable to special-character attacks.
Successful jailbreaks and hallucinations occur across models.
Vulnerabilities appear across all model sizes tested (3.8B to 32B parameters).
Abstract
Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations presents significant security challenges for real-world deployments. This paper studies special-character attacks, including Unicode, homoglyph, structural, and textual-encoding attacks, that aim to bypass safety mechanisms. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on 4,000+ attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
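To make the four attack families named in the abstract concrete, the sketch below applies one simple perturbation of each kind to a benign string. It is only an illustration for robustness testing: the character map, the delimiter wrapper, and all function names are assumptions made here, not the perturbations or code used in the paper.

```python
import base64

# Hypothetical Latin -> Cyrillic look-alike map (illustrative, not from the paper).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441", "p": "\u0440"}
ZERO_WIDTH_SPACE = "\u200b"


def homoglyph_variant(text: str) -> str:
    """Replace selected Latin letters with visually similar Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def zero_width_variant(text: str) -> str:
    """Unicode perturbation: put a zero-width space between adjacent characters."""
    return ZERO_WIDTH_SPACE.join(text)


def structural_variant(text: str) -> str:
    """Structural perturbation: wrap the text in extra delimiters (assumed form)."""
    return f"[[[ {text} ]]]"


def base64_variant(text: str) -> str:
    """Textual-encoding perturbation: re-encode the text as Base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def perturb(text: str) -> dict:
    """Return one variant per attack family, keyed by family name."""
    return {
        "unicode": zero_width_variant(text),
        "homoglyph": homoglyph_variant(text),
        "structural": structural_variant(text),
        "encoding": base64_variant(text),
    }


if __name__ == "__main__":
    for family, variant in perturb("open a cap").items():
        # repr() makes invisible and look-alike characters visible as escapes.
        print(f"{family:>10}: {variant!r}")
```

Each variant looks the same as the input to a human reader, or can be trivially decoded back to it, but reaches the model's tokenizer as a different character sequence. That gap is what character-level evaluations of this kind probe.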
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
