Special-Character Adversarial Attacks on Open-Source Language Model

Ephraiem Sarabamoun

arXiv:2508.14070 · cs.CR · November 27, 2025

TL;DR

This paper investigates how vulnerable open-source large language models are to special-character adversarial attacks, and reports weaknesses that let such attacks bypass safety mechanisms and trigger jailbreaks, incoherent outputs, and unrelated hallucinations.

Contribution

It evaluates four families of special-character attacks (Unicode, homoglyph, structural, and textual encoding) on seven open-source LLMs ranging from 3.8B to 32B parameters, using more than 4,000 attack attempts, and documents the failure modes they expose.

Findings

01

All seven evaluated models are vulnerable to special-character attacks.

02

Successful jailbreaks and hallucinations occur across models.

03

Vulnerabilities increase with model size.

Abstract

Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, yet their vulnerability to character-level adversarial manipulations presents significant security challenges for real-world deployments. This paper presents a study of several special-character attacks, including Unicode, homoglyph, structural, and textual encoding attacks, aimed at bypassing safety mechanisms. We evaluate seven prominent open-source models ranging from 3.8B to 32B parameters on more than 4,000 attack attempts. These experiments reveal critical vulnerabilities across all model sizes, exposing failure modes that include successful jailbreaks, incoherent outputs, and unrelated hallucinations.
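
To make the attack families named in the abstract more concrete, the following is a minimal Python sketch of three of them (homoglyph substitution, an invisible Unicode character insertion, and a textual encoding) applied to a harmless string. It is an illustration only, not the paper's implementation: the function names, the small look-alike map, and the choice of Base64 and the zero-width space are all assumptions made for this example, and the structural category is left out because the abstract does not say what it covers.

```python
import base64
import unicodedata

# Hypothetical Latin -> Cyrillic look-alike map (illustrative, not from the paper).
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}
ZERO_WIDTH_SPACE = "\u200b"


def homoglyph_variant(text: str) -> str:
    """Replace selected Latin letters with visually similar Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def zero_width_variant(text: str) -> str:
    """Insert a zero-width space between every pair of adjacent characters."""
    return ZERO_WIDTH_SPACE.join(text)


def base64_variant(text: str) -> str:
    """Encode the text as Base64, one possible 'textual encoding'."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def strip_format_chars(text: str) -> str:
    """Drop Unicode format characters (category Cf), e.g. the zero-width space."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


if __name__ == "__main__":
    prompt = "open the pod bay doors"  # benign placeholder prompt
    for name, variant in [
        ("homoglyph", homoglyph_variant(prompt)),
        ("zero-width", zero_width_variant(prompt)),
        ("base64", base64_variant(prompt)),
    ]:
        # ascii() exposes the hidden or substituted code points as escapes.
        print(f"{name:>10}: {ascii(variant)}")
```

The homoglyph and zero-width variants look the same as the original when rendered but differ at the code-point level, so a model's tokenizer sees a different input. `strip_format_chars` shows one simple input-sanitization step that undoes the zero-width insertion. It would not undo the homoglyph substitution, which needs a separate look-alike mapping.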

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling