LLM-Agnostic Semantic Representation Attack

Jiawei Lian; Jianhong Pan; Lefan Wang; Yi Wang; Tairan Huang; Shaohui Mei; Lap-Pui Chau

arXiv:2605.08898 · cs.CL · May 12, 2026

TL;DR

This paper introduces Semantic Representation Attack (SRA), an adversarial attack on LLMs that optimizes prompts toward malicious semantic representations rather than exact target text, with the aim of improving attack success, transferability, and stealth.

Contribution

The paper proposes a new LLM-agnostic adversarial attack paradigm based on semantic representations, with theoretical guarantees and an effective search algorithm.

Findings

01. Achieves a 99.71% attack success rate across 26 LLMs

02. Demonstrates strong transferability and stealth of the attack

03. Provides theoretical bounds linking semantic coherence to attack effectiveness

Abstract

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., "Sure, here is..."). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic…
