LLM-Agnostic Semantic Representation Attack
Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Tairan Huang, Shaohui Mei, Lap-Pui Chau

TL;DR
This paper introduces Semantic Representation Attack (SRA), an adversarial attack on LLMs that optimizes toward malicious semantic content rather than an exact target text, improving attack success, transferability, and stealth.
Contribution
The paper proposes a new LLM-agnostic adversarial attack paradigm based on semantic representations, with theoretical guarantees and an effective search algorithm.
Findings
Achieves a 99.71% attack success rate across 26 LLMs
Demonstrates strong transferability and stealth of the attack
Provides theoretical bounds linking semantic coherence to attack effectiveness
Abstract
Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., "Sure, here is..."). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic…
