Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in Large Language Models

Yujia Zheng; Tianhao Li; Haotian Huang; Tianyu Zeng; Jingyu Lu; Chuangxin Chu; Yuekai Huang; Ziyou Jiang; Qian Xiong; Yuyao Ge; Mingyang Li

arXiv:2508.01554 · cs.CL · August 5, 2025


Open Access

TL;DR

This paper introduces PromptAnatomy, a framework that dissects prompt structures to generate interpretable adversarial examples, revealing the heterogeneous vulnerabilities of prompt components in large language models and improving robustness evaluation.

Contribution

The paper presents PromptAnatomy and ComPerturb, novel methods for dissecting prompts and generating targeted adversarial attacks, addressing the structural heterogeneity overlooked by prior approaches.

Findings

01. PromptAnatomy effectively dissects prompts into functional components.

02. ComPerturb achieves state-of-the-art attack success rates across datasets and models.

03. Prompt structure awareness enhances adversarial robustness evaluation.

Abstract

Prompt-based adversarial attacks have become an effective means to assess the robustness of large language models (LLMs). However, existing approaches often treat prompts as monolithic text, overlooking their structural heterogeneity: different prompt components contribute unequally to adversarial robustness. Prior works like PromptRobust assume prompts are value-neutral, but our analysis reveals that complex, domain-specific prompts with rich structures have components with differing vulnerabilities. To address this gap, we introduce PromptAnatomy, an automated framework that dissects prompts into functional components and generates diverse, interpretable adversarial examples by selectively perturbing each component using our proposed method, ComPerturb. To ensure linguistic plausibility and mitigate distribution shifts, we further incorporate a perplexity (PPL)-based filtering…
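
The general idea in the abstract (perturb one component at a time, then discard candidates whose perplexity drifts too far) can be illustrated with a small toy sketch. This is not the authors' PromptAnatomy or ComPerturb implementation. The component names (`role`, `directive`, `constraint`), the character-swap perturbation, the unigram perplexity stand-in, and the `MAX_PPL_RATIO` threshold are all assumptions made for illustration; a real evaluation would presumably score perplexity with a language model.

```python
import math
from collections import Counter

# Hypothetical component labels; the paper's actual taxonomy is not given here.
prompt = {
    "role": "You are a careful medical assistant.",
    "directive": "Classify the following note as urgent or routine.",
    "constraint": "Answer with a single word.",
}


def swap_adjacent_chars(text, index):
    """Toy character-level perturbation: swap the characters at index and index + 1."""
    if index < 0 or index + 1 >= len(text):
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def perturb_component(components, name, perturb):
    """Return a copy of the prompt in which only the named component is perturbed."""
    out = dict(components)
    out[name] = perturb(components[name])
    return out


def render(components):
    """Join the components back into a single prompt string."""
    return " ".join(components.values())


def unigram_perplexity(text, counts, total, vocab_size):
    """Perplexity under an add-one-smoothed unigram model (a stand-in for LLM perplexity)."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


# Fit the toy unigram model on the clean prompt itself.
reference_tokens = render(prompt).lower().split()
counts = Counter(reference_tokens)
total = len(reference_tokens)
vocab_size = len(counts) + 1  # one extra slot for unseen tokens

base_ppl = unigram_perplexity(render(prompt), counts, total, vocab_size)
MAX_PPL_RATIO = 1.5  # assumed threshold, not taken from the paper

# One candidate per component, each differing from the original in that component only.
candidates = {
    name: perturb_component(prompt, name, lambda s: swap_adjacent_chars(s, 4))
    for name in prompt
}

# Keep only candidates whose perplexity stays close to the clean prompt's.
kept = {
    name: cand
    for name, cand in candidates.items()
    if unigram_perplexity(render(cand), counts, total, vocab_size) <= MAX_PPL_RATIO * base_ppl
}

for name, cand in kept.items():
    print(name, "->", cand[name])
```

Keeping each candidate tied to the single component that was changed is what would let attack success be attributed to that component rather than to the prompt as a whole.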

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection