Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification
Yanxi Li, Ruocheng Shan

TL;DR
This paper proposes Label Disguise Defense (LDD), a lightweight, model-agnostic method that uses semantically disguised labels to protect large language models from prompt injection attacks in sentiment classification tasks.
Contribution
It introduces a label disguise strategy that replaces the true class labels with semantically transformed or unrelated aliases, blunting class-directive prompt injections without retraining the model.
Findings
LDD restores a significant portion of accuracy lost to prompt injection.
Multiple alias pairs can outperform the baseline in adversarial settings.
Semantically aligned labels provide stronger robustness than unrelated symbols.
Abstract
Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels (e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected…
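The label-disguise idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the alias pair (blue/yellow) comes from the abstract's example, while the prompt wording and the helper names `build_prompt` and `decode` are hypothetical. The prompt names only the aliases, the few-shot demonstrations carry the mapping implicitly, and the model's answer is translated back to the true label outside the prompt.

```python
from typing import Dict, List, Optional, Tuple

# Alias pair taken from the abstract's example; the mapping itself stays
# outside the prompt, so the true labels are never shown to the model.
ALIASES: Dict[str, str] = {"positive": "blue", "negative": "yellow"}


def build_prompt(
    demos: List[Tuple[str, str]],
    text: str,
    aliases: Dict[str, str] = ALIASES,
) -> str:
    """Build a few-shot prompt that exposes only alias labels (hypothetical wording)."""
    options = " or ".join(aliases.values())
    lines = [f"Answer each review with exactly one word: {options}.", ""]
    for review, true_label in demos:
        # Demonstrations teach the alias mapping implicitly.
        lines += [f"Review: {review}", f"Answer: {aliases[true_label]}", ""]
    lines += [f"Review: {text}", "Answer:"]
    return "\n".join(lines)


def decode(model_output: str, aliases: Dict[str, str] = ALIASES) -> Optional[str]:
    """Map the model's alias answer back to the true label.

    Returns None when the first word is not a known alias, for example when
    an injected instruction made the model emit a raw label such as "positive".
    """
    reverse = {alias: label for label, alias in aliases.items()}
    stripped = model_output.strip()
    if not stripped:
        return None
    token = stripped.split()[0].strip(".,!").lower()
    return reverse.get(token)
```

In this sketch, an injected instruction that names a true label has no valid counterpart in the answer vocabulary, so `decode` returns `None` for it and the caller can flag or retry the input. How the paper itself handles out-of-vocabulary answers is not stated in the abstract.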
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts
