LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
Yuyang Gong, Zihao Wang, Jiawei Liu, XiaoFeng Wang

TL;DR
LocalAlign is an adversarial training method for prompt injection defense. It generates near-target adversarial examples, meaning responses to injected commands that lie close to the correct response, and trains on them so the model resists injections that are not obviously malicious.
Contribution
The paper proposes LocalAlign, an approach that automatically creates near-target adversarial examples (responses close to the correct one) to improve the prompt injection robustness of language models.
Findings
LocalAlign generates effective adversarial examples with a single inference step.
The method enforces a tighter boundary around correct responses, improving defense generalization.
A margin-aware algorithm prioritizes training on samples closer to the correct response.
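The findings above do not say how the margin is measured or how samples are prioritized. As a rough illustration only, the sketch below assumes each adversarial sample comes with a scalar margin (a hypothetical distance between its response and the correct response) and turns smaller margins into larger training weights with a softmax. The names `margin_weights`, `weighted_loss`, and `temperature` are assumptions made for this example, not the paper's algorithm.

```python
import math


def margin_weights(margins, temperature=1.0):
    """Return weights that sum to 1, larger for smaller margins.

    Assumption: `margins` holds one non-negative score per sample, where a
    smaller value means the adversarial response is closer to the correct one.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Negate so that a small margin becomes a large softmax logit.
    scaled = [-m / temperature for m in margins]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_loss(losses, margins, temperature=1.0):
    """Combine per-sample losses using the margin-based weights."""
    weights = margin_weights(margins, temperature)
    return sum(w * loss for w, loss in zip(weights, losses))
```

With margins such as `[0.1, 1.0, 3.0]`, the first sample receives the largest weight, so its loss dominates the combined objective. A lower `temperature` concentrates the weight further on the nearest samples.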
Abstract
Large language models are increasingly embedded into systems that interact with user data, retrieved web content, and external tools, creating a new attack surface: prompt injection, where malicious commands embedded in untrusted data override the trusted command and induce unintended behavior. Existing defenses mainly rely on fine-tuning the model to preserve an explicit boundary between trusted commands and the untrusted data portion, so that the model learns to prioritize the trusted field and ignore malicious commands in data. However, we observe that while these defenses can block obviously malicious responses caused by injected commands, they generalize poorly to real-world scenarios where the model's response to the injected command is much nearer to the correct response. This is because existing methods typically train against only a fixed set of hand-crafted attack targets,…
