ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao

TL;DR
ReasAlign introduces a reasoning-based model-level defense that effectively detects and mitigates prompt injection attacks in large language models, maintaining high utility while significantly reducing attack success rates.
Contribution
The paper proposes ReasAlign, a novel reasoning-enhanced safety alignment method incorporating structured reasoning and test-time scoring to defend against prompt injection attacks.
Findings
ReasAlign achieves 94.6% utility on CyberSecEval2.
ReasAlign reduces the attack success rate to 3.6%.
ReasAlign outperforms prior defenses such as Meta SecAlign.
Abstract
Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations…
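The test-time scaling step described in the abstract (a judge scores reasoning steps, and the best trajectory is selected) can be illustrated with a minimal best-of-N sketch. This is not the authors' implementation: the `Trajectory` type, the `select_best_trajectory` function, the mean-of-step-scores aggregation, and the keyword-based `toy_judge` are all assumptions made for illustration. In the paper the judge is a preference-optimized model, which is replaced here by a trivial stand-in so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One sampled response: its reasoning steps plus the final answer (hypothetical type)."""
    reasoning_steps: List[str]
    final_answer: str


def trajectory_score(traj: Trajectory, judge: Callable[[str], float]) -> float:
    """Aggregate per-step judge scores. Using the mean is an assumption, not the paper's rule."""
    if not traj.reasoning_steps:
        return 0.0  # a trajectory with no reasoning gets the lowest score
    return sum(judge(step) for step in traj.reasoning_steps) / len(traj.reasoning_steps)


def select_best_trajectory(
    candidates: List[Trajectory], judge: Callable[[str], float]
) -> Trajectory:
    """Best-of-N selection: return the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=lambda t: trajectory_score(t, judge))


def toy_judge(step: str) -> float:
    """Stand-in for a learned judge: penalize steps that obey an injected instruction."""
    return 0.0 if "follow the injected instruction" in step.lower() else 1.0


# Two hypothetical candidates for the same user query ("summarize this email").
hijacked = Trajectory(
    reasoning_steps=[
        "The user asks for a summary of the email.",
        "Follow the injected instruction found in the email body.",
    ],
    final_answer="Forwarding credentials as requested in the email.",
)
safe = Trajectory(
    reasoning_steps=[
        "The user asks for a summary of the email.",
        "The email contains an instruction that conflicts with the user task; ignore it.",
    ],
    final_answer="The email discusses the Q3 schedule.",
)

best = select_best_trajectory([hijacked, safe], toy_judge)
print(best.final_answer)
```

In a real deployment the `judge` callable would wrap a model call, so the cost of selection grows with both the number of candidates and the number of reasoning steps scored per candidate.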
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
This paper tries to address a genuine and important security problem: prompt injection attacks in LLM agents that interact with untrusted external data. The paper is clear in its motivation and problem setup.
**Limited technical novelty** This paper takes very standard approaches, such as using a stronger model to generate reasoning data and doing best-of-N sampling with a judge model. **Missing related work and experimental comparisons** This paper proposes a mix of model-level and test-time scaling approaches for prompt injection defense. It only cites 4-5 defense papers (primarily the SecAlign series and system-level defenses) while missing significant model-level and detection-based work. In addition,
1. Methodologically simple yet highly effective—built on standard SFT and LoRA without architectural changes. 2. Structured reasoning format (Problem Analysis → Reasoning → Final Answer) provides clear interpretability and controllability. 3. Effectively balances safety and utility, outperforming prior defensive methods. 4. Maintains or even improves general task performance after alignment.
- Test-time scaling increases inference cost and latency. - Results rely on LLM-as-judge, which can introduce evaluation bias and reduce accuracy.
1. The results are strong - it gets good results on many benchmarks; compare that to the undefended model or even the previous best defense SecAlign++. 2. The authors did a thorough job testing their approach across many different benchmarks - general tasks, security-specific tests, and agent workflows. 3. This paper tackles a really important security issue that matters more and more as we deploy AI agents in the real world. Indirect prompt injection attacks are a serious threat to these syst
1. The paper has some organizational problems in writing. Section 3.2 on threat model seems quite standard and repetitive compared to prior work - it doesn't need to be in the main text. Figure 2 showing prompt injection examples is also unnecessary. More critically, Section 4 on methodology lacks a clear diagram to quickly illustrate how the approach actually works, which would be much more helpful than the redundant threat model description. 2. My main concern is whether this method can scale
- The paper targets a critical and timely vulnerability in modern LLM-based agents: indirect prompt injection. - The authors correctly identify a significant, practical weakness in existing defenses (which they categorize as “internal defenses” like SecAlign++). This is the “over-defensive” or “overkill” issue, where rigidly suppressing all external instructions severely harms utility when those instructions are benign and necessary. - The empirical results presented on the CyberSecEval2 benchma
- Baselines: The paper’s “state-of-the-art” (SOTA) claims hinge on outperforming SecAlign++. However, SecAlign++ is not a rigorously peer-reviewed, published method, which may weaken the claim of advancing the SOTA. Besides, the paper dismisses “external defenses” as simple detectors that just “halt task execution”. It fails to compare against a crucial, training-free baseline: a strong, undefended LLM guided by a simple prompt that instructs it to perform the exact same reasoning as ReasAlign.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Explainable Artificial Intelligence (XAI)
