Soft Instruction De-escalation Defense

Nils Philipp Walter; Chawin Sitawarin; Jamie Hayes; David Stutz; Ilia Shumailov

arXiv:2510.21057·cs.CR·January 21, 2026

Soft Instruction De-escalation Defense

Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes, David Stutz, Ilia Shumailov

PDF

4 Reviews

TL;DR

This paper introduces SIC, an iterative prompt sanitization method for LLM agents that detects and mitigates malicious instructions, improving security against prompt injections with a multi-pass correction process.

Contribution

The paper presents SIC, a novel iterative prompt sanitization loop that enhances LLM agent security by detecting and correcting malicious instructions through multiple passes.

Findings

01

SIC reduces prompt injection success rate to below 15%.

02

Multiple sanitization passes improve detection of malicious content.

03

Worst-case analysis shows SIC is not completely foolproof.

Abstract

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed…

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01Rating 6Confidence 3

Strengths

1. Combining rewriting, canary detection, and chunked classification creates defense-in-depth that is harder to bypass than single-layer approaches. 2. Testing across multiple SOTA models (GPT-4o, Qwen3-32B, Kimi-k2, GPT-4.1-mini) and three task domains demonstrates generalizability on standard benchmarks 3. Unlike many security papers, the authors conduct worst-case analysis and clearly document three failure modes with concrete examples Strong performance on standard attacks: Achieving 0% ASR

Weaknesses

1. The paper assumes white-box access (Section 3) but the adaptive attack reveals the defense relies on assumptions (e.g., "instructions are imperative") that adversaries can trivially violate. The threat model should explicitly state what adversarial capabilities are not covered. 2. Section 4.2 claims "latency remains small" but provides no actual measurements. For production systems processing thousands of requests, the cost of R+1+k LLM calls per input could be prohibitive.

Reviewer 02Rating 6Confidence 3

Strengths

- SIC combines iterative rewriting and detection to establish a “soft control” defense mechanism. This design balances defense effectiveness with performance and offers strong deployment feasibility. - Extensive evaluations were conducted on the AgentDojo benchmark, covering various models and attack scenarios. Comparisons with other defenses, such as MELON and PI-GUARD, further validate the effectiveness of SIC.

Weaknesses

- The paper presents a theoretical analysis of SIC’s latency but lacks experimental comparisons with other defense methods. This limitation makes it difficult for readers to fully assess SIC’s overall performance across the “security–utility–efficiency” trade-off. It is recommended to include detailed latency comparison experiments among different defense methods to further strengthen the practical validation of the approach. - The paper does not specify the exact auxiliary LLM model used in the

Reviewer 03Rating 6Confidence 3

Strengths

1. The SIC method maintains 0% ASR across various attack types and models, demonstrating significant robustness. 2. The comparative analysis system is comprehensive, including different models and existing defense methods. 3. Ablation experiments are included, explaining the underlying reasons why the chunking mechanism reduces attack success rates.

Weaknesses

1. The analysis of false positive sources is relatively brief, only mentioning "instruction-like statements," lacking more specific classification or mitigation strategies. 2. The experiments primarily focus on plaintext prompt injection, lacking validation against more covert multi-turn or cross-modal attacks. 3. There is insufficient detailed evaluation of defense overhead, such as computational resource consumption and response latency. 4. In multilingual environments, can SIC still effect

Reviewer 04Rating 6Confidence 4

Strengths

1. By employing a preprocessing module or introducing an LLM-as-a-Judge component, the method avoids modifying internal model parameters, making it friendly to black-box models. 2. The multi-round strategy is more robust than a one-shot approach, and experimental results strongly demonstrate the effectiveness of this iterative mechanism. 3. The method is efficient and parallelizable, while maintaining the original task performance, which is crucial for its practical applicability.

Weaknesses

1. Some detector or rewriter designs rely on external LLMs. Has the paper considered the scenario where the attack itself targets these external LLMs? Could this lead to delayed defense response or even worse cascading failures? 2. The performance of both the rewriter and the detector depends heavily on the quality of their prompt templates. Combined with the first concern, how does the framework ensure the robustness and diversity of these templates under adversarial conditions? 3. Regarding th

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.