Black-Box Guardrail Reverse-engineering Attack
Hongwei Yao, Yun Xia, Shuo Shao, Haoran Shi, Tong Qiao, Cong Wang

TL;DR
This paper introduces a reinforcement learning-based method to reverse-engineer guardrails in large language models, revealing significant security vulnerabilities and exposing the need for more robust safety mechanisms.
Contribution
It presents what the authors call the first systematic approach to reverse-engineering LLM guardrails: a reinforcement learning framework with genetic algorithm-driven data augmentation that produces high-fidelity surrogate models.
Findings
Achieves a rule matching rate above 0.92 on commercial LLMs
Requires less than $85 in API costs to mount the attack
Exposes critical vulnerabilities in current guardrail designs
Abstract
Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of victim guardrails. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely…
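The loop described in the abstract (collect input-output pairs, prioritize cases where the surrogate diverges from the victim, then mutate and cross over those cases) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the victim is a hypothetical keyword blocklist, the surrogate is a bag-of-words perceptron standing in for the paper's reinforcement learning component, and all names (`victim_guardrail`, `Surrogate`, `run_attack`) are invented.

```python
import random
from collections import defaultdict

VOCAB = ["weather", "recipe", "poem", "exploit", "malware", "bypass", "travel", "music"]
HIDDEN_BLOCKLIST = {"exploit", "malware"}  # toy stand-in for the victim's hidden policy


def victim_guardrail(prompt):
    """Black-box oracle (hypothetical): True means the prompt is refused."""
    return any(tok in HIDDEN_BLOCKLIST for tok in prompt.split())


class Surrogate:
    """Bag-of-words perceptron; a simple substitute for the paper's RL-trained surrogate."""

    def __init__(self):
        self.w = defaultdict(float)
        self.bias = 0.0

    def predict(self, prompt):
        return self.bias + sum(self.w[t] for t in set(prompt.split())) > 0

    def update(self, prompt, label):
        """Apply one perceptron step; return True if the surrogate was wrong."""
        if self.predict(prompt) == label:
            return False
        step = 1.0 if label else -1.0
        self.bias += step
        for t in set(prompt.split()):
            self.w[t] += step
        return True


def mutate(prompt, rng):
    """Replace one randomly chosen token."""
    toks = prompt.split()
    toks[rng.randrange(len(toks))] = rng.choice(VOCAB)
    return " ".join(toks)


def crossover(a, b, rng):
    """Splice the head of one prompt onto the tail of another."""
    ta, tb = a.split(), b.split()
    cut = rng.randrange(1, len(ta))
    return " ".join(ta[:cut] + tb[cut:])


def fit(surrogate, data, max_epochs=100):
    """Replay all collected pairs until the surrogate makes no mistakes."""
    for _ in range(max_epochs):
        if sum(surrogate.update(p, y) for p, y in data.items()) == 0:
            break


def run_attack(rounds=10, pop=20, seed=0):
    rng = random.Random(seed)
    surrogate, data = Surrogate(), {}
    population = [" ".join(rng.choices(VOCAB, k=3)) for _ in range(pop)]
    for _ in range(rounds):
        for p in population:  # query the victim and record its decisions
            data[p] = victim_guardrail(p)
        # Divergence cases: prompts where the current surrogate disagrees with the victim.
        divergent = [p for p in population if surrogate.predict(p) != data[p]]
        fit(surrogate, data)
        parents = divergent or population
        # Next generation: crossover plus mutation around the divergence cases.
        population = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop)
        ]
    return surrogate, data


def fidelity(surrogate, prompts):
    """Fraction of prompts on which surrogate and victim give the same decision."""
    return sum(surrogate.predict(p) == victim_guardrail(p) for p in prompts) / len(prompts)


if __name__ == "__main__":
    surrogate, data = run_attack()
    print(f"queries: {len(data)}, agreement on collected prompts: {fidelity(surrogate, list(data)):.2f}")
```

In this toy setting the victim's policy is linearly separable, so the perceptron is guaranteed to fit every collected pair; a real guardrail offers no such guarantee, and agreement on collected queries says nothing by itself about whether the underlying rules were recovered.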
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The vulnerability of LLM guardrails, which this paper focuses on, is an important and pressing issue. Understanding the internal mechanisms of black-box models (even their safety mechanisms) is crucial for building more robust systems.
Questionable Novelty and Lack of Comparison with Related Work: The authors claim this is the "first study" of black-box guardrail reverse-engineering. This assertion appears unsubstantiated. The field of LLM reverse-engineering, particularly "Prompt Stealing" or "Prompt Extraction," is already well-researched (see citation below). Prompt stealing is, in essence, a form of reverse-engineering attack targeting system instructions or guardrails. The critical flaw of this paper is its complete fail
The paper poses an interesting question: can commercial LLM guardrails be systematically reverse-engineered through black-box queries alone? It provides the first empirical demonstration that behavioral extraction is feasible on real-world systems with modest resources.
- The authors claim to identify a "new class of vulnerabilities" in guardrails that expose observable decision patterns, but this is misleading. The vulnerability they exploit—information leakage through black-box input-output queries—is a well-established problem in machine learning security. The actual contribution is demonstrating that existing vulnerabilities apply to guardrail components and developing a specific attack method (GRA) for this context, not discovering a fundamentally new vuln
1. The paper tackles an emerging and practically relevant problem: the vulnerability of guardrails in deployed LLMs. Considering the current industry emphasis on AI safety, this topic is both urgent and important.
2. The authors present a well-structured description of the threat model, including attacker goals, capabilities, and system assumptions. This improves reproducibility and clarity.
3. The proposed combination of reinforcement learning and genetic augmentation is conceptually simple bu
1. Although the paper classifies guardrails into alignment-based, model-based, and rule-based types, the proposed method seems mainly tailored for model-based guardrails. It is unclear how GRA would handle alignment-based guardrails, which evolve dynamically as models are updated.
2. The internal policies of commercial systems are inaccessible, so it is impossible to verify whether the surrogate model truly recovers the underlying rules or merely mimics surface-level behavior. The reported “Rule
The paper proposes a systematic analysis and empirically demonstrates guardrail extraction under a pure black-box setting.
The comparison metric focuses mainly on binary refusal vs. allowance, which oversimplifies real guardrail behavior (e.g., soft refusals, partial redactions). The paper does not examine cases where multiple concurrent guardrails (e.g., alignment + rule-based moderation) jointly affect decisions; hence, it remains unclear if GRA can disentangle or recover overlapping policies. The proposed countermeasures (monitoring, adaptive rejection, dynamic policies) are conceptually useful but not experime
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security
