Black-Box Guardrail Reverse-engineering Attack

Hongwei Yao; Yun Xia; Shuo Shao; Haoran Shi; Tong Qiao; Cong Wang

arXiv:2511.04215·cs.CR·November 7, 2025

Open Access 4 Reviews

TL;DR

This paper introduces a reinforcement learning-based method to reverse-engineer guardrails in large language models, revealing significant security vulnerabilities and exposing the need for more robust safety mechanisms.

Contribution

It presents the first systematic approach to reverse-engineering LLM guardrails using a genetic algorithm-driven reinforcement learning framework, achieving high fidelity in surrogate models.

Findings

01

Achieves over 0.92 rule matching rate on commercial LLMs

02

Requires less than $85 in API costs to mount the attack

03

Exposes critical vulnerabilities in current guardrail designs

Abstract

Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of victim guardrails. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely…
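The loop the abstract describes (collect input-output pairs, prioritize divergence cases, then mutate and cross over) can be illustrated with a toy sketch. This is not the authors' implementation: the victim guardrail, its hidden keyword policy, the vocabulary, and every function name below are hypothetical, and a simple per-word perceptron stands in for the reinforcement learning-trained surrogate.

```python
import random

# Hypothetical hidden policy of a toy victim; a real attacker never sees this.
BLOCKED = {"exploit", "malware"}
# Hypothetical vocabulary used to build and mutate fixed-length probe prompts.
VOCAB = ["write", "a", "poem", "about", "exploit",
         "malware", "cats", "code", "the", "story"]


def victim_guardrail(prompt: str) -> bool:
    """Toy black-box victim: True means the prompt is refused."""
    return any(word in BLOCKED for word in prompt.split())


class Surrogate:
    """Per-word perceptron, an assumed stand-in for the paper's RL surrogate."""

    def __init__(self) -> None:
        self.weights: dict[str, float] = {}

    def predict(self, prompt: str) -> bool:
        score = sum(self.weights.get(w, 0.0) for w in set(prompt.split()))
        return score > 0.5

    def update(self, prompt: str, refused: bool) -> None:
        # Adjust weights only when the surrogate disagrees with the victim.
        if self.predict(prompt) != refused:
            delta = 1.0 if refused else -1.0
            for w in set(prompt.split()):
                self.weights[w] = self.weights.get(w, 0.0) + delta


def mutate(prompt: str, vocab: list[str], rng: random.Random) -> str:
    """Replace one randomly chosen word with a random vocabulary word."""
    words = prompt.split()
    words[rng.randrange(len(words))] = rng.choice(vocab)
    return " ".join(words)


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover; both prompts need at least two words."""
    wa, wb = a.split(), b.split()
    cut = rng.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])


def run_attack(rounds: int = 30, pop_size: int = 20, seed: int = 0) -> Surrogate:
    rng = random.Random(seed)
    population = [" ".join(rng.choice(VOCAB) for _ in range(4))
                  for _ in range(pop_size)]
    surrogate = Surrogate()
    for _ in range(rounds):
        # 1. Query the victim to collect input-output pairs.
        labelled = [(p, victim_guardrail(p)) for p in population]
        # 2. Record the prompts where surrogate and victim currently diverge.
        divergent = [p for p, y in labelled if surrogate.predict(p) != y]
        # 3. Fit the surrogate on the newly labelled pairs.
        for p, y in labelled:
            surrogate.update(p, y)
        # 4. Breed the next queries from divergence cases when any exist.
        parents = divergent or population
        population = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   VOCAB, rng)
            for _ in range(pop_size)
        ]
    return surrogate


def fidelity(surrogate: Surrogate, n: int = 200, seed: int = 1) -> float:
    """Fraction of fresh random probes on which surrogate and victim agree."""
    rng = random.Random(seed)
    probes = [" ".join(rng.choice(VOCAB) for _ in range(4)) for _ in range(n)]
    agree = sum(surrogate.predict(p) == victim_guardrail(p) for p in probes)
    return agree / n


if __name__ == "__main__":
    print("agreement with victim:", fidelity(run_attack()))
```

The agreement score here is a plain refuse/allow match rate on random probes, chosen for simplicity; it is not the paper's rule matching rate, whose definition is not given in this summary.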

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

The vulnerability of LLM guardrails, which this paper focuses on, is an important and pressing issue. Understanding the internal mechanisms of black-box models (even their safety mechanisms) is crucial for building more robust systems.

Weaknesses

Questionable Novelty and Lack of Comparison with Related Work: The authors claim this is the "first study" of black-box guardrail reverse-engineering. This assertion appears unsubstantiated. The field of LLM reverse-engineering, particularly "Prompt Stealing" or "Prompt Extraction," is already well-researched (see citation below). Prompt stealing is, in essence, a form of reverse-engineering attack targeting system instructions or guardrails. The critical flaw of this paper is its complete fail

Reviewer 02·Rating 2·Confidence 4

Strengths

The paper poses an interesting question: whether commercial LLM guardrails can be systematically reverse-engineered through black-box queries alone. It provides the first empirical demonstration that behavioral extraction is feasible on real-world systems with modest resources.

Weaknesses

- The authors claim to identify a "new class of vulnerabilities" in guardrails that expose observable decision patterns, but this is misleading. The vulnerability they exploit—information leakage through black-box input-output queries—is a well-established problem in machine learning security. The actual contribution is demonstrating that existing vulnerabilities apply to guardrail components and developing a specific attack method (GRA) for this context, not discovering a fundamentally new vuln

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper tackles an emerging and practically relevant problem: the vulnerability of guardrails in deployed LLMs. Considering the current industry emphasis on AI safety, this topic is both urgent and important.
2. The authors present a well-structured description of the threat model, including attacker goals, capabilities, and system assumptions. This improves reproducibility and clarity.
3. The proposed combination of reinforcement learning and genetic augmentation is conceptually simple bu

Weaknesses

1. Although the paper classifies guardrails into alignment-based, model-based, and rule-based types, the proposed method seems mainly tailored for model-based guardrails. It is unclear how GRA would handle alignment-based guardrails, which evolve dynamically as models are updated.
2. The internal policies of commercial systems are inaccessible, so it is impossible to verify whether the surrogate model truly recovers the underlying rules or merely mimics surface-level behavior. The reported “Rule

Reviewer 04·Rating 4·Confidence 4

Strengths

The paper proposes a systematic analysis and empirically demonstrates guardrail extraction under a pure black-box setting.

Weaknesses

The comparison metric focuses mainly on binary refusal vs. allowance, which oversimplifies real guardrail behavior (e.g., soft refusals, partial redactions). The paper does not examine cases where multiple concurrent guardrails (e.g., alignment + rule-based moderation) jointly affect decisions; hence, it remains unclear if GRA can disentangle or recover overlapping policies. The proposed countermeasures (monitoring, adaptive rejection, dynamic policies) are conceptually useful but not experime

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Information and Cyber Security