Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

Jialiang Fan, Shixiong Jiang, Mengyu Liu, and Fanxin Kong

arXiv:2602.16543·cs.LG·February 19, 2026

Open Access · 3 Reviews

TL;DR

This paper presents a black-box adversarial attack framework that exposes vulnerabilities in safe reinforcement learning policies by using expert demonstrations and surrogate models, without needing internal policy details.

Contribution

It introduces a novel attack method that operates with limited access, combining constraint modeling and surrogate policies to challenge Safe RL systems.

Findings

1. Effective attacks on Safe RL benchmarks under limited access
2. Theoretical bounds on perturbations and attack feasibility
3. Demonstrated vulnerability of Safe RL in realistic scenarios

Abstract

Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy's gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy's internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing…
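The attack step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the surrogate policy and the constraint model have already been learned, and stands in for them with a hypothetical tanh-linear policy (`surrogate_policy`) and a hypothetical linear cost model (`learned_cost`) so the gradient can be written by hand. All names, weights, and the values of `eps`, `step`, and `iters` are illustrative assumptions. The gradient is taken through the surrogate only, never through the victim policy.

```python
import math

def surrogate_policy(obs, W, b):
    """Hypothetical surrogate policy: a_i = tanh(sum_j W[i][j] * obs[j] + b[i])."""
    return [math.tanh(sum(w * o for w, o in zip(row, obs)) + bi)
            for row, bi in zip(W, b)]

def learned_cost(state, action, w_a, w_s):
    """Hypothetical learned constraint model, linear in the action and true state."""
    return (sum(w * a for w, a in zip(w_a, action))
            + sum(w * s for w, s in zip(w_s, state)))

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def pgd_attack(state, W, b, w_a, w_s, eps=0.1, step=0.02, iters=20):
    """Projected gradient ascent on an observation perturbation delta with
    |delta_j| <= eps, raising the learned cost of the surrogate's action.
    Returns the best perturbation found and its cost."""
    delta = [0.0] * len(state)
    best_delta = list(delta)
    best_cost = learned_cost(state, surrogate_policy(state, W, b), w_a, w_s)
    for _ in range(iters):
        obs = [s + d for s, d in zip(state, delta)]
        a = surrogate_policy(obs, W, b)
        # Chain rule: d cost / d delta_j = sum_i w_a[i] * (1 - a_i^2) * W[i][j]
        grad = [sum(w_a[i] * (1.0 - a[i] ** 2) * W[i][j] for i in range(len(W)))
                for j in range(len(state))]
        # Signed ascent step, then projection onto the L-infinity ball.
        delta = [max(-eps, min(eps, d + step * sign(g)))
                 for d, g in zip(delta, grad)]
        obs = [s + d for s, d in zip(state, delta)]
        cost = learned_cost(state, surrogate_policy(obs, W, b), w_a, w_s)
        if cost > best_cost:
            best_cost, best_delta = cost, list(delta)
    return best_delta, best_cost

# Illustrative inputs only; none of these values come from the paper.
state = [0.5, -0.2]
W = [[1.0, -0.5], [0.3, 0.8]]
b = [0.0, 0.1]
w_a = [1.0, -1.0]
w_s = [0.2, 0.0]
clean_cost = learned_cost(state, surrogate_policy(state, W, b), w_a, w_s)
delta, attacked_cost = pgd_attack(state, W, b, w_a, w_s)
```

The cost is evaluated on the true state while the policy sees the perturbed observation, which is one common way to model an observation attack. Whether the paper's objective matches this exactly cannot be confirmed from the truncated abstract.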

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper provides a practical and realistic vulnerability assessment of Safe RL by considering adversaries closer to real-world scenarios. This perspective enhances the practical value of robustness evaluation for Safe RL methods.

Weaknesses

The novelty of the proposed method is limited. It is essentially a straightforward combination of ICRL and standard gradient-based adversarial optimization. The theoretical analysis in Section 4.3 also lacks originality. Theorem 1 can be almost trivially derived from [1], and Lemmas 2, 3, and 4 follow directly from their assumptions. Although Remark 1 claims the effectiveness of estimating the Lipschitz constant, this claim is not empirically validated. Moreover, the proposed method is computat

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Introducing an inverse-learning formulation to analyze Safe RL vulnerabilities is original and conceptually strong.
2. The method works under limited access assumptions, making it applicable to realistic black-box attack settings.
3. This work provides a theoretical analysis of the attack performance bound, which strengthens the work's contribution.

Weaknesses

1. The methodology is relatively trivial: it combines ICRL to learn the constraint policy with an approximate agent policy to conduct a gradient-based attack.
2. The experiments were conducted on only two toy environments, which is not general enough to establish the effectiveness of the proposed method. I would like to see results on more environments.
3. The experiments compare only with other attack methods and do not include defense baselines.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper shows that it is possible to perturb an agent's states to increase its safety cost even without knowing the cost function a priori, as long as the attacker has access to a set of clean trajectories and the agent's policy and reward functions.

Weaknesses

1. The paper applies the ICRL method in [Kim et al., 2024] to estimate the agent's cost function and then leverages the standard PGD attack. The technical contribution is low.
2. While the paper claims its approach to be gradient-free, it looks like it still needs the gradient information. In Algorithm 1, line 4, the PGD method requires evaluating the gradient with respect to s', which involves the gradient of pi_E.
3. The attacker's objective is confusing. According to (4), it appears the att

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Infrastructure Resilience and Vulnerability Analysis