Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning
Oswin So, Eric Yang Yu, Songyuan Zhang, Matthew Cleaveland, Mitchell Black, Chuchu Fan

TL;DR
This paper introduces Feasibility-Guided Exploration (FGE), a reinforcement learning method that identifies feasible initial conditions and learns safe policies, improving coverage in reachability tasks with unknown feasibility.
Contribution
The paper proposes FGE, a novel RL approach that simultaneously discovers feasible initial states and learns policies, addressing the challenge of unknown feasibility in reachability problems.
Findings
FGE achieves over 50% more coverage than existing methods.
FGE successfully handles challenging initial conditions in MuJoCo and Kinetix simulations.
FGE effectively identifies feasible initial conditions for safe policy learning.
Abstract
Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe…
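The mismatch described in the abstract can be made concrete with a small toy sketch. It is not taken from the paper: the five parameters, their probabilities, and the two hypothetical policies `A` and `B` are invented purely for illustration. It compares an expectation-style objective (probability-weighted safety) with a coverage-style objective (fraction of parameters on which the policy stays safe).

```python
# Toy illustration only: every number and name here is an assumption,
# not a result or definition from the paper.

# User-specified distribution over five hypothetical initial conditions.
PROBS = [0.5, 0.3, 0.15, 0.03, 0.02]

# 1 = the policy stays safe from that initial condition, 0 = it does not.
POLICIES = {
    "A": [1, 1, 1, 0, 0],  # safe on the common conditions, fails on the rare ones
    "B": [1, 1, 0, 1, 1],  # fails on one common condition, safe on the rare ones
}


def expected_safety(probs, safe):
    """Probability-weighted safety rate (what an expected-return objective sees)."""
    return sum(p * s for p, s in zip(probs, safe))


def coverage(safe):
    """Fraction of initial conditions on which the policy is safe."""
    return sum(safe) / len(safe)


best_by_expectation = max(POLICIES, key=lambda k: expected_safety(PROBS, POLICIES[k]))
best_by_coverage = max(POLICIES, key=lambda k: coverage(POLICIES[k]))

print("preferred under expectation:", best_by_expectation)
print("preferred under coverage:", best_by_coverage)
```

In this toy setting the two objectives rank the policies differently, which is the kind of disagreement the abstract attributes to low-probability states that are still within the safe set.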
Peer Reviews
Decision: ICLR 2026 Poster
- **Theoretical grounding:** The paper gives a clear problem formulation, shows equivalence between its indicator-style objective and a reachability formulation (Lemma 1), and analyzes properties of the learned feasibility classifier (Theorem 1), which underpin the empirical design choices.
- **Clear, practical algorithmic pipeline:** FGE is presented as an algorithm that can be dropped on top of any on-policy method (PPO in the experiments), combining feasibility learning, rejection-sampling exp
- **Notation and readability:**
  - The abstract and introduction are hard to read because of vague sentences. For example, there is repeated use of terms like "initial conditions" and "initial parameters" throughout without defining what they mean in the context of this paper (this is only done in the next section). I initially thought "initial conditions" meant the initial conditions of an optimisation process, and not the distribution/set of initial states. Similarly, I thought initial param
* The paper is solving a complex problem with an interesting solution, and good results
* There’s a good ablation study
* The exposition of results is great! I wish more papers would present the results this way. There’s a thesis of what algorithm does better than competitors and the supporting evidence.
* The paper’s flow can be improved as it is not easy to understand the problem and the contribution at the first read. While it’s not easy to write a complicated contribution in an easy way, some steps could be taken:
  * Provide a concrete example of the problem - why is it important?
  * Try to avoid a bottom-up approach to explaining the solution as much as possible. When there are a lot of steps to get from point A to point B, the reader may lose the thread of these explanations. For insta
- It's interesting how the paper connects feasible-set estimation with robust policy optimization. The formulation naturally follows from the parameter-robust avoid problem and builds a nice bridge between reachability analysis and safe RL.
- The proposed method is empirically validated on several benchmarks, showing consistent improvements in safe coverage. Theoretical claims are somewhat idealized but conceptually sound (see Weaknesses for details).
- Overall, the paper offers a useful appr
- Theorem 1’s guarantees (zero false positives and controllable false negatives) hold only for a fixed policy $\pi$. But FGE continuously updates $\pi$ using PPO, making the rollout distribution $\rho$ and the conditional success probability non-stationary. This violates the theorem’s assumptions and undermines the claimed guarantee.
- Since FGE only samples parameters classified as infeasible, any false positive (infeasible but predicted feasible) is never revisited. This can cause the algor
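The reviewer's second concern can be illustrated with a minimal sketch. It follows the reviewer's characterization (sampling only parameters that are not currently classified as feasible) and is not the paper's actual algorithm; the six-element parameter set, the `predicted_feasible` labels, and the helper names are all hypothetical.

```python
import random


def rejection_sample(propose, accept, rng, max_tries=10_000):
    """Draw proposals until one is accepted (plain rejection sampling)."""
    for _ in range(max_tries):
        theta = propose(rng)
        if accept(theta):
            return theta
    raise RuntimeError("no proposal was accepted")


# Hypothetical setup: six parameters, indexed 0..5.
NUM_PARAMS = 6
# Assumed classifier output; parameter 4 plays the role of a false positive
# (truly infeasible, but predicted feasible).
predicted_feasible = {0, 1, 4}


def propose(rng):
    """Uniform proposal over the hypothetical parameter set."""
    return rng.randrange(NUM_PARAMS)


def accept(theta):
    """Keep only parameters that are not currently predicted feasible."""
    return theta not in predicted_feasible


rng = random.Random(0)
draws = [rejection_sample(propose, accept, rng) for _ in range(1000)]
visit_counts = {theta: draws.count(theta) for theta in range(NUM_PARAMS)}
print(visit_counts)
```

With a frozen classifier, parameter 4 receives no visits, so no new rollouts are collected that could correct its label. Whether the actual method has a mechanism to avoid this is not something this excerpt settles.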
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Adaptive Dynamic Programming Control
