SafeOR-Gym: A Benchmark Suite for Safe Reinforcement Learning Algorithms on Practical Operations Research Problems

Asha Ramanujam (1), Adam Elyoumi (1), Hao Chen (1), Sai Madhukiran Kompalli (1), Akshdeep Singh Ahluwalia (1), Shraman Pal (1), Dimitri J. Papageorgiou (2), and Can Li (1) ((1) Davidson School of Chemical Engineering, Purdue University, West Lafayette, IN (2) Energy Sciences, ExxonMobil Technology and Engineering Company, Annandale, NJ)

arXiv:2506.02255·cs.LG·June 4, 2025

Open Access·4 Reviews

TL;DR

SafeOR-Gym introduces a set of nine complex, real-world-inspired environments for safe reinforcement learning, addressing a gap in benchmarks for industrial and high-stakes decision-making tasks involving structured constraints.

Contribution

It provides a novel benchmark suite tailored for safe RL in operations research problems with complex constraints and hybrid action spaces, facilitating progress in practical applications.

Findings

01

Current safe RL algorithms show varied performance across environments.

02

Some tasks are solvable, revealing strengths of existing methods.

03

Others expose fundamental limitations, highlighting areas for improvement.

Abstract

Most existing safe reinforcement learning (RL) benchmarks focus on robotics and control tasks, offering limited relevance to high-stakes domains that involve structured constraints, mixed-integer decisions, and industrial complexity. This gap hinders the advancement and deployment of safe RL in critical areas such as energy systems, manufacturing, and supply chains. To address this limitation, we present SafeOR-Gym, a benchmark suite of nine operations research (OR) environments tailored for safe RL under complex constraints. Each environment captures a realistic planning, scheduling, or control problem characterized by cost-based constraint violations, planning horizons, and hybrid discrete-continuous action spaces. The suite integrates seamlessly with the Constrained Markov Decision Process (CMDP) interface provided by OmniSafe. We evaluate several state-of-the-art safe RL algorithms…
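To make the abstract's terms concrete, here is a minimal sketch of what a CMDP-style environment with a hybrid action and a separate cost channel can look like. Everything in it is an assumption for illustration: `ToyInventoryCMDP`, its parameters, and its dynamics are hypothetical, and the code is neither a SafeOR-Gym environment nor OmniSafe code. It only shows the general idea of reporting constraint violations as a cost signal alongside the reward rather than folding them into the reward.

```python
from dataclasses import dataclass
import random


@dataclass
class ToyInventoryCMDP:
    """Hypothetical single-item inventory problem (not a SafeOR-Gym environment)."""

    capacity: float = 100.0   # warehouse capacity (the safety constraint)
    horizon: int = 10         # planning horizon in periods
    price: float = 2.0        # revenue per unit sold
    order_cost: float = 1.0   # purchase cost per unit ordered
    seed: int = 0

    def reset(self):
        self.rng = random.Random(self.seed)
        self.t = 0
        self.stock = 20.0
        return (self.stock, self.t)

    def step(self, action):
        # Hybrid action: a discrete on/off decision plus a continuous quantity.
        place_order, quantity = action
        ordered = max(0.0, float(quantity)) if place_order else 0.0
        level = self.stock + ordered

        # Cost channel: amount by which the capacity constraint is violated.
        cost = max(0.0, level - self.capacity)
        level = min(level, self.capacity)  # overflow is discarded

        demand = self.rng.uniform(5.0, 15.0)
        sold = min(level, demand)
        self.stock = level - sold

        # Reward channel: profit only, with no penalty term for the violation.
        reward = self.price * sold - self.order_cost * ordered

        self.t += 1
        truncated = self.t >= self.horizon  # episode ends at the horizon
        info = {"demand": demand}
        return (self.stock, self.t), reward, cost, False, truncated, info


env = ToyInventoryCMDP()
obs = env.reset()
episode_reward, episode_cost = 0.0, 0.0
done = False
while not done:
    # Naive fixed policy: always order 30 units.
    obs, reward, cost, terminated, truncated, info = env.step((1, 30.0))
    episode_reward += reward
    episode_cost += cost
    done = terminated or truncated
print(f"return={episode_reward:.1f}, cumulative cost={episode_cost:.1f}")
```

Keeping the cost separate lets a constrained algorithm bound the cumulative cost directly, instead of relying on a hand-tuned penalty weight inside the reward.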

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 3

Strengths

- **Well-motivated contribution:** The authors target an important mismatch: popular safe-RL benchmarks are control/robotics focused, while many real safety problems are OR problems with mixed-integer structure and long horizons. SafeOR-Gym is clearly targeted to that gap.
- **Implementation quality:** Environments expose explicit cost channels (CMDP wrapper) rather than only reward penalties, making them directly useful for constraint-aware algorithms in OmniSafe and similar toolkits.
- **Empir

Weaknesses

- **Environment diversity, but overlapping algorithmic challenges:** The nine environments span a good range of OR-inspired domains, from power systems to scheduling and process control. However, several share similar underlying learning characteristics, such as long-horizon decision-making, hybrid discrete–continuous actions, and feasibility-based safety constraints. For instance, RTN and STN represent closely related scheduling formulations. As a result, while the suite is cohesive and well-

Reviewer 02·Rating 4·Confidence 4

Strengths

The main strength of this work lies in its attempt to model complex and structured OR problems within a safe RL framework. The environments are well-motivated, grounded in realistic operational settings, and go beyond traditional control benchmarks like Safety Gym. Integrating these tasks with OmniSafe provides immediate usability and reproducibility, which makes SafeOR-Gym a valuable contribution to the community. The paper also highlights an important empirical finding: most current safe RL a

Weaknesses

The structure of the paper is not fully clear. While nine environments are introduced, only a few (two or three) are described in sufficient detail in the main text, with the rest delegated to the supplementary material. A concise comparative summary (e.g., a table of environment sizes, horizon lengths, constraint types, and stochasticity) would help readers understand their diversity and modeling differences. Moreover, although the authors refer to these tasks as real-world, most environments

Reviewer 03·Rating 2·Confidence 3

Strengths

1. The paper systematically evaluates CMDP-based safe RL algorithms in environments with structured constraints, mixed-integer decision structure, and hybrid discrete–continuous actions, spanning planning, power systems, chemical process control, and maintenance scheduling. The results clearly show that many widely used algorithms fail on these tasks.
2. The benchmark suite is implemented to be directly usable: it exposes a CMDP interface through OmniSafe while staying compatible with standard Gy

Weaknesses

1. The paper devotes a large amount of space to domain-specific operational details of each environment (e.g., industrial process assumptions, power system structure, scheduling rules), but provides relatively little insight into why current CMDP-style algorithms fail. The work reports outcomes (“algorithm X fails here, succeeds there”) but does not analyze which aspects of the tasks (e.g., mixed integers, nonconvex feasible sets, long-horizon credit assignment under feasibility penalties) are r

Reviewer 04·Rating 2·Confidence 3

Strengths

The key point is the provision of a set of different tasks with solid and well-documented representations, and strong baselines for comparison. The goal is to provide the community with a testbed beyond robotics and control tasks, which often treat constraint violations as merely negative rewards. The paper is mostly well written and covers similar works available in the literature.

Weaknesses

Although the work is undoubtedly well motivated, several aspects could be improved to enhance clarity and strengthen the contribution. **Introduction:** In the second paragraph, the discussion around OmniSafe occupies considerable space and may appear premature. At this early stage, readers who are not already familiar with the framework might find this section confusing, since its relevance to the benchmark’s motivation is not yet established. This content would be better placed later in the p

Taxonomy

Topics: Scheduling and Optimization Algorithms