GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL
Zifan Liu, Xinran Li, Shibo Chen, Jun Zhang

TL;DR
GAS is a novel algorithm that improves offline safe reinforcement learning by enhancing trajectory stitching and by balancing reward maximization against cost constraints through dataset augmentation, relabeling, and goal estimation.
Contribution
GAS introduces goal-assisted stitching with dataset augmentation and novel goal functions to better balance reward maximization and safety constraints in offline safe RL.
Findings
Outperforms existing methods in reward-cost tradeoff
Enhances trajectory stitching from suboptimal data
Achieves more stable and efficient training
Abstract
Offline Safe Reinforcement Learning (OSRL) aims to learn a policy to achieve high performance in sequential decision-making while satisfying constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process, where GMs generate desirable actions conditioned on predefined reward and cost values. However, GM-assisted methods face two major challenges in OSRL: (1) lacking the ability to "stitch" optimal transitions from suboptimal trajectories within the dataset, and (2) struggling to balance reward targets with cost targets, particularly when they conflict. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward maximization and constraint…
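The "predefined reward and cost values" used for conditioning are commonly expressed as a return-to-go and a cost-to-go at each timestep. The following is a minimal sketch of how such targets could be derived from one logged trajectory. It assumes undiscounted suffix sums, and the names `to_go` and `conditioning_targets` are hypothetical; it is not taken from the GAS implementation.

```python
from typing import List, Sequence, Tuple


def to_go(values: Sequence[float]) -> List[float]:
    """Undiscounted suffix sums: out[t] = values[t] + values[t+1] + ..."""
    out = [0.0] * len(values)
    running = 0.0
    # Walk backwards so each step adds one more term to the running sum.
    for t in range(len(values) - 1, -1, -1):
        running += values[t]
        out[t] = running
    return out


def conditioning_targets(
    rewards: Sequence[float], costs: Sequence[float]
) -> List[Tuple[float, float]]:
    """Pair (return-to-go, cost-to-go) for every timestep of one trajectory."""
    return list(zip(to_go(rewards), to_go(costs)))


# Toy trajectory (illustrative numbers only, not from any dataset).
rewards = [1.0, 0.0, 2.0]
costs = [0.0, 1.0, 1.0]
targets = conditioning_targets(rewards, costs)
print(targets)  # [(3.0, 2.0), (2.0, 2.0), (2.0, 1.0)]
```

Targets computed this way only ever reflect what a single trajectory achieved, which is one way to see why conditioning on them alone does not combine good segments from different trajectories.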
Peer Reviews
Decision: ICLR 2026 Poster
- The paper pinpoints two gaps in GM-assisted offline safe RL: weak stitching and inadequate reward–cost balancing.
- Using goal functions to guide a constrained Advantage-Weighted Regression policy, so that actions align with feasible reward–cost targets, is an interesting idea.
- GAS adapts to different test-time constraint levels without retraining.
- In RL, "stitching" traditionally refers to combining different trajectories via bootstrapping (the Bellman backup). The paper itself identifies this as a "crucial ability" that Generative Model (GM) methods lack. However, the paper's entire premise is to avoid the Bellman backup, which it calls the "primary source of the OOD problem". It then calls its own mechanism, supervised goal estimation using expectile regression, "stitching". This is a misleading contradiction. The proposed method does
- This work tackles a practical and important problem in Offline Safe Reinforcement Learning (OSRL) with Generative Models (GMs), focusing on two key challenges: balancing reward and cost, and improving the model’s ability to stitch useful transitions from different trajectories.
- The paper presents solid theoretical support for the proposed GAS algorithm, especially through its use of expectile regression to estimate optimal reward and cost goals without relying on Bellman backups. This design
- The reward and cost goal functions are learned from offline data using expectile regression, which makes them susceptible to the biases of the dataset. If the data are unbalanced or lack high-quality transitions, the estimated goals can become either too optimistic or too conservative, leading the policy in the wrong direction. Because these functions guide the policy’s optimization, even small estimation errors can distort the balance between reward and cost or result in unsafe actions.
- The
- The paper's two motivations, i.e., the lack of stitching and the failure to balance reward and cost, are clearly presented and important to the safe offline RL community.
- GAS exhibits a consistent advantage over most baselines on DSRL in terms of reward improvement and cost constraint satisfaction, under both tight and loose constraints.
- The experiments are comprehensive. The authors also provide ablations to explain the motivation or show the effectiveness of the relabeling module.
- The "stitching" ability of GAS is somewhat over-claimed. The authors claim that GAS can stitch sub-trajectories. However, GAS seems to learn new return-to-go and cost-to-go targets instead of stitching the data. See more in the question section. I understand that true stitching can be hard, but I believe there is a large mismatch between "stitching" and GAS's implementation.
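Several reviews above refer to expectile regression as the tool for estimating reward and cost goals. The sketch below shows the standard asymmetric squared loss and a scalar fit on toy numbers; it is only an illustration of the general technique, not the paper's goal functions, which presumably depend on the state. The names `expectile_loss` and `fit_expectile` and the sample values are hypothetical.

```python
from typing import Sequence


def expectile_loss(residual: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(residual < 0)| * residual**2.

    residual = target - prediction; targets above the prediction get weight tau.
    """
    weight = tau if residual > 0 else 1.0 - tau
    return weight * residual ** 2


def fit_expectile(samples: Sequence[float], tau: float, iters: int = 100) -> float:
    """Scalar minimizer of the summed expectile loss via reweighted means."""
    estimate = sum(samples) / len(samples)
    for _ in range(iters):
        # Setting the loss gradient to zero yields a weighted mean of samples.
        weights = [tau if x > estimate else 1.0 - tau for x in samples]
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate


# Toy returns with one rare high value (illustrative numbers only).
returns = [1.0, 2.0, 3.0, 10.0]
mean_like = fit_expectile(returns, tau=0.5)    # tau = 0.5 recovers the mean
optimistic = fit_expectile(returns, tau=0.99)  # pulled toward the largest sample
print(mean_like, optimistic)
```

With `tau` close to 1 the estimate sits near the best return in the data, so a single unrepresentative sample can dominate it. This is one concrete way to read the reviewer concern about goals becoming too optimistic on unbalanced datasets.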
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
