Towards Fast Safe Online Reinforcement Learning via Policy Finetuning
Keru Chen, Honghao Wei, Zhigang Deng, Sen Lin

TL;DR
This paper introduces Marvel, a framework that uses offline safe RL to enable faster and safer online policy finetuning. It targets two obstacles in aligning the offline and online stages: erroneous Q-estimations and Lagrangian mismatch.
Contribution
It proposes a new offline-to-online safe RL framework with value pre-alignment and adaptive Lagrangian control, improving safety and efficiency in online policy learning.
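To make the Lagrangian part of the contribution concrete, here is a minimal sketch of how a Lagrange multiplier is typically adapted in constrained RL. It is the generic projected dual-ascent update, not Marvel's own controller, which this summary does not specify. The function names, the learning rate `lr`, and the `cost_limit` parameter are all assumptions made for illustration.

```python
def update_lagrange_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """One projected dual-ascent step (generic, assumed form).

    The multiplier grows when the observed cost exceeds the limit and
    shrinks otherwise; it is clipped at zero so it stays a valid penalty.
    """
    violation = episode_cost - cost_limit
    return max(0.0, lam + lr * violation)


def penalized_objective(reward_return, cost_return, lam):
    """Scalar objective a Lagrangian-based policy update would maximize."""
    return reward_return - lam * cost_return


if __name__ == "__main__":
    lam = 1.0
    # Hypothetical episode costs measured during online finetuning.
    for cost in [30.0, 28.0, 24.0, 20.0]:
        lam = update_lagrange_multiplier(lam, cost, cost_limit=25.0, lr=0.1)
        print(f"cost={cost:.1f} -> lambda={lam:.2f}")
```

A multiplier that starts too low makes early online policies unsafe, and one that starts too high makes them overly conservative, which is one way to read why the initial value carried over from the offline stage matters.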
Findings
Marvel outperforms baselines in both reward and safety.
It demonstrates effective transfer of offline policies to online finetuning.
It addresses the offline-online mismatches in Q-estimation and in the Lagrangian.
Abstract
The high costs and risks involved in extensive environment interactions hinder the practical application of current online safe reinforcement learning (RL) methods. While offline safe RL addresses this by learning policies from static datasets, its performance is usually limited by its reliance on data quality and by challenges with out-of-distribution (OOD) actions. Given recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online policy learning, a direction that has yet to be fully investigated. To fill this gap, we first demonstrate that naively applying existing O2O algorithms from standard RL would not work well in the safe RL setting due to two unique challenges: erroneous Q-estimations, resulting from offline-online objective mismatch and offline cost sparsity, and…
Taxonomy
Topics: Network Security and Intrusion Detection · Data Stream Mining Techniques · Reinforcement Learning in Robotics
Methods: ALIGN
