Towards Fast Safe Online Reinforcement Learning via Policy Finetuning

Keru Chen; Honghao Wei; Zhigang Deng; Sen Lin

arXiv:2412.04426 · cs.LG · January 26, 2026

TL;DR

This paper introduces Marvel, a novel framework that leverages offline safe RL to enable faster and safer online policy finetuning, addressing key challenges in aligning offline and online components.

Contribution

It proposes a new offline-to-online safe RL framework with value pre-alignment and adaptive Lagrangian control, improving safety and efficiency in online policy learning.
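
This summary does not spell out how Marvel's adaptive Lagrangian control works. For background only, the following is a minimal sketch of the generic projected dual-ascent multiplier update commonly used in Lagrangian-based safe RL. It is not the paper's method, and the function names, learning rate, and cost limit are illustrative assumptions.

```python
def update_lagrange_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """One projected dual-ascent step on the Lagrange multiplier.

    The multiplier grows when the observed cost exceeds the limit,
    shrinks when the policy is within budget, and is clipped at zero.
    All arguments are hypothetical names chosen for illustration.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def penalized_objective(reward_return, cost_return, lam, cost_limit):
    """Lagrangian objective the policy would maximize for a fixed lam."""
    return reward_return - lam * (cost_return - cost_limit)


if __name__ == "__main__":
    lam = 1.0
    # Simulated per-episode costs against an assumed cost limit of 25.
    for cost in [30.0, 28.0, 24.0, 20.0]:
        lam = update_lagrange_multiplier(lam, cost, cost_limit=25.0, lr=0.1)
        print(f"cost={cost:5.1f}  lambda={lam:.2f}")
```

In this generic scheme a poorly initialized multiplier can make early online updates either too conservative or too unsafe, which is the kind of offline-online Lagrangian mismatch the summary refers to.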

Findings

01. Marvel outperforms baselines in reward and safety.
02. Effective offline-online policy transfer demonstrated.
03. Addresses offline-online Q-estimation and Lagrangian mismatches.

Abstract

The high costs and risks involved in extensive environment interactions hinder the practical application of current online safe reinforcement learning (RL) methods. While offline safe RL addresses this by learning policies from static datasets, its performance is usually limited by reliance on data quality and by challenges with out-of-distribution (OOD) actions. Given recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online policy learning, a direction that has yet to be fully investigated. To fill this gap, we first demonstrate that naively applying existing O2O algorithms from standard RL would not work well in the safe RL setting due to two unique challenges: erroneous Q-estimations, resulting from offline-online objective mismatch and offline cost sparsity, and…

Code & Models

Repositories

CLIVERCHEN/Marvel-O2O_Safe_RL
PyTorch · Official

Taxonomy

Topics: Network Security and Intrusion Detection · Data Stream Mining Techniques · Reinforcement Learning in Robotics

Methods: ALIGN