FOSP: Fine-tuning Offline Safe Policy through World Models

Chenyang Cao; Yucheng Xin; Silang Wu; Longxiang He; Zichen Yan; Junbo Tan; Xueqian Wang

arXiv:2407.04942·cs.RO·March 4, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces FOSP, a method combining offline safe RL with online fine-tuning using world models to enhance safety and generalization in vision-based robotic tasks, validated through simulations and real-world experiments.

Contribution

It presents a novel offline-to-online safe RL framework that employs model-based RL and reachability guidance for improved safety and generalization in robotic tasks.

Findings

01. Significant improvement in safety generalization on unseen scenarios.
02. Effective online fine-tuning after offline pretraining.
03. Validated on five vision-only simulation tasks and real robots.

Abstract

Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the deployment of vision-based robotic tasks through online fine-tuning an offline pretrained policy. To facilitate effective fine-tuning, we introduce model-based RL, which is known for its data efficiency. Specifically, our method employs in-sample optimization to improve offline training efficiency while incorporating reachability guidance to ensure safety. After obtaining an offline safe policy, a safe policy expansion approach is leveraged for online fine-tuning. The performance of our method is validated on simulation benchmarks with five vision-only tasks and through real-world robot…
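
The abstract does not spell out what "in-sample optimization" means here. One common instance in offline RL is expectile regression, as used in implicit Q-learning, which estimates an optimistic value using only actions that appear in the dataset. The sketch below illustrates that general idea under this assumption; it is not taken from the paper, and the names `expectile_loss`, `fit_expectile`, and `tau` are hypothetical.

```python
def expectile_loss(diff, tau=0.8):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    `diff` is (target - prediction). With tau > 0.5, under-estimates
    (diff >= 0) are penalized more than over-estimates, so the fitted
    value leans toward the upper end of the in-dataset targets.
    """
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(samples, tau=0.8, iters=100):
    """Find the tau-expectile of `samples` by repeated weighted averaging.

    Each pass reweights samples by which side of the current estimate
    they fall on, then takes the weighted mean. Only the given samples
    are used, which is the "in-sample" property.
    """
    v = sum(samples) / len(samples)  # start from the plain mean
    for _ in range(iters):
        weights = [tau if x >= v else 1.0 - tau for x in samples]
        v = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return v


if __name__ == "__main__":
    q_values = [0.0, 1.0, 2.0, 3.0, 4.0]  # illustrative numbers only
    print(fit_expectile(q_values, tau=0.5))  # equals the mean
    print(fit_expectile(q_values, tau=0.9))  # pulled toward the maximum
```

With `tau = 0.5` the loss is symmetric and the estimate is the ordinary mean; raising `tau` moves the estimate toward the largest value seen in the data without ever querying actions outside it.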

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. The FOSP method introduces the concept of safe generalization within the offline-to-online reinforcement learning framework, combining world models with offline training and online fine-tuning to enhance safety and performance. This novel strategy for applying reinforcement learning in safety-critical scenarios is insightful.
2. The experimental design is comprehensive, covering tasks in simulated environments and validation in real robotic settings.
3. The experiments of FOSP in…

Weaknesses

1. While the experiments were validated in Safety-Gymnasium and real robotic environments, the number and complexity of tasks remain relatively limited.
2. The motivation for introducing the new offline-to-online setting has not been communicated well.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

Extensive experiments and ablation studies were conducted, including a real-world robot experiment with high-dimensional visual observations. The proposed FOSP algorithm achieved good performance, especially in the real-world robot experiments, and outperformed baseline approaches.

Weaknesses

While the proposed algorithm performed well in extensive experiments, I found it hard to appreciate its technical contributions due to its complex algorithm design (e.g., many moving parts, iterations between offline and online learning) and confusing presentation. It is unclear to me if all the design choices are necessary and which components are novel or come from the literature. It would be good if the authors could clearly state their technical contributions and novelty (e.g., does the nove…

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The proposed safe model-based RL framework effectively addresses offline-online generalization tasks.
2. It demonstrates the capability to safely fine-tune in previously unseen safety-constrained scenarios during real-world deployment.

Weaknesses

1. In Figure 4, the results for DreamerV3 could be omitted, as they overshadow the cost performance of all baseline methods.
2. There are too few obstacles in the real-world environment, making it difficult to assess the agent's obstacle avoidance behavior. Also, the video demos on the website do not show any differences between the proposed algorithm and the baseline.
3. In Figure 4, why do the rewards for FOSP in PointGoal2 and PointGoal1 not continue to rise? Additionally, SafeDreamer does…

Taxonomy

Topics: International Development and Aid · Global Peace and Security Dynamics