Guided Online Distillation: Promoting Safe Reinforcement Learning by Offline Demonstration
Jinning Li, Xinyi Liu, Banghua Zhu, Jiantao Jiao, Masayoshi Tomizuka, Chen Tang, Wei Zhan

TL;DR
This paper introduces GOLD, a framework that distills offline expert policies into lightweight models to improve safe reinforcement learning, especially in safety-critical real-world tasks like autonomous driving.
Contribution
GOLD is a novel offline-to-online safe RL framework that effectively distills offline decision transformer policies into lightweight models for real-time safety-critical applications.
Findings
GOLD outperforms offline and online safe RL methods in benchmarks.
GOLD successfully applies to real-world autonomous driving scenarios.
Distilled policies meet real-time inference requirements.
Abstract
Safe Reinforcement Learning (RL) aims to find a policy that achieves high rewards while satisfying cost constraints. When learning from scratch, safe RL agents tend to be overly conservative, which impedes exploration and limits overall performance. In many realistic tasks, e.g., autonomous driving, large-scale expert demonstration data are available. We argue that extracting an expert policy from offline data to guide online exploration is a promising way to mitigate the conservativeness issue. Large-capacity models, e.g., decision transformers (DT), have proven competent in offline policy learning. However, data collected in real-world scenarios rarely contain dangerous cases (e.g., collisions), which makes it difficult for the policies to learn safety concepts. Besides, these bulky policy networks cannot meet the computation speed requirements at inference time on…
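The opening sentence of the abstract describes a constrained optimization problem. As a rough sketch, it can be written in the standard constrained-MDP form below. The notation is an assumption for illustration and is not taken from this page: \pi is the policy, r and c are the per-step reward and cost, \gamma is a discount factor, and \kappa is the cost budget.

```latex
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{subject to} \quad
J_c(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t) \right] \le \kappa
```

Under this reading, an overly conservative agent is one that keeps J_c(\pi) well below \kappa at the expense of J_r(\pi).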
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Occupational Health and Safety Research · Reinforcement Learning in Robotics
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
