Implicit Safety Alignment from Crowd Preferences

Qian Lin; Daniel S. Brown

arXiv:2605.21822·cs.AI·May 22, 2026


TL;DR

This paper introduces a hierarchical framework that learns shared safety criteria from crowd preferences to improve safety in reinforcement learning tasks without explicit safety rewards.

Contribution

It proposes Safe Crowd Preference-based RL, which extracts safety-aligned skills from crowd data and composes them to enhance safety in downstream tasks.

Findings

01. Reduces safety costs significantly in experiments.

02. Achieves task performance comparable to methods with ground-truth safety signals.

03. Identifies limitations of reward combination methods for safety objectives.

Abstract

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives, such as safety considerations, that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments…
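To make the "direct reward combination" baseline concrete, here is a minimal sketch. It is not the paper's implementation: it assumes the Bradley-Terry preference model that is commonly used in preference-based RL, and the function names and the `weight` parameter are hypothetical, chosen only for illustration.

```python
import math


def bradley_terry_prob(return_a: float, return_b: float) -> float:
    """Probability that segment A is preferred over segment B.

    Assumes a Bradley-Terry model over the summed learned rewards of
    each segment (an assumption; the abstract does not name the model).
    """
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def preference_loss(return_a: float, return_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of one crowd preference label."""
    p_a = bradley_terry_prob(return_a, return_b)
    return -math.log(p_a if a_preferred else 1.0 - p_a)


def combined_reward(task_reward: float, preference_reward: float,
                    weight: float = 1.0) -> float:
    """Direct reward combination: task reward plus a weighted learned reward.

    `weight` is a hypothetical trade-off coefficient.
    """
    return task_reward + weight * preference_reward
```

In this sketch the downstream agent would simply maximize `combined_reward`; the abstract argues that this kind of combination has inherent limitations, which motivates the hierarchical, skill-based alternative.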
