Data-adaptive Safety Rules for Training Reward Models
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li

TL;DR
This paper introduces a dynamic, data-adaptive approach for selecting safety rules in training reward models for LLMs, improving safety performance by maximizing mutual information between annotations and true preferences.
Contribution
It proposes a novel mathematical framework for adaptive rule selection in RLHF, enhancing safety and efficiency in reward model training.
Findings
Achieved the highest safety performance on RewardBench as of Jan 2025.
Showed theoretically that adaptive rule selection maximizes the mutual information between rule-based annotations and true preferences.
Trained an 8B reward model surpassing larger models in safety metrics.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is commonly employed to tailor models to human preferences, especially to improve the safety of outputs from large language models (LLMs). Traditionally, this method depends on selecting preferred responses from pairs. However, due to the variability in human opinions and the challenges in directly comparing two responses, there is an increasing trend towards fine-grained annotation approaches that evaluate responses using multiple targeted metrics or rules. The challenge lies in efficiently choosing and applying these rules to handle the diverse range of preference data. In this paper, we propose a dynamic method that adaptively selects the most important rules for each response pair. We introduce a mathematical framework that utilizes the maximum discrepancy across paired responses and demonstrate theoretically that this approach…
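The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each response in a pair has already been scored on every safety rule, and that "maximum discrepancy" means keeping the rules with the largest absolute score gap between the two responses. The function name `select_rules_by_discrepancy`, the rule names, and the scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def select_rules_by_discrepancy(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    k: int,
) -> List[str]:
    """Return the k rules whose scores differ most between two responses.

    Hypothetical helper: the paper's actual scoring and selection
    procedure may differ from this absolute-gap heuristic.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both responses must be scored on the same rules")
    if k < 0:
        raise ValueError("k must be non-negative")

    # Absolute per-rule gap between the two responses.
    gaps = {rule: abs(scores_a[rule] - scores_b[rule]) for rule in scores_a}

    # Largest gap first; ties are broken by rule name for determinism.
    ranked = sorted(gaps, key=lambda rule: (-gaps[rule], rule))
    return ranked[:k]


# Illustrative (made-up) rule scores for one response pair.
response_a = {"no_violence": 0.9, "no_privacy_leak": 0.8, "no_self_harm": 0.7}
response_b = {"no_violence": 0.2, "no_privacy_leak": 0.75, "no_self_harm": 0.9}

selected = select_rules_by_discrepancy(response_a, response_b, k=2)
print(selected)
```

With these made-up scores, the rule on which the two responses agree most closely (`no_privacy_leak`) is dropped, since it carries little information about which response is preferable.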
Taxonomy
Topics: Software Reliability and Analysis Research · Risk and Safety Analysis · Safety Systems Engineering in Autonomy
