Rule Based Rewards for Language Model Safety
Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng

TL;DR
This paper introduces Rule Based Rewards (RBR), a novel method for improving language model safety by using rule-based AI feedback and minimal human data, leading to more accurate safety behavior control.
Contribution
The paper presents RBR, a new approach that combines rule-based AI feedback with few-shot prompts for safer language model fine-tuning, reducing reliance on extensive human-labeled data.
Findings
RBR achieves an F1 score of 97.1 in safety behavior evaluation.
RBR outperforms the human-feedback baseline, which reaches an F1 score of 91.7.
RBR provides better control and ease of updating safety behaviors.
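For readers unfamiliar with the metric, F1 is the harmonic mean of two rates, conventionally precision and recall. This summary does not say which two quantities the paper combines, so the inputs in the sketch below are hypothetical and only illustrate the arithmetic.

```python
def f1_score(rate_a: float, rate_b: float) -> float:
    """Harmonic mean of two rates in [0, 1].

    The argument names are placeholders: this summary does not specify
    which two quantities the reported F1 scores are computed from.
    """
    if rate_a + rate_b == 0:
        return 0.0  # avoid division by zero when both rates are zero
    return 2 * rate_a * rate_b / (rate_a + rate_b)


# Hypothetical inputs, not figures from the paper.
example = f1_score(0.9, 0.8)
print(round(100 * example, 1))
```

Because the harmonic mean is pulled toward the smaller input, a high F1 requires both underlying rates to be high at once.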
Abstract
Reinforcement learning based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a costly need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that utilizes AI feedback and only requires a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. refusals should not be judgmental) along with a LLM grader. In contrast to prior methods using AI feedback, our method uses fine-grained,…
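The idea of pairing a collection of rules with a grader can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Rule` type, the rule names and weights, the weighted-sum combination, and the keyword-based `toy_grader` (a stand-in for the LLM grader) are all assumptions, since the truncated abstract does not say how per-rule judgments are combined.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str          # hypothetical identifier for the rule
    description: str   # natural-language statement a grader would check
    weight: float      # assumed: positive for desired, negative for undesired


# A grader maps (rule, response) to a score in [0, 1].
Grader = Callable[[Rule, str], float]


def toy_grader(rule: Rule, response: str) -> float:
    """Keyword stand-in for an LLM grader; for illustration only."""
    text = response.lower()
    if rule.name == "refuses":
        return 1.0 if "can't help" in text else 0.0
    if rule.name == "judgmental":
        return 1.0 if "you should be ashamed" in text else 0.0
    return 0.0


def rule_based_reward(response: str, rules: List[Rule], grader: Grader) -> float:
    """Assumed combination: weighted sum of per-rule grader scores."""
    return sum(rule.weight * grader(rule, response) for rule in rules)


RULES = [
    Rule("refuses", "The response declines the request.", 1.0),
    Rule("judgmental", "The response judges the user.", -2.0),
]

polite = "Sorry, I can't help with that."
judgy = "I can't help with that, and you should be ashamed for asking."

print(rule_based_reward(polite, RULES, toy_grader))
print(rule_based_reward(judgy, RULES, toy_grader))
```

In this sketch, changing a safety behavior means editing a rule or its weight rather than relabeling preference data, which is the kind of updatability the summary attributes to RBR.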
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Access Control and Trust · Software Reliability and Analysis Research
