Interpretable Reward Modeling with Active Concept Bottlenecks
Sonia Laguna, Katarzyna Kobalczyk, Julia E. Vogt, Mihaela Van der Schaar

TL;DR
This paper presents Concept Bottleneck Reward Models (CB-RM), a framework that improves interpretability and efficiency in reward modeling by decomposing rewards into human-understandable concepts and actively selecting the most informative labels.
Contribution
Introduces CB-RM with an active learning strategy based on Expected Information Gain to enhance interpretability and sample efficiency in reward modeling.
Findings
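On the UltraFeedback dataset, CB-RM outperforms baselines in interpretability and sample efficiency.
Active concept acquisition based on Expected Information Gain significantly accelerates concept learning.
The method maintains preference accuracy while requiring fewer labels.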
Abstract
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
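The two ideas in the abstract, a reward decomposed into concepts and label acquisition driven by Expected Information Gain, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: concepts are treated as binary, the reward is a linear function of concept scores, preferences follow a Bradley-Terry model, and the information gain is estimated BALD-style from posterior samples of concept probabilities. The function names (`expected_information_gain`, `select_query`, `preference_probability`) are hypothetical.

```python
import numpy as np


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, computed elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def expected_information_gain(probs):
    """BALD-style estimate of the information gained by labeling a concept.

    probs has shape (S, N, K): S posterior samples (assumed here to come
    from, e.g., an ensemble) of the probability that concept k is present
    for candidate n. Returns an (N, K) array: entropy of the averaged
    prediction minus the average entropy of the individual predictions.
    """
    marginal = probs.mean(axis=0)
    return binary_entropy(marginal) - binary_entropy(probs).mean(axis=0)


def select_query(probs, labeled_mask):
    """Pick the unlabeled (candidate, concept) pair with the highest estimate."""
    eig = expected_information_gain(probs)
    eig = np.where(labeled_mask, -np.inf, eig)  # never re-query labeled pairs
    n, k = np.unravel_index(np.argmax(eig), eig.shape)
    return int(n), int(k)


def preference_probability(concepts_a, concepts_b, weights):
    """Bradley-Terry probability that response A is preferred over B,
    assuming the reward is a weighted sum of concept scores."""
    diff = (concepts_a - concepts_b) @ weights
    return 1.0 / (1.0 + np.exp(-diff))


# Two posterior samples, two candidates, two concepts.
# The samples disagree only on candidate 1, concept 0.
probs = np.array([
    [[0.5, 0.5], [0.9, 0.9]],
    [[0.5, 0.5], [0.1, 0.9]],
])
labeled = np.zeros((2, 2), dtype=bool)
query = select_query(probs, labeled)
```

In this toy input the selected pair is the one where the posterior samples disagree, not the one where every sample is merely uncertain: disagreement is what a new label can resolve. Because the reward is a weighted sum of concept scores, each weight can be read as the contribution of one concept to the preference.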
