Interpretable Reward Modeling with Active Concept Bottlenecks

Sonia Laguna; Katarzyna Kobalczyk; Julia E. Vogt; Mihaela van der Schaar

arXiv:2507.04695 · cs.LG · July 22, 2025

TL;DR

This paper presents Concept Bottleneck Reward Models (CB-RM), a framework that improves interpretability and efficiency in reward modeling by decomposing rewards into human-understandable concepts and actively selecting the most informative labels.
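The decomposition described above can be illustrated with a minimal sketch. It assumes a linear head over concept scores and a Bradley-Terry preference likelihood; neither is confirmed by this summary, and the concept names, weights, and function names are hypothetical.

```python
import numpy as np

# Hypothetical concept names, for illustration only.
CONCEPTS = ["helpfulness", "honesty", "verbosity"]

def concept_bottleneck_reward(concept_scores, weights):
    """Scalar reward as a weighted sum of concept scores (linear head is an assumption)."""
    return float(np.asarray(concept_scores, dtype=float) @ np.asarray(weights, dtype=float))

def preference_probability(concepts_a, concepts_b, weights):
    """Bradley-Terry probability that response A is preferred over response B."""
    diff = (concept_bottleneck_reward(concepts_a, weights)
            - concept_bottleneck_reward(concepts_b, weights))
    return 1.0 / (1.0 + np.exp(-diff))

def concept_contributions(concepts_a, concepts_b, weights):
    """Per-concept share of the reward gap; the entries sum to r(A) - r(B)."""
    gap = np.asarray(concepts_a, dtype=float) - np.asarray(concepts_b, dtype=float)
    return np.asarray(weights, dtype=float) * gap

weights = np.array([1.0, 2.0, -0.5])     # made-up weights, not learned values
concepts_a = np.array([0.8, 0.9, 0.2])   # made-up concept scores for response A
concepts_b = np.array([0.6, 0.4, 0.7])   # made-up concept scores for response B

p_a_over_b = preference_probability(concepts_a, concepts_b, weights)
for name, value in zip(CONCEPTS, concept_contributions(concepts_a, concepts_b, weights)):
    print(f"{name}: {value:+.2f}")
print(f"P(A preferred over B) = {p_a_over_b:.3f}")
```

Because the reward gap is a sum of per-concept terms, each predicted preference can be audited concept by concept, which is the sense in which a bottleneck of this kind is interpretable.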

Contribution

Introduces CB-RM with an active learning strategy based on Expected Information Gain to enhance interpretability and sample efficiency in reward modeling.

Findings

01

On the UltraFeedback dataset, CB-RM outperforms baselines in interpretability.

02

Active concept acquisition based on Expected Information Gain accelerates concept learning.

03

The method achieves high preference accuracy with fewer concept labels.

Abstract

We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
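To make the acquisition idea concrete, here is a minimal sketch of an Expected Information Gain score for binary concept labels. It uses a common ensemble-based estimator (the mutual information between a label and the model parameters, often called BALD), which is an assumption: the abstract does not give the paper's exact acquisition function or say what quantity the gain is measured against. All function names and numbers are hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in nats of Bernoulli(p), clipped for numerical stability."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def expected_information_gain(member_probs):
    """Estimate EIG for each candidate concept label.

    member_probs has shape (n_members, n_candidates); each entry is
    P(concept = 1) for one candidate under one posterior sample, for
    example one ensemble member (an assumed setup).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    marginal = member_probs.mean(axis=0)
    # Entropy of the averaged prediction minus the average member entropy.
    return binary_entropy(marginal) - binary_entropy(member_probs).mean(axis=0)

def select_queries(member_probs, budget):
    """Indices of the `budget` candidates with the highest estimated EIG."""
    scores = expected_information_gain(member_probs)
    return np.argsort(-scores)[:budget]

# Two ensemble members, three candidate concept labels (made-up numbers).
member_probs = np.array([
    [0.9, 0.5, 0.95],
    [0.1, 0.5, 0.95],
])
print(expected_information_gain(member_probs))
print(select_queries(member_probs, budget=1))
```

In this toy case the first candidate scores highest because the members disagree about it. The second is uncertain but every member agrees on that uncertainty, and the third is confidently predicted, so labeling either is estimated to teach the model nothing.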
