Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions
Tom Bewley, Freddy Lecue

TL;DR
This paper introduces a method for learning interpretable, tree-structured reward functions in preference-based reinforcement learning (PbRL). The method learns online and actively from sparse feedback, and the resulting tree can be inspected to assess and debug alignment.
Contribution
The paper presents an online, active preference learning algorithm that constructs reward functions with the interpretable, compositional structure of a tree, making learned rewards in PbRL easier to inspect and debug.
Findings
Sample-efficient learning of tree-structured rewards from synthetic and human feedback.
Enhanced interpretability facilitates exploration and debugging for alignment.
Demonstrated effectiveness across multiple environments.
Abstract
The potential of reinforcement learning (RL) to deliver aligned and performant agents is partially bottlenecked by the reward engineering problem. One alternative to heuristic trial-and-error is preference-based RL (PbRL), where a reward function is inferred from sparse human feedback. However, prior PbRL methods lack interpretability of the learned reward structure, which hampers the ability to assess robustness and alignment. We propose an online, active preference learning algorithm that constructs reward functions with the intrinsically interpretable, compositional structure of a tree. Using both synthetic and human-provided feedback, we demonstrate sample-efficient learning of tree-structured reward functions in several environments, then harness the enhanced interpretability to explore and debug for alignment.
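To make the idea of a tree-structured reward function concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a reward tree is a binary tree of threshold tests on state features with a constant reward at each leaf, and that a trajectory is scored by summing the leaf rewards of its steps. The names `Node`, `tree_reward`, `trajectory_return`, and `predicted_preference`, as well as the example features and thresholds, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """A node in a hypothetical reward tree.

    Internal nodes test x[feature] < threshold; leaves carry a constant reward.
    """
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None    # followed when x[feature] < threshold
    right: Optional["Node"] = None   # followed otherwise
    reward: Optional[float] = None   # set only on leaves


def tree_reward(node: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf and return that leaf's reward."""
    while node.reward is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.reward


def trajectory_return(tree: Node, trajectory: Sequence[Sequence[float]]) -> float:
    """Sum the per-step tree rewards over a trajectory of feature vectors."""
    return sum(tree_reward(tree, x) for x in trajectory)


def predicted_preference(tree: Node, traj_a, traj_b) -> str:
    """Predict which trajectory would be preferred under the tree's returns."""
    ra, rb = trajectory_return(tree, traj_a), trajectory_return(tree, traj_b)
    if ra > rb:
        return "a"
    if rb > ra:
        return "b"
    return "tie"


# Assumed example tree over two features (values are made up):
#   x[0] < 1.0              -> reward -1.0
#   x[0] >= 1.0, x[1] < 0.5 -> reward  0.0
#   x[0] >= 1.0, x[1] >= 0.5 -> reward  2.0
example_tree = Node(
    feature=0,
    threshold=1.0,
    left=Node(reward=-1.0),
    right=Node(
        feature=1,
        threshold=0.5,
        left=Node(reward=0.0),
        right=Node(reward=2.0),
    ),
)
```

Because every reward value is reached by a short chain of readable threshold tests, a person can trace why a given state received its reward, which is the kind of inspection the abstract refers to as exploring and debugging for alignment. How the tree is grown from preference feedback is not covered by this sketch.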
Taxonomy
Topics: Reinforcement Learning in Robotics
