Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?
Akansha Kalra, Daniel S. Brown

TL;DR
This paper uses Differentiable Decision Trees (DDTs) to learn interpretable reward functions from human preferences, making it easier to check whether a learned reward is aligned with those preferences while keeping reinforcement learning performance competitive across several domains.
Contribution
The paper proposes a reward learning method based on Differentiable Decision Trees that makes reward functions learned from human feedback easier to interpret and diagnose.
Findings
Reward DDTs can achieve competitive RL performance.
Tree structure helps assess reward alignment with human preferences.
Soft vs. hard outputs reveal a trade-off between interpretability and reward shaping.
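The soft versus hard distinction in the last finding can be illustrated with a minimal sketch. This page does not describe the tree architecture, so everything below is an assumption made for illustration: a depth-2 tree with sigmoid routing at each internal node, scalar leaf rewards, a "soft" output that averages leaf values by routing probability, and a "hard" output that returns only the most probable leaf. The names `leaf_path_probs`, `soft_reward`, and `hard_reward` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def leaf_path_probs(x, W, b):
    """Probability of reaching each of the 4 leaves of a depth-2 tree.

    W has shape (3, d) and b has shape (3,): one linear split per internal
    node, ordered root, left child, right child (an assumed layout).
    """
    p = sigmoid(W @ x + b)  # p[i] = probability of branching left at node i
    return np.array([
        p[0] * p[1],              # root left,  then left
        p[0] * (1 - p[1]),        # root left,  then right
        (1 - p[0]) * p[2],        # root right, then left
        (1 - p[0]) * (1 - p[2]),  # root right, then right
    ])


def soft_reward(x, W, b, leaf_values):
    """Soft output: leaf values weighted by their routing probabilities."""
    return float(leaf_path_probs(x, W, b) @ leaf_values)


def hard_reward(x, W, b, leaf_values):
    """Hard output: value of the single most probable leaf."""
    return float(leaf_values[np.argmax(leaf_path_probs(x, W, b))])
```

Under these assumptions the hard output can be read off as one root-to-leaf path, which is what makes it easy to inspect, while the soft output blends all leaves and so varies smoothly with the input. That is one plausible reading of the trade-off named above, not a statement of the authors' implementation.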
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for capturing human intent to alleviate the challenges of hand-crafting reward values. Despite the increasing interest in RLHF, most works learn black-box reward functions that, while expressive, are difficult to interpret and often require running the whole costly process of RL before we can even tell whether these frameworks are actually aligned with human preferences. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, Visual Gridworld environments, and Atari games, provide evidence that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences.…
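Learning a reward function from pairwise preferences is often done by minimizing a cross-entropy loss under the Bradley-Terry model. The abstract does not state which loss the authors use, so the sketch below is an assumption that only illustrates the general idea. The function name `preference_loss` and its arguments are hypothetical; each "return" is taken to be the sum of predicted rewards along one trajectory.

```python
import numpy as np


def preference_loss(returns_a, returns_b, prefers_b):
    """Mean Bradley-Terry cross-entropy over a batch of trajectory pairs.

    returns_a, returns_b: predicted returns for the two trajectories in
        each pair (assumed to be sums of per-step predicted rewards).
    prefers_b: 1 where the human preferred trajectory b, 0 where a.
    """
    returns_a = np.asarray(returns_a, dtype=float)
    returns_b = np.asarray(returns_b, dtype=float)
    prefers_b = np.asarray(prefers_b)

    # Margin in favour of the trajectory the human actually preferred.
    diff = returns_b - returns_a
    signed = np.where(prefers_b == 1, diff, -diff)

    # -log(sigmoid(signed)) computed stably as log(1 + exp(-signed)).
    return float(np.mean(np.logaddexp(0.0, -signed)))
```

When the two predicted returns are equal the loss is log 2 per pair, and it shrinks as the preferred trajectory's predicted return rises above the other one.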
Taxonomy
Topics: Mental Health Research Topics · Explainable Artificial Intelligence (XAI)
