Reward Learning with Trees: Methods and Evaluation
Tom Bewley, Jonathan Lawry, Arthur Richards, Rachel Craddock, Ian Henderson

TL;DR
This paper investigates using interpretable reward trees instead of neural networks for learning reward functions from human feedback, demonstrating competitive performance and enhanced transparency for verification and explanation.
Contribution
It develops a recently proposed method for learning reward trees from preference labels and shows that the resulting trees are both effective and interpretable on challenging high-dimensional tasks.
Findings
Reward trees perform comparably to neural networks on complex tasks.
Reward trees are more transparent than neural networks, giving scope for traceability, verification and explanation.
The method is robust to limited or corrupted data.
Abstract
Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Bayesian Modeling and Causal Inference
