Reward Learning with Trees: Methods and Evaluation

Tom Bewley; Jonathan Lawry; Arthur Richards; Rachel Craddock; Ian Henderson

arXiv:2210.01007 · cs.LG · October 4, 2022

TL;DR

This paper investigates using interpretable reward trees instead of neural networks for learning reward functions from human feedback, demonstrating competitive performance and enhanced transparency for verification and explanation.

Contribution

It develops a recently proposed method for learning reward trees from preference labels and shows that the resulting trees are both effective and interpretable on challenging high-dimensional tasks.

Findings

01. Reward trees perform comparably to neural networks on complex tasks.

02. Reward trees offer better transparency and traceability.

03. The method is robust to limited or corrupted data.

Abstract

Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.
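
To make the idea of a reward tree concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a binary tree with axis-aligned threshold tests and a constant reward at each leaf, and it scores a pair of trajectories with a logistic (Bradley-Terry style) preference model on the difference in summed rewards. That preference model is a common choice in preference-based reward learning and may differ from the one used in the paper. All names (`Node`, `tree_reward`, `trajectory_return`, `preference_probability`) and the toy numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a hypothetical reward tree (illustrative names only)."""
    reward: float = 0.0              # constant reward, used only at leaves
    feature: Optional[int] = None    # feature index tested at internal nodes
    threshold: float = 0.0           # split point for that feature
    left: Optional["Node"] = None    # branch taken when x[feature] < threshold
    right: Optional["Node"] = None   # branch taken when x[feature] >= threshold


def tree_reward(node: Node, x: Sequence[float]) -> float:
    """Follow threshold tests from the root to a leaf and return its reward."""
    while node.feature is not None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.reward


def trajectory_return(tree: Node, traj: Sequence[Sequence[float]]) -> float:
    """Sum the tree's reward over every feature vector in a trajectory."""
    return sum(tree_reward(tree, x) for x in traj)


def preference_probability(tree: Node, traj_a, traj_b) -> float:
    """Assumed logistic model: probability that traj_a is preferred to traj_b."""
    diff = trajectory_return(tree, traj_a) - trajectory_return(tree, traj_b)
    return 1.0 / (1.0 + math.exp(-diff))


# Toy tree with a single split on feature 0 at 0.5 (made-up values).
tree = Node(feature=0, threshold=0.5,
            left=Node(reward=-1.0), right=Node(reward=1.0))
traj_a = [[0.9], [0.8]]   # both steps reach the +1 leaf, return = 2.0
traj_b = [[0.1], [0.7]]   # one step in each leaf, return = 0.0
p_a_over_b = preference_probability(tree, traj_a, traj_b)
print(round(p_a_over_b, 3))  # 0.881
```

Because every prediction is a short chain of threshold tests ending in a leaf, the reward assigned to any state can be traced back to a readable rule, which is the kind of traceability the abstract refers to.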

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Bayesian Modeling and Causal Inference