Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

Quentin Delfosse; Sebastian Sztwiertnia; Mark Rothermel; Wolfgang Stammer; Kristian Kersting

arXiv:2401.05821 · cs.LG · October 30, 2024 · 1 citation

TL;DR

SCoBots introduce successive concept bottleneck layers that incorporate object relations, enhancing the interpretability and alignment of reinforcement learning agents; the authors show that these layers help experts understand and correct agent behavior in complex tasks.

Contribution

The paper proposes SCoBots, a novel RL model with layered concept bottlenecks that include object relations, improving interpretability and making it easier for domain experts to inspect and intervene in the agent's behavior.

Findings

01

SCoBots achieve competitive performance in RL tasks.

02

SCoBots enable identification and correction of misalignments in Pong.

03

SCoBots improve interpretability and human alignment of RL agents.

Abstract

Goal misalignment, reward sparsity and difficult credit assignment are only a few of the many issues that make it difficult for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep neural networks impedes the inclusion of domain experts for inspecting the model and revising suboptimal policies. To this end, we introduce *Successive Concept Bottleneck Agents* (SCoBots), which integrate consecutive concept bottleneck (CB) layers. In contrast to current CB models, SCoBots do not just represent concepts as properties of individual objects, but also as relations between objects, which is crucial for many RL tasks. Our experimental results provide evidence of SCoBots' competitive performance, but also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a…
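The abstract describes the successive-bottleneck idea only at a high level. The following is a minimal Python sketch of that idea under stated assumptions, not the authors' implementation (the official code lives in the k4ntz/scobots repository). It assumes objects have already been extracted with a name and an (x, y) screen position; the relation set (dx, dy, Euclidean distance), the `prune` intervention and the single-rule `policy` are hypothetical stand-ins chosen for illustration. The point it shows is that every intermediate value is a named, human-readable concept, so an expert can inspect or remove any of them before an action is chosen.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class GameObject:
    """An extracted object; assumed to be given (e.g. by an object detector)."""
    name: str
    x: float
    y: float


def object_concepts(objects):
    """First bottleneck: named property concepts of individual objects."""
    concepts = {}
    for obj in objects:
        concepts[f"{obj.name}.x"] = obj.x
        concepts[f"{obj.name}.y"] = obj.y
    return concepts


def relational_concepts(objects):
    """Second bottleneck: named relations between object pairs.

    The relation set (dx, dy, Euclidean distance) is a hypothetical choice.
    """
    concepts = {}
    for a, b in combinations(objects, 2):
        dx, dy = b.x - a.x, b.y - a.y
        concepts[f"dx({a.name},{b.name})"] = dx
        concepts[f"dy({a.name},{b.name})"] = dy
        concepts[f"dist({a.name},{b.name})"] = math.hypot(dx, dy)
    return concepts


def prune(concepts, removed):
    """Hypothetical expert intervention: drop concepts the expert deems irrelevant."""
    return {name: value for name, value in concepts.items() if name not in removed}


def policy(concepts):
    """Hypothetical interpretable policy: one readable rule over a named concept.

    Assumes screen coordinates, where y grows downward.
    """
    dy = concepts.get("dy(player,ball)")
    if dy is None or dy == 0:
        return "NOOP"
    return "DOWN" if dy > 0 else "UP"


def scobot_step(objects, removed=frozenset()):
    """Objects -> object concepts -> relational concepts -> pruning -> action."""
    concepts = {**object_concepts(objects), **relational_concepts(objects)}
    concepts = prune(concepts, removed)
    return policy(concepts), concepts


if __name__ == "__main__":
    frame = [GameObject("player", 140.0, 100.0), GameObject("ball", 80.0, 130.0)]
    action, concepts = scobot_step(frame)
    print(action)  # DOWN, since dy(player,ball) = 130 - 100 = 30 > 0
    print(sorted(concepts))
```

Because the rule reads the named concept `dy(player,ball)`, calling `scobot_step(frame, removed={"dy(player,ball)"})` makes this sketch fall back to `NOOP`. This mimics, in a greatly simplified way, how removing a concept from the bottleneck restricts which information the policy is allowed to use.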

Code & Models

Repositories

k4ntz/scobots (PyTorch, official)

Videos

Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents · SlidesLive

Taxonomy

Topics: Data Stream Mining Techniques · Explainable Artificial Intelligence (XAI)