Self Reward Design with Fine-grained Interpretability
Erico Tjoa, Guan Cuntai

TL;DR
This paper introduces Self Reward Design (SRD), a framework for creating interpretable neural networks in deep reinforcement learning, enabling transparent decision-making and addressing fairness concerns by aligning network components with human-understandable concepts.
Contribution
The paper presents SRD, a novel bottom-up neural network design that enhances interpretability and can be optimized like standard DNNs, applied to RL problems and semantic decision tasks.
Findings
SRD enables solving RL tasks with few parameters through deliberate human design.
SRD provides human-understandable decision-making in complex scenarios.
The framework demonstrates practical interpretability benefits in real-world examples.
Abstract
The black-box nature of deep neural networks (DNNs) has drawn attention to the issues of transparency and fairness. Deep Reinforcement Learning (Deep RL or DRL), which uses DNNs to learn its policy, value functions, etc., is thus subject to similar concerns. This paper proposes a way to circumvent these issues through the bottom-up design of neural networks with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to a human-understandable concept. The framework introduced in this paper is called Self Reward Design (SRD), inspired by Inverse Reward Design, and this interpretable design can (1) solve the problem by pure design (although imperfectly) and (2) be optimized like a standard DNN. With deliberate human designs, we show that some RL problems such as lavaland and MuJoCo can be solved using a model constructed with…
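To make the idea of "each neuron has its own meaning" concrete, here is a minimal toy sketch in Python. It is not the paper's architecture or environment: the concept names (`hungry`, `food_near`), the inputs (`energy`, `food_distance`), and the hand-picked weights are all hypothetical, chosen only to show how a network might act sensibly by design alone while its weights remain ordinary numeric parameters.

```python
# Toy illustration only: a hand-designed network whose weights are set by a
# human rather than learned. All names and values here are assumptions made
# for this sketch; none of them come from the paper.

def relu(x):
    """Standard rectified linear unit."""
    return max(0.0, x)

# Each named neuron maps to one human-readable concept.
# Inputs are assumed to lie in [0, 1].
WEIGHTS = {
    "hungry": {"w": -1.0, "b": 0.5},     # active when energy < 0.5
    "food_near": {"w": -1.0, "b": 1.0},  # stronger as food_distance shrinks
}

def forward(energy, food_distance, weights=WEIGHTS):
    """Return the chosen action and the named activations behind it."""
    hungry = relu(weights["hungry"]["w"] * energy + weights["hungry"]["b"])
    food_near = relu(
        weights["food_near"]["w"] * food_distance + weights["food_near"]["b"]
    )
    activations = {"hungry": hungry, "food_near": food_near}
    # Decision neuron: eat only when both concepts are active.
    eat_score = hungry * food_near
    action = "eat" if eat_score > 0.0 else "wait"
    return action, activations

if __name__ == "__main__":
    action, acts = forward(energy=0.2, food_distance=0.1)
    print(action, acts)
```

Because every intermediate value carries a name, the reason for a decision can be read directly from `activations`. The entries in `WEIGHTS` are plain numbers, so in principle they could also be adjusted by a standard optimizer, which mirrors point (2) of the abstract in spirit only.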
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
