Designing Interpretable Approximations to Deep Reinforcement Learning
Nathan Dahlin, Krishna Chaitanya Kalagarla, Nikhil Naik, Rahul Jain, Pierluigi Nuzzo

TL;DR
This paper explores creating simpler, interpretable models that approximate deep reinforcement learning systems, aiming to balance performance with explainability and efficiency.
Contribution
It proposes an approach for identifying reduced models that preserve a desired performance level while succinctly explaining the knowledge captured by a DNN, and evaluates decision tree variants and kernel machines on benchmark reinforcement learning tasks.
Findings
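Reduced models can preserve a desired performance level relative to the original DNN.
Reduced models can also succinctly explain the latent knowledge represented by the DNN.
The approach is illustrated with decision tree variants and kernel machines on benchmark RL tasks.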
Abstract
In an ever expanding set of research and application areas, deep neural networks (DNNs) set the bar for algorithm performance. However, depending upon additional constraints such as processing power and execution time limits, or requirements such as verifiable safety guarantees, it may not be feasible to actually use such high-performing DNNs in practice. Many techniques have been developed in recent years to compress or distill complex DNNs into smaller, faster or more understandable models and controllers. This work seeks to identify reduced models that not only preserve a desired performance level, but also, for example, succinctly explain the latent knowledge represented by a DNN. We illustrate the effectiveness of the proposed approach on the evaluation of decision tree variants and kernel machines in the context of benchmark reinforcement learning tasks.
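As a rough illustration of what distilling a policy into a decision tree can look like, the sketch below fits a small axis-aligned tree to the actions of a teacher policy and measures how often the two agree. It is not the authors' method: `teacher_policy` is a hypothetical stand-in for a trained DNN, the two-dimensional state and binary action space are assumptions, and the greedy Gini-based fit is a generic CART-style procedure chosen for brevity.

```python
import math
import random


def teacher_policy(state):
    """Hypothetical stand-in for a trained DNN policy (assumption, not from the paper)."""
    x, v = state
    return 1 if math.tanh(1.5 * x + 0.8 * v) > 0 else 0


def gini(labels):
    """Gini impurity of a list of binary action labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_tree(states, actions, depth):
    """Greedy CART-style fit; returns a leaf action or (feature, threshold, left, right)."""
    majority = 1 if 2 * sum(actions) >= len(actions) else 0
    if depth == 0 or len(set(actions)) == 1:
        return majority
    best = None
    for feat in range(len(states[0])):
        for thr in sorted({s[feat] for s in states}):
            left = [a for s, a in zip(states, actions) if s[feat] <= thr]
            right = [a for s, a in zip(states, actions) if s[feat] > thr]
            if not left or not right:
                continue
            # Weighted impurity of the two children; lower is better.
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, feat, thr)
    if best is None:
        return majority
    _, feat, thr = best
    li = [i for i, s in enumerate(states) if s[feat] <= thr]
    ri = [i for i, s in enumerate(states) if s[feat] > thr]
    return (
        feat,
        thr,
        fit_tree([states[i] for i in li], [actions[i] for i in li], depth - 1),
        fit_tree([states[i] for i in ri], [actions[i] for i in ri], depth - 1),
    )


def predict(tree, state):
    """Walk the tree from the root until a leaf action is reached."""
    while isinstance(tree, tuple):
        feat, thr, left, right = tree
        tree = left if state[feat] <= thr else right
    return tree


def fidelity(tree, states):
    """Fraction of states on which the tree picks the same action as the teacher."""
    agree = sum(predict(tree, s) == teacher_policy(s) for s in states)
    return agree / len(states)


rng = random.Random(0)
train = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(300)]
test = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(300)]
labels = [teacher_policy(s) for s in train]  # query the teacher for target actions

tree = fit_tree(train, labels, depth=3)
print(f"held-out fidelity to teacher: {fidelity(tree, test):.2f}")
```

Agreement with the teacher is only a proxy. Checking that a reduced model preserves a desired performance level, as the abstract describes, would require rolling out the tree in the environment and comparing returns.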
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
