Principal Prototype Analysis on Manifold for Interpretable Reinforcement Learning
Bodla Krishna Vamshi, Haizhao Yang

TL;DR
This paper introduces a method for automatic prototype selection in reinforcement learning to improve interpretability without sacrificing performance, building on prototype-based explainability methods.
Contribution
It proposes an automatic prototype selection technique that removes the need for manual reference prototypes in explainability methods for RL.
Findings
Matches the performance of existing Prototype-Wrapper Networks (PW-Nets) in standard Gym environments.
Remains competitive with original black-box models.
Enhances interpretability in RL without performance loss.
Abstract
Recent years have witnessed the widespread adoption of reinforcement learning (RL), from solving real-time games to fine-tuning large language models using human preference data, significantly improving alignment with user expectations. However, as model complexity grows exponentially, the interpretability of these systems becomes increasingly challenging. While numerous explainability methods have been developed for computer vision and natural language processing to elucidate both local and global reasoning patterns, their application to RL remains limited. Direct extensions of these methods often struggle to maintain the delicate balance between interpretability and performance within RL settings. Prototype-Wrapper Networks (PW-Nets) have recently shown promise in bridging this gap by enhancing explainability in RL domains without sacrificing the efficiency of the original black-box…
