The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
Kenneth Eaton, Jonathan Balloch, Julia Kim, Mark Riedl

TL;DR
This paper examines whether vector quantization in model-based reinforcement learning actually provides interpretability. It finds that the learned codes are inconsistent across models, are not guaranteed to be unique, and do little to disentangle the model's concepts.
Contribution
The study provides empirical evidence that vector quantization does not reliably enhance interpretability in model-based reinforcement learning.
Findings
Codes are inconsistent across models
No guarantee of code uniqueness
Limited impact on concept disentanglement
Abstract
Interpretability of deep reinforcement learning systems could help operators understand how those systems interact with their environment. Vector quantization methods -- also called codebook methods -- discretize a neural network's latent space, a step that is often suggested to yield emergent interpretability. We investigate whether vector quantization in fact provides interpretability in model-based reinforcement learning. Our experiments, conducted in the reinforcement learning environment Crafter, show that the codes of vector quantization models are inconsistent, have no guarantee of uniqueness, and have a limited impact on concept disentanglement. Consistency, uniqueness, and disentanglement are all necessary traits for interpretability. We share insights on why vector quantization may be fundamentally insufficient for model interpretability.
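For readers unfamiliar with codebook methods, the discretization step described in the abstract can be sketched as a nearest-neighbor lookup. This is a minimal NumPy illustration, not the authors' implementation: the function name `quantize`, the toy codebook, and the latent values are all hypothetical, and a real model would learn the codebook jointly with its encoder.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Pairwise differences via broadcasting: shape (N, K, D)
    diffs = latents[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distance from every latent to every code: shape (N, K)
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each latent: shape (N,)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Hypothetical toy codebook: K=3 codes in a 2-D latent space.
# Codes 1 and 2 are deliberately almost identical.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 1.05]])
latents = np.array([[0.1, -0.1], [0.9, 1.0], [1.0, 1.06]])

indices, quantized = quantize(latents, codebook)
print(indices)
```

In this toy setup the second and third latents are assigned different code indices even though their codes are nearly the same vector. That is one way to picture why a distinct index need not correspond to a distinct concept, though the paper's own analysis should be consulted for how uniqueness is actually measured.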
Taxonomy
Topics: Formal Methods in Verification · Reinforcement Learning in Robotics · Software Engineering Research
