Conditional Autoregressors are Interpretable Classifiers
Nathan Elazar

TL;DR
This paper demonstrates that class-conditional autoregressive models can serve as inherently interpretable classifiers for image data and that, with knowledge distillation, they can achieve competitive accuracy.
Contribution
It introduces class-conditional autoregressive (CA) models as image classifiers and shows how to train them effectively so that they remain interpretable without sacrificing performance.
Findings
CA models are inherently locally interpretable.
Naive training of CA models results in poor accuracy due to overfitting.
Knowledge distillation enables CA models to match standard classifiers' performance.
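The distillation finding can be illustrated with a minimal sketch. The summary does not state which distillation loss the paper uses, so the code below assumes the common soft-target formulation: the student's per-class scores (for a CA model, assumed here to be the class log-likelihoods) are pushed toward the teacher's softened class probabilities. The function names and the `temperature` parameter are hypothetical and chosen for illustration only.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of per-class scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def distillation_loss(student_scores, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's predictions against the teacher's
    softened class probabilities (assumed loss, not confirmed by the source)."""
    teacher_log_p = log_softmax(teacher_logits, temperature)
    student_log_p = log_softmax(student_scores, temperature)
    return -sum(math.exp(t) * s for t, s in zip(teacher_log_p, student_log_p))


# Hypothetical scores for a two-class problem.
agree = distillation_loss([2.0, 0.0], [2.0, 0.0])
disagree = distillation_loss([0.0, 2.0], [2.0, 0.0])
```

The loss is smallest when the student reproduces the teacher's distribution, which is why `agree` is lower than `disagree` in this toy case.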
Abstract
We explore the use of class-conditional autoregressive (CA) models to perform image classification on MNIST-10. Autoregressive models assign probability to an entire input by combining probabilities from each individual feature; hence classification decisions made by a CA can be readily decomposed into contributions from each input feature. That is to say, CA are inherently locally interpretable. Our experiments show that naively training a CA achieves much worse accuracy than a standard classifier; however, this is due to over-fitting and not a lack of expressive power. Using knowledge distillation from a standard classifier, a student CA can be trained to match the performance of the teacher while still being interpretable.
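The decomposition described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `pixel_log_probs[c][i]` is assumed to hold the model's conditional log-probability of feature i given the preceding features and class c, the class prior is assumed uniform unless supplied, and all function names are hypothetical.

```python
import math


def class_log_likelihoods(pixel_log_probs):
    """Sum the per-feature conditional log-probabilities for each class."""
    return [sum(row) for row in pixel_log_probs]


def posterior(pixel_log_probs, log_prior=None):
    """Class posterior via Bayes' rule; uniform prior if none is given."""
    scores = class_log_likelihoods(pixel_log_probs)
    if log_prior is not None:
        scores = [s + lp for s, lp in zip(scores, log_prior)]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]


def pixel_contributions(pixel_log_probs, class_a, class_b):
    """Per-feature contribution to the log-odds of class_a over class_b."""
    row_a, row_b = pixel_log_probs[class_a], pixel_log_probs[class_b]
    return [a - b for a, b in zip(row_a, row_b)]


# Toy input: 2 classes, 3 features, made-up conditional probabilities.
toy = [
    [math.log(0.9), math.log(0.5), math.log(0.8)],
    [math.log(0.1), math.log(0.5), math.log(0.2)],
]
post = posterior(toy)
contrib = pixel_contributions(toy, 0, 1)
```

Because the class log-likelihood is a sum over features, the entries of `contrib` add up exactly to the log-odds between the two classes, so each feature's share of the decision can be read off directly. In the toy input the second feature is equally likely under both classes and therefore contributes nothing.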
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning · Model Reduction and Neural Networks
Methods: Knowledge Distillation
