TL;DR
This paper introduces FaCT, a model with faithful, model-inherent concept explanations that can be traced consistently across layers and classes, and proposes a new metric for evaluating concept consistency.
Contribution
FaCT provides a novel approach for concept explanations that are inherently faithful and shared across classes, along with a new metric for evaluating concept consistency.
Findings
FaCT's concepts are more consistent and interpretable than prior methods.
FaCT maintains competitive ImageNet performance.
The C$^2$-Score effectively evaluates concept-based explanations.
Abstract
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to…