Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions
Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

TL;DR
This paper introduces FaithTrace, a zero-shot method for generating faithful textual explanations for image classifiers by measuring influence via directional derivatives, improving interpretability without task-specific supervision.
Contribution
FaithTrace is a zero-shot approach that uses directional derivatives to quantify how strongly a textual explanation influences a model's prediction. The resulting influence score serves as a proxy for faithfulness and is extended into quantitative evaluation metrics.
Findings
FaithTrace produces more faithful explanations than baselines.
The influence score correlates with the true impact of explanations on predictions.
Quantitative metrics based on influence scores help evaluate explanation faithfulness.
Abstract
Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited faithfulness to the evidence underlying the model's decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier's feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping…
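The influence score described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy classifier head, the hand-picked direction vectors, and the names `class_logit`, `influence_score`, and `candidates` are all assumptions made for illustration. In particular, how FaithTrace obtains the text-induced direction is not stated in this summary, so the directions here are simply given. The sketch estimates the directional derivative of one class logit along a unit direction in feature space using a central finite difference.

```python
import math

def class_logit(features, weights, bias):
    # Hypothetical toy head standing in for a real classifier:
    # logit = sum_i w_i * tanh(x_i) + b
    return sum(w * math.tanh(x) for w, x in zip(weights, features)) + bias

def unit(v):
    # Normalise a direction so scores are comparable across candidates.
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def influence_score(features, direction, logit_fn, eps=1e-5):
    # Central-difference estimate of the directional derivative of
    # logit_fn at `features` along the unit-normalised `direction`.
    v = unit(direction)
    plus = [f + eps * d for f, d in zip(features, v)]
    minus = [f - eps * d for f, d in zip(features, v)]
    return (logit_fn(plus) - logit_fn(minus)) / (2 * eps)

# Assumed image features and head parameters (illustrative values only).
features = [0.2, -0.5, 1.0]
weights = [1.5, -2.0, 0.1]
bias = 0.3
logit_fn = lambda f: class_logit(f, weights, bias)

# Assumed text-induced directions for two candidate explanations.
candidates = {
    "striped fur": [1.0, -1.0, 0.0],
    "blue sky": [0.0, 0.0, 1.0],
}

scores = {text: influence_score(features, d, logit_fn)
          for text, d in candidates.items()}
best = max(scores, key=scores.get)  # candidate with the largest influence
print(best, scores)
```

Under these assumptions, the candidate whose direction moves the class logit most receives the highest score. With a real network, the same quantity would more commonly be computed as the inner product of the logit's gradient with the direction via automatic differentiation rather than by finite differences.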
