Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments
Icaro Re Depaolini, Uri Hasson

TL;DR
Deep neural networks can predict human judgments of image authenticity but do not provide consistent or reliable explanations for their predictions, raising questions about the interpretability of such models.
Contribution
This study systematically evaluates the robustness and cross-architecture consistency of attribution explanations in deep networks predicting human authenticity judgments.
Findings
Models predict ratings well, reaching about 80% of the noise ceiling.
Attribution maps are stable within architectures but inconsistent across different architectures.
Ensemble models improve prediction and attribution for image authenticity.
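The within- versus cross-architecture comparison of attribution maps can be illustrated with a small sketch. The summary does not say which similarity measure the authors used, so Spearman rank correlation over flattened maps is an assumption here, and the function names (`ranks`, `pearson`, `map_similarity`) are hypothetical helpers rather than anything from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def map_similarity(map_a, map_b):
    """Spearman correlation between two same-sized 2D attribution maps.

    Assumption: maps are nested lists of per-pixel (or per-patch) scores.
    """
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    return pearson(ranks(flat_a), ranks(flat_b))
```

Under this sketch, two maps that rank image regions in the same order score near 1 regardless of scale, while maps that highlight different regions score near 0 or below. Averaging `map_similarity` over image pairs from the same architecture versus different architectures would give one possible reading of "stable within, inconsistent across".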
Abstract
Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but the explanatory value of such heatmaps itself depends on their robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models,…
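To make the "about 80% of the noise ceiling" figure concrete, here is a minimal sketch of how such a ratio can be computed. The abstract does not state how the ceiling was estimated; the split-half reliability with Spearman-Brown correction used below is one common convention and is purely an assumption, as are the function names and the example numbers.

```python
def spearman_brown(split_half_r):
    """Correct a split-half correlation to estimate full-sample reliability."""
    return 2 * split_half_r / (1 + split_half_r)


def fraction_of_ceiling(model_r, split_half_r):
    """Model-to-human correlation expressed as a share of the noise ceiling.

    Assumption: the ceiling is the Spearman-Brown corrected correlation
    between mean ratings from two random halves of the human raters.
    """
    ceiling = spearman_brown(split_half_r)
    if ceiling <= 0:
        raise ValueError("noise ceiling must be positive")
    return model_r / ceiling


# Hypothetical numbers, not taken from the paper.
share = fraction_of_ceiling(model_r=0.6, split_half_r=0.6)
print(f"model reaches {share:.0%} of the noise ceiling")
```

The point of normalizing by the ceiling is that human ratings are themselves noisy, so a raw correlation well below 1 can still mean the model explains most of the variance that is explainable at all.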
