TL;DR
This paper introduces Semantic Pullbacks, a novel method for interpreting deep networks by reconstructing meaningful input features from neuron activations, improving explanation quality and robustness.
Contribution
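It presents a unified framework for local explanations of deep models: the network is treated as an input-conditioned affine operator, a neuron's preferred direction is pulled back to input space through the adjoint, and the result is refined by backward-only softening and iterative enhancement. The authors report better trade-offs than existing methods on faithfulness, stability, and sensitivity benchmarks.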
Findings
Semantic Pullbacks produce perceptually aligned, class-conditional explanations.
They enable coherent counterfactual perturbations.
They achieve state-of-the-art trade-offs on faithfulness, stability, and sensitivity benchmarks.
Abstract
In linear models, visualizing a weight vector naturally reveals the model's preferred input direction, but extending this intuition to deep networks via gradients or gradient ascent often yields brittle or adversarial-looking features. We argue that deep networks are better understood as input-conditioned affine operators, whose natural adjoint action pulls a neuron's preferred direction back to input space. We further refine this representation by backward-only softening and iterative enhancement to reconstruct coherent local structures encoded by the target neuron. This provides a unifying perspective on previously disparate ideas such as SmoothGrad, B-cos-style alignment, and Feature Accentuation. The resulting Semantic Pullbacks (SP) generate perceptually aligned, class-conditional post-hoc explanations that emphasize semantically meaningful features, facilitate coherent…
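To make the "input-conditioned affine operator" view concrete, here is a minimal sketch for a toy two-layer ReLU network. It is not the authors' implementation. It only illustrates that, once the ReLU gates are fixed at a given input x, the network equals an affine map W_x x + b_x, and that the adjoint action W_x^T e_k (row k of W_x) is a vector in input space. All names (`forward`, `local_affine`, `pullback`) and the random toy weights are assumptions made for illustration; the backward-only softening and iterative enhancement steps described in the abstract are not reproduced here.

```python
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(x, W1, b1, W2, b2):
    """Two-layer ReLU network; returns the output and the ReLU gate pattern at x."""
    pre = [p + b for p, b in zip(matvec(W1, x), b1)]
    gates = [1.0 if p > 0 else 0.0 for p in pre]
    hidden = [g * p for g, p in zip(gates, pre)]
    out = [o + b for o, b in zip(matvec(W2, hidden), b2)]
    return out, gates

def local_affine(x, W1, b1, W2, b2):
    """Return (W_x, b_x) such that f(x) = W_x x + b_x for the gates active at x."""
    _, gates = forward(x, W1, b1, W2, b2)
    n_hid = len(gates)
    # W_x = W2 @ diag(gates) @ W1
    W_x = [[sum(W2[k][j] * gates[j] * W1[j][i] for j in range(n_hid))
            for i in range(len(x))] for k in range(len(W2))]
    # b_x = W2 @ diag(gates) @ b1 + b2
    b_x = [sum(W2[k][j] * gates[j] * b1[j] for j in range(n_hid)) + b2[k]
           for k in range(len(W2))]
    return W_x, b_x

def pullback(x, k, W1, b1, W2, b2):
    """Adjoint action W_x^T e_k: the input-space direction for output unit k."""
    W_x, _ = local_affine(x, W1, b1, W2, b2)
    return W_x[k]

# Hypothetical toy weights, seeded for reproducibility.
rng = random.Random(0)
n_in, n_hid, n_out = 4, 6, 3
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [rng.gauss(0, 1) for _ in range(n_hid)]
W2 = [[rng.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_out)]
b2 = [rng.gauss(0, 1) for _ in range(n_out)]
x = [rng.gauss(0, 1) for _ in range(n_in)]

out, _ = forward(x, W1, b1, W2, b2)
W_x, b_x = local_affine(x, W1, b1, W2, b2)
recon = [a + b for a, b in zip(matvec(W_x, x), b_x)]

# The affine reconstruction should match the network output at x.
max_err = max(abs(a - b) for a, b in zip(out, recon))
direction = pullback(x, 0, W1, b1, W2, b2)
```

For a plain ReLU network this pulled-back direction coincides with the ordinary input gradient of output unit k wherever the gradient is defined, which is the brittle baseline the abstract describes; the refinements named there are what the paper adds on top.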
