Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models
Julien Colin, Lore Goetschalckx, Nuria Oliver, Thomas Serre

TL;DR
This paper introduces a framework for measuring the human interpretability of vision models. It finds that foundation models are less interpretable than supervised ones, and that interpretability is linked to feature locality and semantic alignment.
Contribution
The authors develop a psychophysics-based framework for quantifying the human interpretability of vision models, built on two protocols (localizability and nameability), and apply it to six vision transformers: two supervised ViTs and four foundation models.
Findings
Foundation models are less interpretable than supervised models.
Interpretability correlates with feature locality and semantic alignment.
Interpretability is not tied to downstream task performance.
Abstract
How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than behavioral responses, analyzing the…
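The abstract mentions a chance-anchored scoring function but the excerpt here does not define it. As a rough illustration only, the sketch below uses a common chance-correction normalization, (accuracy - chance) / (1 - chance), which maps chance-level performance to 0 and perfect performance to 1 so that tasks with different numbers of alternatives share one scale. The function name `chance_anchored_score` and the formula itself are assumptions for illustration, not the authors' definition.

```python
def chance_anchored_score(accuracy: float, chance: float) -> float:
    """Rescale raw accuracy so that chance maps to 0 and perfect maps to 1.

    Hypothetical helper: a generic chance-correction formula, assumed here
    for illustration; the paper's actual scoring function may differ.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    # Below-chance accuracy yields a negative score rather than being clipped.
    return (accuracy - chance) / (1.0 - chance)


# Two tasks with different chance levels land on the same scale.
two_way = chance_anchored_score(accuracy=0.75, chance=0.5)     # 2 alternatives
four_way = chance_anchored_score(accuracy=0.625, chance=0.25)  # 4 alternatives
print(two_way, four_way)
```

Under this normalization, 75% correct on a two-alternative task and 62.5% correct on a four-alternative task receive the same score, which is the sense in which a chance-anchored scale makes protocols comparable.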
