TL;DR
This paper introduces a black-box framework called Counterfactual Semantic Saliency to evaluate how well vision-language models align with human scene perception, revealing biases and gaps in model understanding.
Contribution
The authors propose a novel, model-agnostic method to quantify object importance and assess AI-human semantic alignment in complex scenes.
Findings
Models over-rely on large, central, and salient objects compared to humans.
A size bias explains much of the semantic divergence between models and humans.
Models rely less on people in scenes than human participants do.
Abstract
Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures, and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic framework that quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the…
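The ablation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says that CSS measures the "semantic shift" caused by removing an object, so the choice of cosine distance between description embeddings is an assumption made here for illustration, as are the names `cosine_distance` and `counterfactual_saliency` and the toy two-dimensional vectors. In practice the vectors would come from embedding a model's (or a participant's) description of the original scene and of each counterfactual variant.

```python
import math
from typing import Dict, List


def cosine_distance(u: List[float], v: List[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def counterfactual_saliency(
    original: List[float],
    ablated: Dict[str, List[float]],
) -> Dict[str, float]:
    """Score each object by how far the description embedding moves
    when that object is removed from the scene (assumed metric)."""
    return {name: cosine_distance(original, emb) for name, emb in ablated.items()}


# Toy, hand-made embeddings (hypothetical values, not data from the paper).
original_scene = [1.0, 0.0]
ablated_scenes = {
    "person": [0.0, 1.0],  # removing the person changes the description completely
    "lamp": [1.0, 0.0],    # removing the lamp leaves the description unchanged
}

scores = counterfactual_saliency(original_scene, ablated_scenes)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under this assumed metric, an object whose removal leaves the description unchanged scores near 0, and one whose removal changes it entirely scores near 1. Because only inputs and outputs are needed, the same scoring can be applied to closed-source models and to human responses, which is what makes a black-box comparison possible.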
