Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

Ziqi Wen; Parsa Madinei; Miguel P. Eckstein

arXiv:2605.13047 · cs.CV · May 14, 2026

TL;DR

This paper introduces a black-box framework called Counterfactual Semantic Saliency to evaluate how well vision-language models align with human scene perception, revealing biases and gaps in model understanding.

Contribution

The authors propose a novel, model-agnostic method to quantify object importance and assess AI-human semantic alignment in complex scenes.

Findings

1. Models over-rely on large, central, and salient objects compared to humans.

2. A size bias explains much of the semantic divergence between models and humans.

3. Models rely less on people in scenes than human participants do.

Abstract

Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the…
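
The core idea in the abstract (score an object by how much the scene's meaning shifts when that object is removed) can be illustrated with a minimal sketch. This is not the authors' implementation: the truncated abstract does not say how the semantic shift is measured, so the sketch assumes that each scene description has already been turned into an embedding vector and that shift is one minus cosine similarity. The function names, the toy two-dimensional embeddings, and the use of a Spearman rank correlation as an alignment score are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embedding has zero norm")
    return dot / (norm_u * norm_v)


def css_scores(original_emb, ablated_embs):
    """Map each object to an assumed semantic shift: 1 - cosine similarity
    between the original scene embedding and the embedding obtained after
    that object has been removed from the scene."""
    return {
        name: 1.0 - cosine_similarity(original_emb, emb)
        for name, emb in ablated_embs.items()
    }


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


# Toy data (made up): embeddings of a model's description of the full scene
# and of each counterfactual variant with one object removed.
original = [1.0, 0.0]
ablated = {"person": [0.0, 1.0], "tree": [1.0, 0.0], "car": [1.0, 1.0]}
model_scores = css_scores(original, ablated)

# Toy human importance scores (made up) for the same objects.
human_scores = {"person": 0.9, "tree": 0.1, "car": 0.3}

objects = sorted(model_scores)
alignment = spearman(
    [model_scores[o] for o in objects],
    [human_scores[o] for o in objects],
)
```

In this toy case removing the person changes the description completely, removing the tree changes nothing, and the model's ranking of objects matches the human ranking. With real data the embeddings would come from whatever text encoder is chosen, and that choice affects the scores.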

Code & Models

Repositories

starsky77/Counterfactual-Semantic-Saliency (GitHub)
