Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations
Gabriele Lombardo, Luigi Maiorana, Liliana Lo Presti, Marco La Cascia

TL;DR
This paper investigates whether embedding anisotropy causes failures in visual grounding models under counterfactual perturbations. It finds no meaningful correlation between cosine similarity and approximation errors, and suggests that finer-grained geometric properties of the embeddings need to be explored.
Contribution
Introduces a similarity-controlled counterfactual caption generation protocol to analyze grounding behavior and tests it on two Transformer-based models with different embedding geometries.
Findings
No meaningful correlation between cosine similarity and approximation errors.
Embedding anisotropy alone does not explain counterfactual failures.
Robustness likely depends on finer-grained geometric properties of embeddings.
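To make the quantities in these findings concrete, here is a minimal sketch, not the authors' code. It assumes anisotropy is estimated as the mean pairwise cosine similarity of a set of embeddings (one common proxy; the paper may use a different measure) and that the relation between similarity and error is checked with a Pearson coefficient. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def mean_pairwise_cosine(embeddings) -> float:
    """Average cosine similarity over all distinct pairs of rows.

    Values near 0 suggest directions spread evenly (isotropic);
    values near 1 suggest vectors crowded into a narrow cone (anisotropic).
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize each row
    sims = x @ x.T                                    # all pairwise cosines
    n = x.shape[0]
    # Drop the diagonal (self-similarity) before averaging.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def pearson(a, b) -> float:
    """Pearson correlation between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-ins for caption embeddings (assumption: no real model is used here).
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((200, 64))
anisotropic = isotropic + 5.0  # a shared offset pushes all vectors into a narrow cone

print("isotropic:  ", mean_pairwise_cosine(isotropic))
print("anisotropic:", mean_pairwise_cosine(anisotropic))
```

In a study of this kind, `pearson` would be applied to per-caption cosine similarities (original vs. counterfactual caption) and a per-caption measure of approximation error; a coefficient near zero would match the first finding above.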
Abstract
Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues). Because mismatched captions represent realistic edge cases, this behavior compromises reliability and raises concerns from an explainability perspective. Identifying its underlying causes is thus essential for improving model faithfulness and interpretability. Adopting a mechanistic interpretability viewpoint, this work examines whether embedding anisotropy contributes to counterfactual failures. A similarity-controlled counterfactual caption generation…
