CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution
Kaizhen Tan, Yang Feng, Heqing Du

TL;DR
CREG is a diagnostic framework that assesses the directional organization of evidence in vision-language models' spatial reasoning, revealing limitations of current attribution methods.
Contribution
Introduces CREG, a training-free method to evaluate the directional structure in attribution maps, highlighting gaps between attribution and true spatial reasoning.
Findings
Across three spatial-relation benchmarks, a box-only geometric control achieves Direction Alignment Error more than 30 degrees lower than current model-based attribution methods.
Current attribution methods often reflect image layout rather than true spatial relations.
Higher task accuracy does not necessarily improve directional attribution quality.
Abstract
Standard attribution heatmaps show where a vision-language model (VLM) focuses, but they do not reveal whether the recovered evidence is organized by the queried spatial relation or merely reflects image layout. To address this problem, we introduce CREG (Compass Relational Evidence Graph), a training-free diagnostic framework that converts token-level attribution into a reference-centered compass distribution and measures its directional alignment. CREG provides a shared directional readout across attribution methods and makes comparison with geometric controls explicit. Across three spatial-relation benchmarks, box-only geometry achieves Direction Alignment Error more than 30 degrees lower than current model-based attribution methods, leaving a substantial gap between attribution structure and simple target localization. To examine this gap, we apply a diagnostic battery including…
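The abstract does not spell out how attribution becomes a compass distribution, so the following is only a minimal Python sketch of one plausible reading, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the 8-sector binning, the use of a circular mean as the predicted direction, and the definition of the error as the smallest absolute angle between the predicted and queried directions.

```python
import numpy as np

def compass_distribution(attr, ref_center, n_bins=8):
    """Bin patch-level attribution into compass sectors around a reference.

    attr: (H, W) array of attribution scores over image patches (assumed layout).
    ref_center: (row, col) of the reference object in patch coordinates.
    Returns a length-n_bins distribution; bin 0 is centered on east (0 degrees),
    and angles increase counter-clockwise, so 90 degrees is "above".
    """
    h, w = attr.shape
    rows, cols = np.mgrid[0:h, 0:w]
    dy = ref_center[0] - rows          # image rows grow downward, so flip the sign
    dx = cols - ref_center[1]
    angles = np.degrees(np.arctan2(dy, dx)) % 360.0

    weights = np.clip(attr, 0, None).astype(float)   # keep positive evidence only
    weights[(dy == 0) & (dx == 0)] = 0.0             # direction undefined at the reference

    width = 360.0 / n_bins
    bins = np.floor(((angles + width / 2) % 360.0) / width).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=weights.ravel(), minlength=n_bins)
    total = hist.sum()
    # Fall back to a uniform distribution when there is no usable evidence.
    return hist / total if total > 0 else np.full(n_bins, 1.0 / n_bins)

def mean_direction_deg(dist):
    """Circular mean of a compass distribution, in degrees in [0, 360)."""
    centers = np.radians(np.arange(len(dist)) * 360.0 / len(dist))
    s = float((dist * np.sin(centers)).sum())
    c = float((dist * np.cos(centers)).sum())
    return float(np.degrees(np.arctan2(s, c)) % 360.0)

def angular_error_deg(pred_deg, true_deg):
    """Smallest absolute angle between two directions, in degrees in [0, 180]."""
    d = abs(pred_deg - true_deg) % 360.0
    return min(d, 360.0 - d)

# Toy example: all evidence sits two patches east of the reference.
attr = np.zeros((5, 5))
attr[2, 4] = 1.0
dist = compass_distribution(attr, ref_center=(2, 2))
pred = mean_direction_deg(dist)
err_east = angular_error_deg(pred, 0.0)    # queried relation: "to the right of"
err_north = angular_error_deg(pred, 90.0)  # queried relation: "above"
```

In this toy case the evidence lies directly east of the reference, so the readout agrees with an eastward relation and disagrees with an upward one by a quarter turn. Because the same readout can be applied to any attribution map, or to a map built from box geometry alone, it illustrates how different methods and geometric controls could be compared on one directional scale.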
