Expand VSR Benchmark for VLLM to Expertize in Spatial Rules
Peijin Xie, Lin Sun, Bingquan Liu, Dexin Wang, Xiangzheng Zhang, Chengjie Sun, Jiajia Zhang

TL;DR
This paper enhances the evaluation and training of Vision Large Language Models (VLLMs) for visual spatial reasoning by expanding datasets, integrating multiple visual encoders, and using diffusion models to improve positional understanding, resulting in a significantly more accurate model.
Contribution
The authors expanded the VSR benchmark with controllably generated spatial data and integrated multiple visual encoders, creating a VLLM that excels in visual positional reasoning.
Findings
VLLMs show over-sensitivity to language and under-sensitivity to visual position.
Expanded datasets and model structures improve positional reasoning accuracy.
VSRE, the resulting model, achieved over 27% higher accuracy on the VSR test set.
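As a rough illustration of how an accuracy figure like the one above could be computed, the sketch below scores true/false judgments on caption-image pairs and reports the gain between two models in percentage points. It is not the authors' evaluation code: the `VSRExample` type, the function names, and the toy data are all hypothetical, and the summary does not say whether the reported 27% is an absolute or a relative gain, so the absolute reading here is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VSRExample:
    """Hypothetical record: one caption-image pair with a binary judgment."""
    caption: str      # statement about a spatial relation in the image
    label: bool       # True if the caption matches the image
    prediction: bool  # the model's True/False answer


def accuracy(examples: Sequence[VSRExample]) -> float:
    """Fraction of examples where the prediction equals the label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(ex.label == ex.prediction for ex in examples)
    return correct / len(examples)


def gain_in_points(baseline: Sequence[VSRExample],
                   improved: Sequence[VSRExample]) -> float:
    """Absolute accuracy gain in percentage points (assumed reading)."""
    return 100.0 * (accuracy(improved) - accuracy(baseline))


# Toy data, invented purely for illustration.
captions = [
    ("The cat is under the table.", True),
    ("The cup is left of the laptop.", False),
    ("The dog is behind the sofa.", True),
    ("The book is on top of the shelf.", False),
]
baseline_preds = [True, True, False, False]   # 2 of 4 correct
improved_preds = [True, False, True, True]    # 3 of 4 correct

baseline = [VSRExample(c, y, p) for (c, y), p in zip(captions, baseline_preds)]
improved = [VSRExample(c, y, p) for (c, y), p in zip(captions, improved_preds)]

print(f"baseline accuracy: {accuracy(baseline):.2f}")
print(f"improved accuracy: {accuracy(improved):.2f}")
print(f"gain: {gain_in_points(baseline, improved):.1f} points")
```

Reporting the gain in points keeps the comparison unambiguous; a relative gain would instead divide the difference by the baseline accuracy.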
Abstract
Distinguishing spatial relations is a basic part of human cognition that requires fine-grained perception across instances. Although benchmarks like MME, MMBench and SEED have comprehensively evaluated various capabilities, including visual spatial reasoning (VSR), there is still a lack of evaluation and optimization datasets of sufficient quantity and quality for Vision Large Language Models (VLLMs) that specifically target visual positional reasoning. To handle this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found that current VLLMs exhibit a contradiction: over-sensitivity to language instructions and under-sensitivity to visual positional information. By expanding the original benchmark in two respects, tuning data and model structure, we mitigated this phenomenon. To our knowledge, we expanded spatially positioned image data…
Taxonomy
Topics: Geographic Information Systems Studies · Data Management and Algorithms · Semantic Web and Ontologies
Methods: Diffusion · Segment Anything Model
