Expand VSR Benchmark for VLLM to Expertize in Spatial Rules

Peijin Xie; Lin Sun; Bingquan Liu; Dexin Wang; Xiangzheng Zhang; Chengjie Sun; Jiajia Zhang

arXiv:2412.18224·cs.CV·December 25, 2024

TL;DR

This paper improves both the evaluation and the training of Vision Large Language Models (VLLMs) for visual spatial reasoning. It expands the VSR data with controllably generated images from diffusion models and integrates multiple visual encoders, yielding a model that is markedly more accurate at positional reasoning.

Contribution

The authors expanded the VSR benchmark with controllably generated spatial data and integrated multiple visual encoders, creating a VLLM that excels in visual positional reasoning.

Findings

01 VLLMs are over-sensitive to language instructions and under-sensitive to visual positional information.

02 Expanded datasets and model structures improve positional reasoning accuracy.

03 VSRE, the authors' model, achieved over 27% higher accuracy on the VSR test set.
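
A figure such as "over 27% higher accuracy" can be read either as an absolute gain in percentage points or as a relative gain over the baseline, and this summary does not say which is meant. The sketch below shows how the two readings differ when scoring true/false spatial statements. The labels, predictions, and function names are invented for illustration and are not taken from the paper or its repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of true/false spatial statements judged correctly."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def absolute_gain_points(baseline_acc: float, new_acc: float) -> float:
    """Accuracy difference expressed in percentage points."""
    return (new_acc - baseline_acc) * 100


def relative_gain_percent(baseline_acc: float, new_acc: float) -> float:
    """Accuracy difference expressed as a percentage of the baseline."""
    return (new_acc - baseline_acc) / baseline_acc * 100


# Hypothetical toy data: each label says whether a caption such as
# "the cat is left of the dog" correctly describes its image.
labels = [True, False, True, True, False, False, True, False, True, False]
baseline_preds = [True, True, True, False, True, False, True, True, False, False]
improved_preds = [True, False, True, True, False, False, True, True, True, False]

baseline_acc = accuracy(baseline_preds, labels)
improved_acc = accuracy(improved_preds, labels)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"improved accuracy: {improved_acc:.2f}")
print(f"absolute gain: {absolute_gain_points(baseline_acc, improved_acc):.1f} points")
print(f"relative gain: {relative_gain_percent(baseline_acc, improved_acc):.1f}%")
```

On this toy data the same improvement is 40 percentage points in absolute terms but 80% in relative terms, so a reported gain should be checked against the paper's tables before it is compared with other results.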

Abstract

Distinguishing spatial relations is a basic part of human cognition that requires fine-grained perception across instances. Although benchmarks such as MME, MMBench, and SEED have comprehensively evaluated various capabilities, including visual spatial reasoning (VSR), there is still a lack of evaluation and optimization datasets of sufficient quantity and quality for Vision Large Language Models (VLLMs) that specifically target visual positional reasoning. To address this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found that current VLLMs exhibit a contradiction: over-sensitivity to language instructions and under-sensitivity to visual positional information. By expanding the original benchmark in two respects, tuning data and model structure, we mitigated this phenomenon. To our knowledge, we expanded spatially positioned image data…

Code & Models

Repositories

peijin360/vsre (Official)

Videos

Expand VSR Benchmark for VLLM to Expertize in Spatial Rules· underline

Taxonomy

Topics: Geographic Information Systems Studies · Data Management and Algorithms · Semantic Web and Ontologies

Methods: Diffusion · Segment Anything Model