SPR-128K: A New Benchmark for Spatial Plausibility Reasoning with Multimodal Large Language Models
Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

TL;DR
This paper introduces SPR-128K, a dataset of over 128k samples for evaluating spatial plausibility reasoning in multimodal large language models (MLLMs), and proposes DPA-GRPO, a training method that adds a Dynamic Proportional Accuracy reward to GRPO to improve that reasoning.
Contribution
It provides a comprehensive spatial reasoning dataset and a novel training approach, DPA-GRPO, to enhance MLLMs' spatial plausibility reasoning ability.
Findings
The SPR-128K dataset evaluates spatial plausibility reasoning across four aspects.
DPA-GRPO improves model performance over standard methods.
Smaller models with DPA-GRPO outperform larger models in spatial reasoning.
Abstract
The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare, and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak spatial plausibility reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive spatial plausibility reasoning (SPR) dataset with over 128k samples, called SPR-128K. The dataset evaluates spatial plausibility reasoning ability under four aspects. Regarding data annotation, we investigate multiple approaches to acquire high-quality Chain-of-Thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework,…
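The abstract excerpt breaks off before defining the DPA reward, so that reward is not reproduced here. As a rough illustration of the GRPO side only, the sketch below shows the group-relative advantage step that GRPO is generally described as using: several responses are sampled for one prompt, each gets a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` term, the choice of population standard deviation, and the binary accuracy rewards are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per sampled response to a single prompt.
    `eps` is an assumed stabilizer so a zero-variance group yields zeros
    instead of a division error.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the paper
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary accuracy rewards for four responses to one prompt
# (1.0 = correct plausibility judgment). This is a placeholder, not DPA.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Responses that beat their group average receive positive advantages and the rest receive negative ones. A group in which all responses score the same yields all-zero advantages, which is the kind of situation a reshaped reward could plausibly target, though the excerpt does not say how DPA does so.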
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
