SPR-128K: A New Benchmark for Spatial Plausibility Reasoning with Multimodal Large Language Models

Zhiyuan Hu; Zheng Sun; Yi Wei; Long Yu

arXiv:2505.23265 · cs.CV · March 27, 2026

Open Access

TL;DR

This paper introduces SPR-128K, a large dataset for evaluating spatial plausibility reasoning in multimodal large language models, and proposes a new training method that significantly improves their reasoning capabilities.

Contribution

It provides a comprehensive spatial reasoning dataset and a novel training approach, DPA-GRPO, to enhance MLLMs' spatial plausibility reasoning ability.

Findings

01. SPR-128K dataset effectively evaluates spatial reasoning.

02. DPA-GRPO improves model performance over standard methods.

03. Smaller models with DPA-GRPO outperform larger models in spatial reasoning.

Abstract

The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare, and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak spatial plausibility reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive spatial plausibility reasoning (SPR) dataset with over 128k samples, called SPR-128K. The dataset evaluates spatial plausibility reasoning ability under four aspects. Regarding data annotation, we investigate multiple approaches to acquire high-quality Chain-of-Thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework,…
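
For readers unfamiliar with GRPO, the group-relative step that the DPA reward plugs into can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract shown here is truncated and does not define the DPA reward, so the hypothetical `accuracy_reward` below is a plain exact-match stand-in, and the function names, the sample answers, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def accuracy_reward(prediction: str, label: str) -> float:
    """Hypothetical stand-in reward: 1.0 for an exact match, else 0.0.

    This is NOT the DPA reward, which is not defined in the text above.
    """
    return 1.0 if prediction.strip().lower() == label.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the group of responses sampled for one prompt.

    GRPO scores a response relative to its sibling responses instead of
    using a learned value function. Population std is an assumption here;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled answers (illustrative data only).
answers = ["plausible", "implausible", "plausible", "plausible"]
label = "plausible"

rewards = [accuracy_reward(a, label) for a in answers]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(answers, advantages):
    print(f"{answer:12s} advantage = {adv:+.3f}")
```

Correct answers receive a positive advantage and the incorrect one a negative advantage. When every response in a group earns the same reward, all advantages collapse to zero and the group contributes no learning signal.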

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion