SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Sashuai Zhou; Qiang Zhou; Junpeng Ma; Yue Cao; Ruofan Hu; Ziang Zhang; Xiaoda Yang; Zhibin Wang; Jun Song; Cheng Yu; Bo Zheng; Zhou Zhao

arXiv:2603.22228 · cs.CV · March 24, 2026

Open Access

TL;DR

SpatialReward is a verifiable reward model that enhances fine-grained spatial accuracy in text-to-image generation by explicitly evaluating object placement and relations, leading to more consistent and human-aligned images.

Contribution

The paper introduces SpatialReward, a novel multi-stage reward model for assessing spatial layouts, and SpatRelBench, a comprehensive benchmark for evaluating spatial relationships in generated images.

Findings

01. Improves spatial consistency in generated images

02. Aligns generated images more closely with human judgments

03. Enhances overall quality of text-to-image synthesis

Abstract

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for…
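
To make the idea of a verifiable spatial check concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how relations are scored, so the center-comparison rule, the `Box` type, and the functions `relation_holds` and `spatial_reward` are all hypothetical names and assumptions chosen for illustration. The sketch assumes a decomposer has already produced (subject, relation, reference) triples and a detector has already produced one bounding box per entity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates, origin at the top-left corner."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def relation_holds(subject: Box, relation: str, reference: Box) -> bool:
    """Check one relation by comparing box centers (an assumed, simplified rule)."""
    sx, sy = subject.center
    rx, ry = reference.center
    if relation == "left of":
        return sx < rx
    if relation == "right of":
        return sx > rx
    if relation == "above":
        return sy < ry  # image y grows downward
    if relation == "below":
        return sy > ry
    raise ValueError(f"unsupported relation: {relation!r}")


def spatial_reward(
    constraints: List[Tuple[str, str, str]],
    detections: Dict[str, Box],
) -> float:
    """Return the fraction of constraints satisfied by the detected boxes.

    A constraint whose subject or reference was not detected counts as failed.
    """
    if not constraints:
        return 0.0
    satisfied = 0
    for subject, relation, reference in constraints:
        if subject not in detections or reference not in detections:
            continue
        if relation_holds(detections[subject], relation, detections[reference]):
            satisfied += 1
    return satisfied / len(constraints)


# Hypothetical detector output for a prompt such as "a cat to the left of a dog".
detections = {
    "cat": Box(10, 50, 60, 100),
    "dog": Box(120, 40, 200, 110),
}
constraints = [("cat", "left of", "dog"), ("cat", "above", "dog")]
reward = spatial_reward(constraints, detections)  # one of two constraints holds
```

A rule like this is deterministic and easy to audit, which is the sense of "verifiable" the sketch aims at. Relations that cannot be reduced to box geometry, such as occlusion or facing direction, would need the reasoning stage the abstract attributes to the vision-language model.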

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Topic Modeling