Enhancing Spatial Understanding in Image Generation via Reward Modeling

Zhenyu Tang; Chaoran Feng; Yufan Deng; Jie Wu; Xiaojie Li; Rui Wang; Yunpeng Chen; Daquan Zhou

arXiv:2602.24233 · cs.CV · March 2, 2026

Open Access

TL;DR

This paper introduces a reward model that improves the spatial understanding of text-to-image generation models, leading to more accurate spatial relationships and reducing the need for multiple sampling attempts.

Contribution

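We construct the SpatialReward-Dataset, a collection of over 80k preference pairs, and use it to build SpatialScore, a reward model that evaluates the accuracy of spatial relationships in generated images and serves as the reward signal for online reinforcement learning.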

Findings

01. SpatialScore surpasses leading proprietary models on spatial evaluation.

02. The reward model improves spatial accuracy across multiple benchmarks.

03. Online reinforcement learning with the reward model yields consistent gains in spatial understanding.

Abstract

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Using this dataset, we develop SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, which surpasses even leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation.…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Artificial Intelligence in Games