Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models

Parham Rezaei; Arash Marioriyad; Mahdieh Soleymani Baghshah; Mohammad Hossein Rohban

arXiv:2506.23418·cs.CV·July 1, 2025

TL;DR

This paper introduces a probabilistic framework for spatial relationships in text-to-image models, together with an evaluation metric that tracks human judgment more closely and a generation method that makes images follow the spatial layout given in the prompt.

Contribution

The paper presents a Probability of Superiority (PoS)-based evaluation metric (PSE) for measuring spatial relationship alignment, and an inference-time PoS-based generation method (PSG) that improves this alignment without model fine-tuning.

Findings

01. PSE correlates better with human judgment than traditional metrics.

02. PSG improves the accuracy of spatial configurations in generated images.

03. The proposed approach outperforms state-of-the-art methods across multiple benchmarks.

Abstract

Despite the ability of text-to-image models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts. To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved…
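To make the Probability of Superiority idea concrete, the following is a minimal sketch, not the paper's PSE formulation (which this excerpt does not show). It assumes the standard statistical definition, PoS = P(X > Y) + 0.5 · P(X = Y), and applies it to the column indices of pixels in two binary object masks. The helper names `probability_of_superiority` and `mask_columns`, and the toy masks, are hypothetical and chosen only for illustration.

```python
from itertools import product


def probability_of_superiority(xs, ys):
    """Estimate P(X > Y) + 0.5 * P(X = Y) by comparing every pair of samples."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for x, y in product(xs, ys):
        if x > y:
            wins += 1.0
        elif x == y:
            wins += 0.5  # ties count as half a win
    return wins / (len(xs) * len(ys))


def mask_columns(mask):
    """Return the column index of every foreground pixel in a binary mask."""
    return [col for row in mask for col, value in enumerate(row) if value]


# Hypothetical toy masks (assumption): object A on the left, object B on the right.
mask_a = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
mask_b = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
]

# Score for "A is to the left of B": the chance that a random pixel of B
# has a larger column index than a random pixel of A.
pos_left = probability_of_superiority(mask_columns(mask_b), mask_columns(mask_a))
print(f"PoS for 'A left of B': {pos_left:.2f}")
```

A score near 1 indicates that almost every pixel of B lies to the right of almost every pixel of A, a score near 0.5 indicates heavy overlap along that axis, and a score near 0 indicates the opposite arrangement. Unlike comparing only the two object centers, this kind of score uses the full extent of each object.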
