OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

Yoonjin Oh; Yongjin Kim; Hyomin Kim; Donghwan Chi; Sungwoong Kim

arXiv:2506.02015 · cs.CV · March 6, 2026



TL;DR

OSPO is a self-improving framework that enhances object-level text-image alignment in multimodal models. It explicitly constructs object-centric preferences and uses attention-based object masks, which improves fine-grained text-image alignment and reduces object hallucination in generated images.

Contribution

The paper introduces OSPO, a self-improving method that constructs object-centric preference data without external resources and employs attention-based masks for better object fidelity.

Findings

01. Significantly improves fine-grained text-image alignment.

02. Reduces object hallucination in generated images.

03. Outperforms prior self-improving methods and diffusion models.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object…


Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Artificial Intelligence in Games