OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim

TL;DR
OSPO is a self-improving framework that enhances object-level text-image alignment in multimodal models. It explicitly constructs object-centric preference data and uses attention-based object masks, which significantly improves fine-grained text-image alignment and reduces object hallucination.
Contribution
The paper introduces OSPO, a self-improving method that constructs object-centric preference data without external resources and employs attention-based masks for better object fidelity.
Findings
Significantly improves fine-grained text-image alignment.
Reduces object hallucination in generated images.
Outperforms prior self-improving methods and diffusion models.
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object…
Taxonomy
Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Artificial Intelligence in Games
