SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation

Leigang Qu; Haochuan Li; Wenjie Wang; Xiang Liu; Juncheng Li; Liqiang Nie; Tat-Seng Chua

arXiv:2412.05818·cs.CV·March 26, 2025

TL;DR

SILMM is a self-improving framework for large multimodal models that strengthens text-image alignment in compositional scenarios through iterative self-feedback and Direct Preference Optimization, with reported gains of over 30% on T2I-CompBench++ and around 20% on DPG-Bench.

Contribution

The paper presents a model-agnostic iterative self-improvement framework (SILMM) that enhances LMMs' text-image alignment using self-feedback and DPO, adaptable to both discrete and continuous visual representations.

Findings

01. Over 30% improvement on T2I-CompBench++

02. Around 20% improvement on DPG-Bench

03. Effective self-improvement in compositional text-to-image tasks

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce a model-agnostic iterative self-improvement framework (SILMM) that enables LMMs to provide helpful and scalable self-feedback and to optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations; while it is less…
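For readers unfamiliar with DPO, the standard per-pair DPO loss can be sketched as follows. This is a minimal illustration of the general objective, not the authors' implementation: the function name `dpo_loss`, the default `beta=0.1`, and the idea that each input is the summed log-probability of a generated sequence of discrete visual tokens given the text prompt are all assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (hypothetical helper).

    Each argument is assumed to be the total log-probability of one
    generated sequence (e.g. discrete image tokens) given the prompt,
    under either the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each sample than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the chosen and rejected samples.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Illustrative numbers only: the policy favors the chosen image
    # more than the reference does, so the loss falls below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In a self-improvement setting of the kind the abstract describes, the chosen and rejected samples would come from the model's own generations ranked by its own feedback; how that ranking is produced is not specified in the text above, so it is left out of the sketch.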

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies

Methods: Direct Preference Optimization