SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
Leigang Qu, Haochuan Li, Wenjie Wang, Xiang Liu, Juncheng Li, Liqiang Nie, Tat-Seng Chua

TL;DR
SILMM is a self-improving framework for large multimodal models that strengthens text-image alignment in compositional scenarios through iterative self-feedback and direct preference optimization, improving results by over 30% on T2I-CompBench++ and around 20% on DPG-Bench.
Contribution
The paper presents a model-agnostic iterative self-improvement framework (SILMM) that enhances LMMs' text-image alignment using self-feedback and DPO, adaptable to both discrete and continuous visual representations.
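To make the DPO step concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the SILMM code: the function name, argument names, and the default `beta` are assumptions for illustration, and the log-probabilities stand in for whatever sequence-level scores an LMM assigns to a preferred and a rejected image representation given the same prompt.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names here are hypothetical. Inputs are log-probabilities of the
    chosen and rejected outputs under the trainable policy and under a
    frozen reference model; beta scales the preference margin.
    """
    # How much more the policy favors each output than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    x = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# With no margin between policy and reference, the loss equals log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen output's likelihood relative to the rejected one lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In a self-improvement loop of the kind described above, the chosen and rejected outputs would come from the model's own generations ranked by its self-feedback; how SILMM builds those pairs is not specified in this summary.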
Findings
Over 30% improvement on T2I-CompBench++
Around 20% improvement on DPG-Bench
Effective self-improvement in compositional text-to-image tasks
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual upgrading, limiting flexibility and scalability. In this work, we introduce a model-agnostic iterative self-improvement framework (SILMM) that can enable LMMs to provide helpful and scalable self-feedback and optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations; while it is less…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: Direct Preference Optimization
