CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li

TL;DR
This paper introduces CoMat, a fine-tuning strategy for diffusion models that improves text-image alignment by correcting inadequate token attention activation through image-to-text concept matching, and reports state-of-the-art alignment results.
Contribution
We propose CoMat, an end-to-end fine-tuning approach that combines an image-to-text concept matching mechanism with an attribute concentration module to enhance text-image alignment in diffusion models.
Findings
CoMat significantly outperforms the SDXL baseline on text-image alignment benchmarks.
The method improves token attention activation and attribute binding.
It achieves state-of-the-art performance without human preference data.
Abstract
Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference…
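The concept matching idea in the abstract can be illustrated with a small sketch. It assumes the captioning model scores an image by the probability it assigns to each prompt token, and that the fine-tuning signal is the negative log-likelihood of the prompt under that model. The abstract does not spell out the loss, so the function names, the probabilities, and the example prompt below are all hypothetical.

```python
import math

def concept_matching_loss(token_probs):
    """Negative log-likelihood of the prompt tokens given a generated image.

    token_probs: per-token probabilities that a (hypothetical) captioning
    model assigns to the prompt tokens when conditioned on the image.
    A lower loss means the image supports the prompt better.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # Clamp to avoid log(0) when the captioner gives a token zero probability.
    return -sum(math.log(max(p, 1e-12)) for p in token_probs)

def least_supported_tokens(tokens, token_probs, k=1):
    """Return the k prompt tokens the captioner finds least evident in the image."""
    ranked = sorted(zip(tokens, token_probs), key=lambda pair: pair[1])
    return [token for token, _ in ranked[:k]]

# Illustrative numbers only; they are not taken from the paper.
tokens = ["a", "red", "bench", "and", "a", "yellow", "clock"]
aligned = [0.9, 0.8, 0.85, 0.9, 0.9, 0.8, 0.85]    # image shows both objects
misaligned = [0.9, 0.8, 0.85, 0.9, 0.9, 0.05, 0.1]  # image omits the yellow clock

loss_aligned = concept_matching_loss(aligned)
loss_misaligned = concept_matching_loss(misaligned)
ignored = least_supported_tokens(tokens, misaligned, k=2)
```

In this toy setting the image that drops the second object receives the larger loss, and the lowest-probability tokens point to the concepts the generator ignored. In actual fine-tuning such a loss would presumably be backpropagated through the image into the diffusion model, a step this sketch leaves out.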
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Diffusion
