CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Dongzhi Jiang; Guanglu Song; Xiaoshi Wu; Renrui Zhang; Dazhong Shen; Zhuofan Zong; Yu Liu; Hongsheng Li

arXiv:2404.03653 · cs.CV · November 28, 2024 · 6 cites

TL;DR

This paper introduces CoMat, a fine-tuning strategy for diffusion models that improves text-image alignment by using image-to-text concept matching to correct inadequate token attention activation, achieving state-of-the-art results on alignment benchmarks.

Contribution

We propose CoMat, an end-to-end fine-tuning approach that combines an image-to-text concept matching mechanism with an attribute concentration module to improve text-image alignment in diffusion models.

Findings

01. CoMat significantly outperforms baseline SDXL in alignment benchmarks.

02. The method improves token attention activation and attribute binding.

03. State-of-the-art performance is achieved without human preference data.

Abstract

Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between the text prompts and images is still challenging. The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference…
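The abstract says a captioning model measures image-to-text alignment but does not spell out the score. One plausible reading, assumed here and not confirmed by the text above, is the teacher-forced negative log-likelihood of the prompt tokens given the generated image. The sketch below works on hypothetical per-step captioner logits; the function names, the toy vocabulary, and the logit values are all illustrative, and no real captioning or diffusion model is called.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def per_token_nll(step_logits, prompt_ids):
    """Negative log-probability assigned to each prompt token.

    step_logits[i] stands in for the captioner's vocabulary logits at
    step i, conditioned on the image and the earlier prompt tokens
    (a hypothetical input; no real model is involved).
    """
    if len(step_logits) != len(prompt_ids):
        raise ValueError("need one logit vector per prompt token")
    return [-log_softmax(logits)[tok]
            for logits, tok in zip(step_logits, prompt_ids)]


def concept_matching_loss(step_logits, prompt_ids):
    """Sum of per-token NLLs: lower means the image supports the prompt better."""
    return sum(per_token_nll(step_logits, prompt_ids))


# Made-up 4-word vocabulary: 0="a", 1="red", 2="car", 3="blue".
prompt_ids = [0, 1, 2]  # "a red car"

# Captioner logits for an image that depicts every prompt concept.
faithful = [[5.0, 0.0, 0.0, 0.0],
            [0.0, 5.0, 0.0, 0.0],
            [0.0, 0.0, 5.0, 0.0]]

# Captioner logits for an image where the car came out blue, not red.
ignores_red = [[5.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 5.0],
               [0.0, 0.0, 5.0, 0.0]]

loss_faithful = concept_matching_loss(faithful, prompt_ids)
loss_ignored = concept_matching_loss(ignores_red, prompt_ids)
token_scores = per_token_nll(ignores_red, prompt_ids)
```

Under this assumption, the image that drops "red" receives the larger loss, and the per-token scores single out the ignored token, which is the kind of signal that could guide a diffusion model to revisit it during fine-tuning.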

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Diffusion