Information Theoretic Text-to-Image Alignment
Chao Wang, Giulio Franzese, Alessandro Finamore, Massimo Gallo, Pietro Michiardi

TL;DR
This paper introduces a novel method for aligning text-to-image diffusion models with user intentions by leveraging mutual information estimation, avoiding complex linguistic analysis or auxiliary models, and achieving superior results with minimal additional components.
Contribution
The paper proposes a mutual information-based fine-tuning approach that improves T2I model alignment without relying on external linguistic tools or vision-language models.
Findings
Outperforms state-of-the-art alignment methods
Requires only the pre-trained denoising network for MI estimation
Maintains high image quality while improving alignment
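For reference, the quantity behind the MI estimation mentioned above can be written with the standard textbook definitions. The notation is an assumption made here for illustration (x an image, y a prompt, p a probability density), not taken from the paper:

```latex
i(x; y) = \log \frac{p(x \mid y)}{p(x)},
\qquad
I(X; Y) = \mathbb{E}_{(x, y) \sim p(x, y)} \left[ i(x; y) \right]
```

The point-wise value i(x; y) scores a single prompt-image pair, while I(X; Y) is its average over the joint distribution.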
Abstract
Diffusion models for Text-to-Image (T2I) conditional generation have recently achieved tremendous success. Yet, aligning these models with users' intentions still involves a laborious trial-and-error process, and this challenging alignment problem has attracted considerable attention from the research community. In this work, instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models, we use Mutual Information (MI) to guide model alignment. In brief, our method uses self-supervised fine-tuning and relies on a point-wise MI estimation between prompts and images to create a synthetic fine-tuning set for improving model alignment. Our analysis indicates that our method is superior to the state-of-the-art, yet it only requires the pre-trained denoising network of the T2I model itself to estimate MI, and a simple fine-tuning…
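The selection step described in the abstract (score prompt-image pairs, keep the best ones as a synthetic fine-tuning set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mi_proxy` and `build_finetuning_set`, the parameter `k`, and the specific score (a weighted squared gap between conditional and unconditional noise predictions) are all hypothetical choices made here.

```python
def mi_proxy(eps_cond, eps_uncond, weights):
    """Hypothetical point-wise MI proxy (an assumption, not the paper's formula).

    eps_cond / eps_uncond: per-timestep noise predictions of the denoiser with
    and without the prompt, each a list of equal-length vectors.
    weights: one non-negative weight per sampled timestep.
    Returns the weighted mean squared gap between the two predictions.
    """
    total = 0.0
    for e_c, e_u, w in zip(eps_cond, eps_uncond, weights):
        gap = sum((a - b) ** 2 for a, b in zip(e_c, e_u))
        total += w * gap
    return total / len(weights)


def build_finetuning_set(scored, k):
    """Keep the k highest-scoring images for each prompt.

    scored: dict mapping a prompt to a list of (image_id, score) tuples.
    Returns a flat list of (prompt, image_id) pairs.
    """
    pairs = []
    for prompt, candidates in scored.items():
        # Sort candidates by score, highest first (sorted() is stable).
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        pairs.extend((prompt, image_id) for image_id, _ in ranked[:k])
    return pairs
```

In use, one would generate several candidate images per prompt with the pre-trained model, score each pair, and fine-tune on the output of `build_finetuning_set`. The value of `k` and the number of candidates per prompt are tuning choices that this sketch leaves open.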
Peer Reviews
Decision: ICLR 2025 Poster
1. The proposed method does not require additional image datasets for training. 2. The idea is relatively novel.
1. The comparison methods mentioned in line 329, "Attend and Excite (A&E) (Chefer et al., 2023b), Structured Diffusion Guidance (SDG) (Feng et al., 2023b) and Semantic-aware Classifier-Free Guidance (SCG) (Shen et al., 2024)", are implemented on SD 1.4, but the proposed work is implemented on SD 2.1, which is more powerful than SD 1.4 and may introduce evaluation bias. 2. In the experiments, the train set and the test set are split from the same dataset, so some correlation may exist between them; what's the resu
- The idea of introducing a self-supervised fine-tuning scheme is interesting. - Mutual Information in the pipeline is simple and effective. - It seems to be a plug-and-play module, which is useful for most T2I models.
- More detailed ablations are needed. The authors employ MI as the metric to select fine-tuning samples, which eliminates the extra usage of other models. However, what if we use SOTA VQA models as the metric? Intuitively, SOTA VQA models are more precise than the MI metric. - An inherent drawback of MI is that it can measure how much help comes from the prompt but cannot guide in the right direction. For example, in cases of color misalignment, how should we deal with this issue?
1. The main question, "Is mutual information meaningful for alignment?", is compelling and necessary, showing the potential of MI as a new direction for text-image alignment. 2. The paper is well-organized and easy to follow. 3. The proposed fine-tuning approach is interesting and seems to have predictive power.
1. MI scores are missing from the comparison tables and images despite being central to the method. 2. Figure 1 only demonstrates MI effectiveness on simple category prompts (color, texture, shape), lacking validation on more challenging cases like spatial relationships or complex compositions.
Taxonomy
Topics: Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction
Methods: Sparse Evolutionary Training
