Dual Caption Preference Optimization for Diffusion Models
Amir Saeidi, Yiran Luo, Agneet Chatterjee, Shamanthak Hegde, Bimsara Pathiraja, Yezhou Yang, Chitta Baral

TL;DR
This paper introduces Dual Caption Preference Optimization (DCPO), a novel framework that enhances diffusion model training by assigning two distinct captions per preference pair, leading to improved image quality and relevance.
Contribution
The paper proposes DCPO, a new data augmentation and optimization method that strengthens preference signals in diffusion models by using dual captions for each preference pair.
Findings
DCPO outperforms existing methods on multiple metrics.
The authors construct Pick-Double Caption, a dataset that provides distinct captions for the preferred and less-preferred images in each pair.
Significant improvements in image quality and prompt relevance.
Abstract
Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, within the existing preference datasets, the original caption often does not clearly favor the preferred image over the alternative, which weakens the supervision signal available during training. To address this issue, we introduce Dual Caption Preference Optimization (DCPO), a data augmentation and optimization framework that reinforces the learning signal by assigning two distinct captions to each preference pair. This encourages the model to better differentiate between preferred and less-preferred outcomes during training. We also construct Pick-Double Caption, a modified…
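The abstract does not spell out the training objective, so the following is only a minimal sketch of how two captions per preference pair could enter a preference loss. It assumes a Diffusion-DPO-style objective in which the denoising error on the preferred image is computed under its own caption and the error on the less-preferred image under a second caption. The function name, argument names, and the `beta` default are hypothetical and are not taken from the paper.

```python
import numpy as np


def dual_caption_dpo_loss(err_w_policy, err_w_ref, err_l_policy, err_l_ref, beta=1.0):
    """Sketch of a DPO-style preference loss with one caption per image.

    Assumed inputs (all hypothetical, shape [batch]):
      err_w_policy / err_w_ref: squared denoising error on the preferred image,
        conditioned on the preferred caption, for the trained and frozen
        reference models respectively.
      err_l_policy / err_l_ref: the same for the less-preferred image,
        conditioned on its own (second) caption.
    """
    err_w_policy = np.asarray(err_w_policy, dtype=float)
    err_w_ref = np.asarray(err_w_ref, dtype=float)
    err_l_policy = np.asarray(err_l_policy, dtype=float)
    err_l_ref = np.asarray(err_l_ref, dtype=float)

    # Negative values mean the trained model denoises better than the reference.
    diff_w = err_w_policy - err_w_ref
    diff_l = err_l_policy - err_l_ref

    # Reward improving on the preferred pair more than on the less-preferred pair.
    logits = -beta * (diff_w - diff_l)

    # -log(sigmoid(logits)) == log(1 + exp(-logits)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -logits)))


if __name__ == "__main__":
    # Identical errors everywhere give the neutral value log(2).
    print(dual_caption_dpo_loss([1.0], [1.0], [1.0], [1.0]))
    # Lower error on the preferred image reduces the loss.
    print(dual_caption_dpo_loss([0.5], [1.0], [1.0], [1.0]))
```

In this sketch the only place the dual captions appear is in how the four error terms are produced: the preferred and less-preferred branches are conditioned on different text, whereas a single-caption setup would condition both on the same prompt.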
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper is well-organized and easy to follow. Figures are clear to read, such as Figure 2. 2. The story is complete: they propose hypotheses and then use experimental results to verify them in Sec 3.3 with clear ablation studies. 3. The problem setup is clear. They also provide enough details to reproduce the work.
1. My biggest concern is about how well the proposed method generalizes as diffusion models develop. For example, in Figure 2, it is easy to distinguish the preferred and less-preferred images, as the latter does not even align with the original prompt. What if the model's development is already beyond the alignment stage? The current positive/negative samples differ only in alignment; what about subtler differences when both images are sufficiently aligned? 2. Line 188-189, could you e
The paper is well written, with the right amount of detail in both the main text and the appendix. The proposed method is clear and relatively straightforward to implement. On a popular open-source diffusion model (SD 2.1), several experiments are done to ablate the design details of the proposed approach. The set of metrics used is comprehensive, including both single-sided evaluations such as HPSv2 and side-by-side evaluations such as the one using GPT-4o as a judge.
The motivation behind the proposed approach is not clear to me. For the conflict distribution challenge, when the distribution overlap becomes larger, the dataset poses a harder problem for the model to optimize, but that isn't necessarily an issue as long as the two distributions are not identical. When diffusion models' quality gets better, the two distributions will inevitably become more and more similar, as both preferred and less preferred images from an optimized model will be clo
1. The dual caption framework is reasonable. DCPO introduces a dual-caption system that effectively addresses the problem of overlapping distributions in existing datasets. 2. This paper achieves better performance. Demonstrated improvements across multiple metrics (e.g., Pickscore, CLIPscore) and benchmarks (e.g., GenEval) show that DCPO enhances image quality and relevance significantly. 3. The experimental results are analyzed in detail. The paper includes extensive quantitative and qualitati
1. The proposed method depends on the caption quality. The quality of generated captions significantly affects performance, and challenges remain in creating effective captions for less preferred images without straying out-of-distribution. 2. While DCPO demonstrates quantitative improvements across several metrics, the qualitative results (e.g., Figure 1) indicate that the visual distinctions between images generated by DCPO and baseline methods are not significant. This subtle difference may l
As a reviewer from a broader field, I am not very familiar with the specific domain of this paper. Therefore, I am reviewing this paper from a generalist’s perspective. The strengths of this paper are: 1. It provides sufficient theoretical support for the motivation, which aligns well with the characteristics of ICLR papers. 2. The issues raised seem quite reasonable. 3. Extensive quantitative and qualitative experiments support the arguments presented.
However, I still have a few concerns: 1. The issues of conflict distribution and irrelevant prompts seem like two aspects of the same problem—both involve a single prompt (C) corresponding to two different images, which can lead to unstable optimization. Therefore, I think they could be consolidated into a single issue. 2. When comparing generated images, the improvements achieved by the proposed method could be highlighted more clearly; otherwise, it’s often not immediately obvious, as in Figu
Taxonomy
Topics: Media, Gender, and Advertising
Methods: Diffusion
