BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, Yi Yang

TL;DR
BideDPO introduces a novel framework for conditional image generation that effectively resolves conflicts between text prompts and conditioning images through disentangled preference optimization and an iterative data-model refinement process.
Contribution
It proposes a bidirectionally decoupled DPO framework with adaptive loss balancing and an automated data pipeline to improve alignment in conditional image synthesis.
Findings
Improves the text success rate by 35%.
Enhances condition adherence in generated images.
Validates effectiveness on the COCO dataset.
Abstract
Conditional image generation enhances text-to-image synthesis with structural, spatial, or stylistic priors, but current methods face challenges in handling conflicts between sources. These include 1) input-level conflicts, where the conditioning image contradicts the text prompt, and 2) model-bias conflicts, where generative biases disrupt alignment even when conditions match the text. Addressing these conflicts requires nuanced solutions, which standard supervised fine-tuning struggles to provide. Preference-based optimization techniques like Direct Preference Optimization (DPO) show promise but are limited by gradient entanglement between text and condition signals and by the lack of disentangled training data for multi-constraint tasks. To overcome this, we propose a bidirectionally decoupled DPO framework (BideDPO). Our method creates two disentangled preference pairs, one for the condition…
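The idea of two decoupled preference pairs can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the standard DPO loss on scalar log-likelihoods (diffusion models typically substitute denoising errors), and the weighting rule in `balanced_two_pair_loss` is a hypothetical stand-in for the paper's adaptive loss balancing, which the truncated abstract does not specify. All function and parameter names here are assumptions for illustration.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair, given scalar log-likelihoods."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def balanced_two_pair_loss(text_pair, cond_pair, beta=0.1):
    """Combine a text-preference pair and a condition-preference pair.

    Each pair is (logp_win, logp_lose, ref_logp_win, ref_logp_lose).
    ASSUMPTION: weights are proportional to each loss, so the objective that is
    currently worse gets more weight. This is a placeholder rule, not the
    balancing scheme from the paper.
    """
    l_text = dpo_loss(*text_pair, beta=beta)
    l_cond = dpo_loss(*cond_pair, beta=beta)
    total = l_text + l_cond  # strictly positive, since softplus > 0
    w_text, w_cond = l_text / total, l_cond / total
    return w_text * l_text + w_cond * l_cond, (w_text, w_cond)


# Toy numbers: the policy already prefers the text winner but not the condition winner.
text_pair = (-1.0, -3.0, -2.0, -2.0)
cond_pair = (-3.0, -1.0, -2.0, -2.0)
loss, (w_text, w_cond) = balanced_two_pair_loss(text_pair, cond_pair)
print(f"loss={loss:.4f} w_text={w_text:.3f} w_cond={w_cond:.3f}")
```

Keeping the two losses separate until the final weighted sum is what lets each objective contribute its own signal; a single pair whose winner is better on both axes would not reveal which constraint the model is failing.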
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper tackles a very practical problem: resolving conflicts between multiple conditioning signals, which arises when using or controlling most generative models.
2. The strongest claim is the extension of DPO to handle competing objectives by mitigating gradient entanglement.
3. The qualitative results in Figure 5 are of impressive quality and demonstrate the effectiveness of the two objectives working simultaneously.
1. The observed degradation at Iteration 4 suggests that the model begins to overfit to the biases and narrow distribution of its own generated data. Please provide more intuition for why this may or may not be the case.
2. The main weakness is the unverified quality of the VLM-based preference scoring. If the VLM is not a good proxy for human preference on novel, abstract, or conflicting constraints, the reported gains may be weaker than they appear.
- The paper introduces a disentangled preference-based optimization technique, which helps mitigate the frequent conflicts between conditional images and text prompts in conditional image generation tasks.
- An automatic Disentangled, Conflict-Aware Preference DPO data pipeline is presented, streamlining the process of handling conflicting conditions.
- The authors construct a DualAlign Benchmark, enabling robust evaluation of a model’s ability to resolve conflicts between visual and textual con
- The proposed method is quite straightforward and lacks significant novelty.
- The comparison with state-of-the-art post-training methods is limited; the paper only benchmarks against DPO and SFT.
- Unlike naive DPO (which suffers from gradient entanglement between text and condition signals), BideDPO’s decoupled preference pairs and adaptive loss balancing provide clear, independent optimization signals for each objective. This design effectively addresses the "trade-off dilemma" in multi-constraint generation.
- The automated data pipeline resolves the critical bottleneck of scarce conflict-aware DPO data for conditional generation. By leveraging LLMs for prompt generation and VLMs for
- The paper only compares BideDPO against naive DPO and supervised fine-tuning (SFT) on FLUX variants. It fails to include critical SOTA methods in conditional image generation, such as ControlNet++ (Li et al., 2024, which improves conditional control via consistency feedback), LooseControl (Bhat et al., 2024, for generalized depth conditioning), and OmniControlNet (Wang et al., 2024, for multi-modal control). Without these comparisons, BideDPO’s competitiveness in the broader conditional genera
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Multi-Objective Optimization Algorithms · Multimodal Machine Learning Applications
