Image Generation from Contextually-Contradictory Prompts
Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, Daniel Cohen-Or

TL;DR
This paper introduces a stage-aware prompt decomposition framework that uses large language models to analyze and resolve contextual contradictions in prompts, significantly improving the semantic accuracy of generated images from diffusion models.
Contribution
It presents a novel prompt decomposition method guided by LLMs to handle contextual contradictions, enhancing text-to-image generation fidelity.
Findings
Improved semantic alignment in image generation from contradictory prompts
Effective use of LLMs for prompt analysis and rewriting
Enhanced control over denoising stages for better image quality
Abstract
Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the…
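As a rough illustration of the stage-aware idea described in the abstract, the sketch below maps each denoising step to one proxy prompt from an ordered schedule. It is only a minimal sketch under stated assumptions, not the authors' implementation: the names `ProxyStage`, `prompt_for_step`, and `build_step_prompts`, the step boundaries, and the proxy prompts themselves are all hypothetical, and the step that would have an LLM produce the proxy prompts is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProxyStage:
    """One stage of a hypothetical proxy-prompt schedule."""
    start_step: int  # first denoising step (inclusive) that uses this prompt
    prompt: str


def prompt_for_step(stages: List[ProxyStage], step: int) -> str:
    """Return the prompt of the latest stage whose start_step <= step.

    Assumes `stages` is sorted by start_step and begins at step 0.
    """
    if not stages or stages[0].start_step != 0:
        raise ValueError("schedule must be non-empty and start at step 0")
    active = stages[0]
    for stage in stages:
        if stage.start_step <= step:
            active = stage
        else:
            break
    return active.prompt


def build_step_prompts(stages: List[ProxyStage], num_steps: int) -> List[str]:
    """Expand a stage schedule into one prompt per denoising step."""
    ordered = sorted(stages, key=lambda s: s.start_step)
    return [prompt_for_step(ordered, t) for t in range(num_steps)]


# Hypothetical schedule: early steps use a prompt without the contradiction,
# later steps move toward the full target prompt. Boundaries are assumptions.
schedule = [
    ProxyStage(0, "a man in a yellow outfit practicing ballet"),
    ProxyStage(4, "a martial artist in a yellow leotard and tutu practicing ballet"),
    ProxyStage(10, "Bruce Lee is dressed in a yellow leotard and tutu practicing ballet"),
]

step_prompts = build_step_prompts(schedule, num_steps=20)
```

In a real sampler, `step_prompts[t]` would select the text conditioning used at step `t`. How the intervals are chosen and how the proxy prompts are generated is the substance of the method and is not reproduced here.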
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The biggest strength is the elegant simplicity of the approach. It doesn't require retraining the base model, making it easy to apply to existing systems like FLUX or SD3. I was particularly impressed by the use of an LLM as a "planner" to resolve semantic conflicts—it feels like a natural and powerful fit. The results are convincing, showing clear improvements on challenging benchmarks where other methods produce nonsensical hybrids. It's also a major plus that the method doesn't degrade perfor
The approach's reliance on the LLM is also its main weakness. The quality of the final image is now dependent on the LLM's ability to correctly diagnose the contradiction and generate sensible proxies, which might not always be robust. Additionally, the method can only work around the base model's limitations; it won't fix fundamental issues like incorrect object counts or strange anatomy if the diffusion model itself struggles with them. The process for defining the timestep intervals for switc
The paper proposes a training-free approach for generating images from contextually contradictory prompts. The proposed approach outperforms FLUX, SD3.0, R2F, ELLA, and Annealing Guidance in both alignment and visual quality across multiple datasets. The qualitative examples also show the effectiveness of the approach.
1. The details of the attention mechanism in Figure 3 are not clear.
2. Construction of in-context examples: The prompts considered have limited diversity, with variations in the object and the background.
3. The approach considers limited variations in the visual style with respect to which the modifications are performed. For example, a single instance of a multi-object scenario will be challenging.
4. Comparison to approaches or negative prompting techniques: The paper focuses on progressive add
- **Clear figures.** Figures (e.g., Fig. 1, Fig. 5) effectively show the key motivation and the proposed pipeline.
- **Strong results.** The method outperforms baselines by a meaningful margin in reported settings.
- **Evaluation breadth.** The paper includes sufficient evaluation, including a human study.
1. **Problem prevalence.** It’s unclear how prevalent “contextual contradiction” remains in SOTA systems. I tested gpt-image-1 with the prompt *“Bruce Lee is dressed in a yellow leotard and tutu practicing ballet”* and it performed reasonably well. Many evaluated backbones are ~1 year old. During that period, unified understanding-and-generation models such as Janus and Show-o have surged, and they often have strong prompt alignment. Please evaluate on recent models using your contradiction prompts
1. The paper clearly articulates contextual contradiction as a specific and challenging problem, distinct from general compositionality failures.
2. Using an LLM to temporally decompose a prompt into a series of proxy prompts that align with the diffusion model's coarse-to-fine generation process is a reasonable, training-free solution.
3. The authors provide a robust and convincing evaluation.
1. The method's success is critically dependent on the quality of the LLM (GPT-4o) and the 20 manually-crafted in-context examples provided to it. The ablation in Table 3 shows that removing these examples causes a severe performance drop. This indicates the method is less of a general framework and more of a highly-effective prompt engineering strategy, which may not generalize to contradiction types not covered by the 20 examples.
2. The same VLM (GPT-4o) is used as a core component of the met
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
