All-in-One Conditioning for Text-to-Image Synthesis

Hirunima Jayasekara; Chuong Huynh; Yixuan Ren; Christabel Acquaye; Abhinav Shrivastava

arXiv:2602.09165·cs.CV·February 11, 2026

Open Access

TL;DR

This paper introduces a scene graph-based, zero-shot conditioning method for text-to-image synthesis that improves compositional accuracy and diversity by guiding diffusion models with soft visual cues during inference.

Contribution

It presents the ASQL Conditioner, a novel lightweight, scene graph-based conditioning mechanism that enhances the flexibility and coherence of text-to-image generation.

Findings

01. Improved semantic fidelity in complex prompt synthesis

02. Enhanced diversity and coherence in generated images

03. Effective zero-shot scene graph conditioning during inference

Abstract

Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advances in generating photorealistic outputs, current models often struggle to maintain semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis in scene graph structures, aiming to enhance the compositional abilities of existing models. Although prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Computer Graphics and Visualization Techniques