Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva; Guocheng Gordon Qian; Maya Goldenberg; Tsai-Shien Chen; Kfir Aberman; Sergey Tulyakov; Pinar Yanardag; Kuan-Chieh Jackson Wang

arXiv:2511.21691 · cs.CV · November 27, 2025



TL;DR

Canvas-to-Image introduces a unified diffusion framework that integrates diverse multimodal controls into image generation, enabling faithful, high-quality compositional images through a novel multi-task training approach.

Contribution

The paper proposes a multi-task training strategy (Multi-Task Canvas Training) that lets a diffusion model jointly interpret and integrate heterogeneous control signals supplied through a unified canvas interface.

Findings

01 Outperforms state-of-the-art methods in identity preservation and control adherence.

02 Excels in multi-person, pose-controlled, and layout-constrained image generation.

03 Generalizes well to multi-control scenarios at inference time.

Abstract

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation…
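
The idea of folding several controls into one composite canvas image can be illustrated with a small sketch. This is not the authors' pipeline. It is a minimal example, assuming that subject references are pasted into their target regions and that layout annotations are drawn as box outlines on the same image. The function names (`compose_canvas`, `resize_nearest`, `draw_box`), the box format, and the colors are hypothetical, and pose or text annotations are left out.

```python
import numpy as np

def resize_nearest(img, height, width):
    """Nearest-neighbour resize of an (H, W, 3) array (illustrative helper)."""
    rows = np.arange(height) * img.shape[0] // height
    cols = np.arange(width) * img.shape[1] // width
    return img[rows][:, cols]

def draw_box(canvas, box, color, thickness=2):
    """Draw a rectangle outline in place; box is (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    t = thickness
    canvas[y0:y0 + t, x0:x1] = color   # top edge
    canvas[y1 - t:y1, x0:x1] = color   # bottom edge
    canvas[y0:y1, x0:x0 + t] = color   # left edge
    canvas[y0:y1, x1 - t:x1] = color   # right edge

def compose_canvas(size, subjects, layout_boxes):
    """Build one RGB canvas from subject crops and layout boxes.

    size:         (height, width) of the canvas
    subjects:     list of (reference_image, (x0, y0, x1, y1)) pairs
    layout_boxes: list of (x0, y0, x1, y1) boxes drawn as outlines
    """
    height, width = size
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    # Paste each subject reference into its target region (assumed encoding).
    for ref, (x0, y0, x1, y1) in subjects:
        canvas[y0:y1, x0:x1] = resize_nearest(ref, y1 - y0, x1 - x0)
    # Overlay layout annotations as white box outlines (assumed encoding).
    for box in layout_boxes:
        draw_box(canvas, box, color=(255, 255, 255))
    return canvas

# Two synthetic "subjects" (flat gray patches) and one layout box.
ref_a = np.full((32, 32, 3), 200, dtype=np.uint8)
ref_b = np.full((48, 24, 3), 90, dtype=np.uint8)
canvas = compose_canvas(
    size=(128, 128),
    subjects=[(ref_a, (8, 16, 56, 112)), (ref_b, (72, 16, 120, 112))],
    layout_boxes=[(0, 0, 128, 12)],
)
```

In a setup like this, the resulting array would be passed to the generator as a single conditioning image alongside the text prompt, so the model sees identity and spatial cues in one input rather than through separate control branches.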


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Aesthetic Perception and Analysis