Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Minghui Hu; Jianbin Zheng; Daqing Liu; Chuanxia Zheng; Chaoyue Wang; Dacheng Tao; Tat-Jen Cham

arXiv:2306.00964 · cs.CV · March 1, 2024 · 2 cites

Open Access

TL;DR

Cocktail introduces a multi-modal control pipeline for text-conditional diffusion models, enabling refined spatial and multi-signal control for high-quality, faithful image generation.

Contribution

The paper presents a novel framework combining a hyper-network gControlNet, ControlNorm, and spatial guidance sampling to incorporate multiple control signals into diffusion models.

Findings

01. Effective multi-modal control of image generation
02. High fidelity and spatial accuracy in generated images
03. Flexible fusion of diverse control signals

Abstract

Text-conditional diffusion models are able to generate high-fidelity images with diverse contents. However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery, requiring the incorporation of additional control signals to bolster the efficacy of text-guided diffusion models. In this work, we propose Cocktail, a pipeline to mix various modalities into one embedding, amalgamated with a generalized ControlNet (gControlNet), a controllable normalisation (ControlNorm), and a spatial guidance sampling method, to actualize multi-modal and spatially-refined control for text-conditional diffusion models. Specifically, we introduce a hyper-network gControlNet, dedicated to the alignment and infusion of the control signals from disparate modalities into the pre-trained diffusion model. gControlNet is capable of accepting flexible modality…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Diffusion