FlexEControl: Flexible and Efficient Multimodal Control for   Text-to-Image Generation

Xuehai He; Jian Zheng; Jacob Zhiyuan Fang; Robinson Piramuthu; Mohit; Bansal; Vicente Ordonez; Gunnar A Sigurdsson; Nanyun Peng; Xin Eric Wang

arXiv:2405.04834·cs.CV·May 24, 2024

FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit, Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang

PDF

Open Access

TL;DR

FlexEControl introduces a novel, efficient multimodal control method for text-to-image generation, significantly improving faithfulness and reducing computational costs while supporting multiple input modalities.

Contribution

The paper presents FlexEControl, a new weight decomposition strategy that streamlines multimodal conditioning, enhancing efficiency and faithfulness in controllable T2I generation.

Findings

01

Reduces trainable parameters by 41%

02

Cuts memory usage by 30%

03

Doubles data efficiency

Abstract

Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Human Motion and Animation · Video Analysis and Summarization

MethodsDiffusion