MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

Marco Bellagente; Manuel Brack; Hannah Teufel; Felix Friedrich; Björn Deiseroth; Constantin Eichenberg; Andrew Dai; Robert Baldock; Souradeep Nanda; Koen Oostermeijer; Andres Felipe Cruz-Salinas; Patrick Schramowski; Kristian Kersting; Samuel Weinbach

arXiv:2305.15296·cs.CV·December 21, 2023·6 cites

TL;DR

MultiFusion enables complex, nuanced image generation from interleaved multilingual and multimodal inputs by fusing pre-trained models, avoiding extensive retraining and demonstrating effective capability transfer.

Contribution

The paper introduces a fusion framework that combines pre-trained models for multilingual, multimodal image generation without extensive retraining.

Findings

01. Effective transfer of capabilities from individual modules

02. Supports interleaved multilingual and multimodal inputs

03. Operates with models trained on monomodal, single-language data

Abstract

The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to…

Code & Models

Repositories

aleph-alpha/multifusion

Datasets

AIML-TUDA/MCC-250
dataset · 60 dl

Videos

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Diffusion