MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

Kunpeng Song; Yizhe Zhu; Bingchen Liu; Qing Yan; Ahmed Elgammal; Xiao Yang

arXiv:2404.05674 · cs.CV · April 9, 2024 · 3 citations

TL;DR

MoMA introduces a training-free, open-vocabulary personalized image generation model that leverages multimodal large language models and a novel self-attention shortcut to produce high-fidelity, identity-preserving images from a single reference.

Contribution

The paper presents MoMA, a tuning-free, plug-and-play module that improves personalized image generation by combining a multimodal LLM with a new feature-transfer method.

Findings

01. Outperforms existing methods in detail fidelity and identity preservation.

02. Requires only a single reference image for personalized generation.

03. Open-source implementation available.

Abstract

In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a…
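The abstract describes the self-attention shortcut only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that tokens of the image being generated attend over their own tokens plus tokens taken from the reference image. The function names, the optional `ref_mask`, and the use of identity projections instead of learned query/key/value weights are all illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs

def shortcut_self_attention(gen_tokens, ref_tokens, ref_mask=None):
    """Hypothetical shortcut: generated tokens attend to themselves plus
    the reference tokens kept by ref_mask (an assumed subject mask).
    Identity projections are used for brevity; a real layer would apply
    learned query/key/value projections first."""
    if ref_mask is None:
        ref_mask = [True] * len(ref_tokens)
    kept_ref = [tok for tok, keep in zip(ref_tokens, ref_mask) if keep]
    context = gen_tokens + kept_ref
    return attend(gen_tokens, context, context)

# Toy tokens: two from the image being generated, two from the reference.
gen = [[1.0, 0.0], [0.0, 1.0]]
ref = [[4.0, 4.0], [-4.0, -4.0]]

plain = attend(gen, gen, gen)  # ordinary self-attention
mixed = shortcut_self_attention(gen, ref, ref_mask=[True, False])
```

In this toy setup, `mixed` is pulled toward the unmasked reference token while `plain` sees only the generated tokens. Masking out every reference token reduces the shortcut to ordinary self-attention.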

Code & Models

Repositories

bytedance/MoMA
PyTorch · Official

Models

🤗
KunpengSong/MoMA_llava_7b
model · 234 downloads · 18 likes

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques

Methods: Diffusion