MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang

TL;DR
MoMA introduces a training-free, open-vocabulary personalized image generation model that leverages multimodal large language models and a novel self-attention shortcut to produce high-fidelity, identity-preserving images from a single reference.
Contribution
The paper presents MoMA, a tuning-free, plug-and-play module that improves personalized image generation by combining a multimodal LLM with a new feature transfer method.
Findings
Outperforms existing methods in detail fidelity and identity preservation.
Requires only a single reference image for personalized generation.
Open-source implementation available.
Abstract
In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a…
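The abstract does not spell out how the self-attention shortcut works, so the following is only a rough sketch of one plausible reading: tokens of the image being generated attend to themselves as usual, and a second attention branch over reference-image tokens is added on top. The function names, the `strength` parameter, and the use of identity projections in place of learned query/key/value projections are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def self_attention_with_shortcut(gen_tokens, ref_tokens, strength=1.0):
    """Hypothetical shortcut: ordinary self-attention plus a reference branch.

    Assumption: learned projections are omitted (identity), and the two
    branches are combined by a weighted sum controlled by `strength`.
    """
    base = attend(gen_tokens, gen_tokens, gen_tokens)   # generated tokens attend to themselves
    ref = attend(gen_tokens, ref_tokens, ref_tokens)    # generated tokens attend to the reference
    return [
        [b + strength * r for b, r in zip(base_row, ref_row)]
        for base_row, ref_row in zip(base, ref)
    ]


if __name__ == "__main__":
    gen = [[1.0, 0.0], [0.0, 1.0]]   # toy tokens of the image being generated
    ref = [[0.5, -0.5]]              # toy tokens from the reference image
    print(self_attention_with_shortcut(gen, ref, strength=1.0))
```

With `strength=0.0` the sketch reduces to plain self-attention, which is one way to picture the module as plug-and-play: the reference branch can be switched off without altering the base model's behavior.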
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Advanced Data Compression Techniques
Methods: Diffusion
