Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, Furu Wei

TL;DR
Kosmos-G is a multimodal model that enables zero-shot, subject-driven image generation with interleaved multi-image and text inputs, advancing the goal of treating images as a language for generation.
Contribution
It introduces Kosmos-G, a novel approach that aligns MLLMs with CLIP for flexible, zero-shot image generation without test-time tuning or decoder modifications.
Findings
Achieves zero-shot subject-driven generation with interleaved multi-image and text inputs
Seamlessly integrates with various U-Net techniques and personalized decoders
Requires no modifications to the image decoder during training
Abstract
Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no…
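The idea of aligning one embedding space to another "using the textual modality as an anchor" can be illustrated with a toy sketch. This is not the authors' implementation: the linear aligner, the least-squares (MSE) objective, the dimensions, and the random stand-in embeddings are all assumptions chosen for illustration. The sketch only shows the general pattern of fitting a map on text-only pairs and then reusing it on other inputs.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

def fit_linear_aligner(source_text_emb, target_text_emb):
    """Least-squares fit of W so that source_text_emb @ W approximates target_text_emb.

    Hypothetical helper: a linear map stands in for whatever aligner is actually used.
    """
    W, *_ = np.linalg.lstsq(source_text_emb, target_text_emb, rcond=None)
    return W

rng = np.random.default_rng(0)
n_captions, d_mllm, d_clip = 256, 32, 16  # illustrative sizes, not from the paper

# Random stand-ins for embeddings of the SAME captions from two encoders.
mllm_text = rng.normal(size=(n_captions, d_mllm))
hidden_map = rng.normal(size=(d_mllm, d_clip))
clip_text = mllm_text @ hidden_map + 0.01 * rng.normal(size=(n_captions, d_clip))

# Fit the aligner on text-only pairs (text is the anchor).
W = fit_linear_aligner(mllm_text, clip_text)

loss_before = mse(mllm_text @ np.zeros((d_mllm, d_clip)), clip_text)  # untrained baseline
loss_after = mse(mllm_text @ W, clip_text)

# Reuse the text-fitted aligner on embeddings of other (e.g. interleaved) inputs.
mllm_interleaved = rng.normal(size=(4, d_mllm))
aligned = mllm_interleaved @ W  # now in the target (CLIP-sized) space
```

Under these assumptions, an aligner trained only on captions produces vectors of the size the image decoder's text conditioning expects, which is why the decoder itself would not need to change.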
Taxonomy
Topics: Computational and Text Analysis Methods · Multimodal Machine Learning Applications · Topic Modeling
Methods: Concatenated Skip Connection · Convolution · Contrastive Language-Image Pre-training · Max Pooling · U-Net
