Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Xichen Pan; Li Dong; Shaohan Huang; Zhiliang Peng; Wenhu Chen; Furu Wei

arXiv:2310.02992 · cs.CV · April 29, 2024 · 6 cites

TL;DR

Kosmos-G is a multimodal model that enables zero-shot, subject-driven image generation with interleaved multi-image and text inputs, advancing the goal of treating images as a language for generation.

Contribution

It introduces Kosmos-G, a novel approach that aligns MLLMs with CLIP for flexible, zero-shot image generation without test-time tuning or decoder modifications.

Findings

01 Achieves zero-shot subject-driven generation with interleaved multi-image and text inputs

02 Seamlessly integrates with various U-Net techniques and personalized decoders

03 Requires no modifications to the image decoder during training

Abstract

Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to tackle the aforementioned challenge. Our approach aligns the output space of MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no…
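The idea of aligning one embedding space to another through a shared text anchor can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a plain linear aligner trained with a mean-squared-error objective, and it uses synthetic vectors in place of real MLLM and CLIP text embeddings. The names `fit_aligner`, `mse`, `source_text`, and `target_text` are hypothetical.

```python
import random

def mse(W, xs, ys):
    """Mean squared error between projected source vectors (W @ x) and targets y."""
    total, count = 0.0, 0
    for x, y in zip(xs, ys):
        for i in range(len(y)):
            pred = sum(W[i][j] * x[j] for j in range(len(x)))
            total += (pred - y[i]) ** 2
            count += 1
    return total / count

def fit_aligner(xs, ys, lr=0.1, steps=2000):
    """Fit a linear map W by gradient descent so that W @ x approximates y."""
    d_out, d_in, n = len(ys[0]), len(xs[0]), len(xs)
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, y in zip(xs, ys):
            for i in range(d_out):
                err = sum(W[i][j] * x[j] for j in range(d_in)) - y[i]
                for j in range(d_in):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * grad[i][j]
    return W

# Synthetic stand-ins (assumption): the same 50 captions embedded by two encoders.
random.seed(0)
hidden_map = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]  # unknown relation between spaces
source_text = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
target_text = [[sum(row[j] * x[j] for j in range(3)) for row in hidden_map]
               for x in source_text]

zero_map = [[0.0] * 3 for _ in range(2)]
initial_loss = mse(zero_map, source_text, target_text)
W = fit_aligner(source_text, target_text)
final_loss = mse(W, source_text, target_text)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.2e}")
```

Because only text pairs are used to fit the map, any other input embedded in the source space (for example, an image representation) can afterwards be projected with the same `W`. That is the sense in which text acts as an anchor in this toy setting.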

Code & Models

Repositories

microsoft/unilm/tree/master/kosmos-g (PyTorch, official)

Taxonomy

Topics: Computational and Text Analysis Methods · Multimodal Machine Learning Applications · Topic Modeling

Methods: Concatenated Skip Connection · Convolution · Contrastive Language-Image Pre-training · Max Pooling · U-Net