VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model

Jinze Yang; Haoran Wang; Zining Zhu; Chenglong Liu; Meng Wymond Wu; Mingming Sun

arXiv:2406.01059 · cs.CV · December 2, 2024

TL;DR

This paper introduces a versatile image outpainting framework that leverages multimodal large language models and a novel cross-attention module to enable customizable and resource-efficient image extrapolation, outperforming state-of-the-art methods.

Contribution

The work presents a novel, resource-efficient image outpainting method that uses a Multimodal Large Language Model (MLLM) for text-based customization and a new Center-Total-Surrounding (CTS) cross-attention module for enhanced spatial-textual interaction.

Findings

01. Outperforms state-of-the-art methods on three datasets.
02. Enables user-specific customization of outpainting results.
03. Requires only slight fine-tuning on existing diffusion models.

Abstract

In this paper, we focus on resolving the problem of image outpainting, which aims to extrapolate the surrounding parts given the center contents of an image. Although recent works have achieved promising performance, the lack of versatility and customization hinders their practical applications in broader scenarios. Therefore, this work presents a novel image outpainting framework that is capable of customizing the results according to users' requirements. First of all, we take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked parts of a given image. Accordingly, the obtained text prompts are introduced to endow our model with the capacity to customize the outpainting results. In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is…
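
The task setup in the abstract (a known center, surroundings to be extrapolated) can be illustrated with a small mask-construction sketch. This is not taken from the paper or its repository: the function names `center_mask` and `apply_mask`, the `keep_ratio` parameter, and the fill value are all hypothetical, and plain Python lists are used only so the example runs without dependencies.

```python
def center_mask(height, width, keep_ratio=0.5):
    """Build a height x width grid of 0/1 flags.

    1 marks a known (unmasked) center pixel; 0 marks a surrounding
    pixel that an outpainting model would have to generate.
    keep_ratio is an illustrative assumption, not a value from the paper.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep_h = max(1, round(height * keep_ratio))
    keep_w = max(1, round(width * keep_ratio))
    top = (height - keep_h) // 2
    left = (width - keep_w) // 2
    return [
        [
            1 if top <= r < top + keep_h and left <= c < left + keep_w else 0
            for c in range(width)
        ]
        for r in range(height)
    ]


def apply_mask(image, mask, fill=-1):
    """Keep pixels where mask is 1 and replace the rest with fill."""
    return [
        [px if keep else fill for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 4x4 single-channel "image" whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
mask = center_mask(4, 4, keep_ratio=0.5)
masked = apply_mask(image, mask)
```

In this toy case the central 2x2 block is kept and the 12 border pixels are replaced by the fill value; in the framework described above, that border region is what the model extrapolates, guided by the text prompts for the masked and unmasked parts.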

Code & Models

Repositories

ucasyjz/VIP · jax · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · AI in cancer detection

Methods: Focus · Diffusion