VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model
Jinze Yang, Haoran Wang, Zining Zhu, Chenglong Liu, Meng Wymond Wu, Mingming Sun

TL;DR
This paper introduces a versatile image outpainting framework that leverages multimodal large language models and a novel cross-attention module to enable customizable and resource-efficient image extrapolation, outperforming state-of-the-art methods.
Contribution
The work presents a resource-efficient image outpainting method that uses a multimodal large language model (MLLM) for text-based customization and a new Center-Total-Surrounding (CTS) cross-attention module for enhanced spatial-textual interaction.
Findings
Outperforms state-of-the-art methods on three datasets.
Enables user-specific customization of outpainting results.
Requires only slight fine-tuning on existing diffusion models.
Abstract
In this paper, we focus on resolving the problem of image outpainting, which aims to extrapolate the surrounding parts given the center contents of an image. Although recent works have achieved promising performance, the lack of versatility and customization hinders their practical applications in broader scenarios. Therefore, this work presents a novel image outpainting framework that is capable of customizing the results according to the requirements of users. First of all, we take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked parts of a given image. Accordingly, the obtained text prompts are introduced to endow our model with the capacity to customize the outpainting results. In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is…
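The outpainting setup described in the abstract (a known center, with the surrounding region to be generated) can be illustrated with a simple binary mask. This is only a minimal sketch of the problem setup, not the paper's code: the function name `center_mask` and the `keep_ratio` parameter are hypothetical, and the mask ratio the authors actually use is not stated here.

```python
def center_mask(height, width, keep_ratio=0.5):
    """Build a height x width grid for outpainting.

    1 marks the known center region; 0 marks the surrounding region
    that an outpainting model would be asked to generate.
    keep_ratio is a hypothetical parameter: the fraction of each side
    length that the known center occupies.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep_h = max(1, round(height * keep_ratio))
    keep_w = max(1, round(width * keep_ratio))
    top = (height - keep_h) // 2    # first row of the known center
    left = (width - keep_w) // 2    # first column of the known center
    return [
        [1 if top <= r < top + keep_h and left <= c < left + keep_w else 0
         for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    # Print a small mask: '#' is the known center, '.' is to be generated.
    for row in center_mask(8, 8, keep_ratio=0.5):
        print("".join("#" if v else "." for v in row))
```

In a text-conditioned pipeline such as the one the abstract outlines, a mask of this kind would typically accompany the image and the text prompts describing the masked and unmasked parts; how the paper combines them is not covered by this sketch.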
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · AI in cancer detection
Methods: Focus · Diffusion
