L-MAGIC: Language Model Assisted Generation of Images with Coherence

Zhipeng Cai; Matthias Mueller; Reiner Birkl; Diana Wofk; Shao-Yen Tseng; JunDa Cheng; Gabriela Ben-Melech Stan; Vasudev Lal; Michael Paulitsch

arXiv:2406.01843 · cs.CV · June 5, 2024 · 2 cites

TL;DR

L-MAGIC is a zero-shot method that uses large language models to guide pre-trained diffusion models in generating coherent 360-degree panoramic scenes from a single image, improving scene layout and view quality without fine-tuning.

Contribution

It introduces an approach that combines pre-trained language and diffusion models for panoramic scene generation, removing the need for fine-tuning and for human-written text input for each view.

Findings

01 Outperforms related methods in scene layout and view quality

02 Achieves over 70% preference in human evaluations

03 Supports multiple input modalities including text, depth, and sketches

Abstract

In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360-degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view…
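
To make the multi-view setting in the abstract concrete, the following is a minimal sketch of how a 360-degree panorama could be split into overlapping perspective views, each of which would then need its own guidance. It is not taken from the paper: the function name `view_yaws`, the field of view, and the overlap fraction are all hypothetical values chosen for illustration.

```python
import math


def view_yaws(fov_deg: float, overlap: float) -> list[float]:
    """Return evenly spaced yaw angles (degrees) whose views cover 360 degrees.

    fov_deg and overlap are illustrative assumptions, not values from the paper.
    overlap is the minimum fraction of each view shared with its neighbour.
    """
    if not 0 < fov_deg < 360:
        raise ValueError("fov_deg must be in (0, 360)")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # Largest angular step between view centres that still keeps the overlap.
    max_step = fov_deg * (1 - overlap)
    # Smallest number of views that covers the full circle at that step
    # (the small epsilon guards against floating-point round-up).
    n_views = math.ceil(360 / max_step - 1e-9)
    # Spread the views evenly so the last one wraps cleanly onto the first.
    return [i * 360 / n_views for i in range(n_views)]


if __name__ == "__main__":
    for yaw in view_yaws(fov_deg=90, overlap=0.5):
        print(f"view centred at yaw {yaw:.0f} degrees")
```

With a 90-degree field of view and 50% overlap this yields eight views spaced 45 degrees apart. Even at this scale, writing a text prompt by hand for every view is tedious, which is the per-view human input the abstract describes as time-consuming.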

Code & Models

Repositories

intellabs/mmpano
PyTorch · Official

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Diffusion