L-MAGIC: Language Model Assisted Generation of Images with Coherence
Zhipeng Cai, Matthias Mueller, Reiner Birkl, Diana Wofk, Shao-Yen Tseng, JunDa Cheng, Gabriela Ben-Melech Stan, Vasudev Lal, Michael Paulitsch

TL;DR
L-MAGIC is a novel zero-shot method that uses large language models to generate coherent 360-degree panoramic scenes from a single image, improving scene layout accuracy and view quality without fine-tuning.
Contribution
It introduces a new approach leveraging pre-trained language and diffusion models for panoramic scene generation, eliminating the need for fine-tuning and human input for each view.
Findings
Outperforms related methods in scene layout and view quality
Achieves over 70% preference in human evaluations
Supports multiple input modalities including text, depth, and sketches
Abstract
In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360 degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view…
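The iterative, language-guided view generation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the field-of-view and overlap values, and the evenly spaced yaw schedule are all assumptions made for illustration, and the language model and the diffusion inpainter are passed in as plain callables rather than real model calls.

```python
import math
from typing import Any, Callable, List, Tuple


def plan_view_yaws(fov_deg: float, overlap_deg: float) -> List[float]:
    """Return evenly spaced yaw angles (degrees) whose views cover 360 degrees.

    Hypothetical helper: the view schedule used by L-MAGIC is not given here.
    """
    if not 0 < fov_deg < 180:
        raise ValueError("fov_deg must be in (0, 180)")
    if not 0 <= overlap_deg < fov_deg:
        raise ValueError("overlap_deg must be in [0, fov_deg)")
    step = fov_deg - overlap_deg          # largest allowed spacing between views
    n_views = math.ceil(360.0 / step)     # fewest views that still close the loop
    return [i * 360.0 / n_views for i in range(n_views)]


def generate_views(
    input_view: Any,
    yaws: List[float],
    describe_view: Callable[[float], str],
    inpaint_view: Callable[[List[Tuple[float, Any]], float, str], Any],
) -> List[Tuple[float, Any]]:
    """Iteratively generate one view per yaw, starting from the input image.

    describe_view stands in for the language model that supplies a per-view
    prompt; inpaint_view stands in for the diffusion model that fills in the
    new view given the views generated so far. Both are assumed interfaces.
    """
    views: List[Tuple[float, Any]] = [(yaws[0], input_view)]
    for yaw in yaws[1:]:
        prompt = describe_view(yaw)                 # layout guidance for this view
        new_view = inpaint_view(views, yaw, prompt)  # conditioned on earlier views
        views.append((yaw, new_view))
    return views
```

With a 90-degree field of view and a 30-degree overlap, `plan_view_yaws(90, 30)` yields six views spaced 60 degrees apart. Because each prompt comes from a single source that knows the whole scene, an object such as a bed can be assigned to one view rather than repeated in several.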
Taxonomy
Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
Methods: Diffusion
