FlexGen: Flexible Multi-View Generation from Text and Image Inputs
Xinli Xu, Wenhang Ge, Jiantao Lin, Jiawei Feng, Lie Xu, HanFeng Zhao, Shunsi Zhang, Ying-Cong Chen

TL;DR
FlexGen is a versatile framework that enables controllable multi-view image generation from text and images, leveraging 3D-aware annotations and GPT-4V to produce consistent, adjustable views for applications like gaming and virtual reality.
Contribution
The paper introduces FlexGen, a novel multi-view generation model that incorporates 3D-aware text annotations and adaptive control for enhanced flexibility and controllability.
Findings
Outperforms existing multi-view diffusion models in controllability
Supports modification of appearance and material attributes
Enables generation of unseen object views with high consistency
Abstract
In this work, we introduce FlexGen, a flexible framework designed to generate controllable and consistent multi-view images, conditioned on a single-view image, a text prompt, or both. FlexGen tackles the challenges of controllable multi-view synthesis through additional conditioning on 3D-aware text annotations. We utilize the strong reasoning capabilities of GPT-4V to generate these annotations. By analyzing four orthogonal views of an object arranged as tiled multi-view images, GPT-4V can produce text annotations that include 3D-aware information about spatial relationships. By integrating the control signal with the proposed adaptive dual-control module, our model can generate multi-view images that correspond to the specified text. FlexGen supports multiple controllable capabilities, allowing users to modify text prompts to generate reasonable and corresponding unseen parts.…
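The tiling step mentioned in the abstract (four orthogonal views arranged as one tiled image before captioning) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2x2 layout, the view order, the `Image` type alias, and the function names `tile_views_2x2` and `make_view` are all assumptions introduced here, and images are represented as plain nested lists so the sketch runs without third-party libraries.

```python
from typing import List

# Hypothetical stand-in for an image: rows of grayscale pixel values.
Image = List[List[int]]


def tile_views_2x2(front: Image, right: Image, back: Image, left: Image) -> Image:
    """Arrange four equally sized views into one 2x2 tiled image.

    Assumed layout (the source does not specify one):
        front | right
        back  | left
    """
    views = [front, right, back, left]
    height, width = len(front), len(front[0])
    for view in views:
        if len(view) != height or any(len(row) != width for row in view):
            raise ValueError("all views must share the same height and width")
    # Concatenate rows side by side for each half, then stack the halves.
    top = [f_row + r_row for f_row, r_row in zip(front, right)]
    bottom = [b_row + l_row for b_row, l_row in zip(back, left)]
    return top + bottom


def make_view(value: int, size: int = 2) -> Image:
    """Build a constant-valued square image, used only for demonstration."""
    return [[value] * size for _ in range(size)]


tiled = tile_views_2x2(make_view(1), make_view(2), make_view(3), make_view(4))
```

In a real pipeline the tiled image would then be passed to a multimodal captioning model; that call is omitted here because the source does not describe its interface.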
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
+ FlexGen’s use of GPT-4V for 3D-aware captioning and the adaptive dual-control module offers flexibility in image synthesis, enabling detailed control over multi-view consistency and visual attributes.
- The main contribution appears to be the use of GPT-4V for generating detailed captions in multi-view synthesis. This application of existing technology lacks significant innovation and may not constitute a substantial advancement in multi-view generation.
- Qualitative results in Figures 5 and 6 do not clearly demonstrate a marked advantage of FlexGen over existing methods.
- Appendix Section A.2 lacks the corresponding figures and analysis that could further clarify the model’s performance an
- The use of GPT-4V for multi-view annotation is effective and shows promising results.
- The paper explores an interesting approach by incorporating material properties into multi-view synthesis within generative models.
- The primary contributions rely on integrating existing techniques (GPT-4V for captioning and a previously established reference-guided mechanism), rather than proposing fundamentally new methodologies.
- While the detailed captioning using GPT-4V is beneficial, the approach does not introduce a novel annotation strategy beyond leveraging GPT’s generative capacity.
- The core of the proposed approach relies heavily on previously established methods, specifically the reference view guidance. The
1: The proposed framework is flexible, supporting generation from single-view images, text prompts, or both. This allows for versatile applications and user interactions. The adaptive dual-control module enables fine-grained control over various aspects of the generated multi-view images, including unseen parts, material properties, and textures, showcasing impressive controllability compared to existing methods.
2: The paper presents extensive experiments on the Objaverse and GSO datasets, dem
1: While GPT-4V enables rich 3D-aware annotations, generating these can be computationally expensive and relies on a proprietary model. Exploring open-source MLLMs for captioning could be valuable, potentially increasing accessibility and reducing dependence on closed-source solutions. The paper could benefit from discussing the trade-offs between annotation quality and computational cost when using different models.
2: The paper mentions occasional difficulties with complex user-defined instru
- They annotate Objaverse with GPT-4V, which would be a good addition to the community if the authors are willing to open-source the annotations.
- Limited technical novelty. ImageDream (Wang and Shi, 2023) trained a multi-view image generation model from both image and text and showed similar capability. Note that their method can also be used to add new unseen details at the back; see their open-source code here: https://github.com/bytedance/ImageDream. I find that the use of both image and text and the shared attention mechanisms are close in the two works. My suggestion: (1) compare to them technically; (2) show ablation study why your design is better
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Diffusion
