Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
JiaKui Hu, Shanshan Zhao, Qing-Guo Chen, Xuerui Qiu, Jialun Liu, Zhao Xu, Weihua Luo, Kaifu Zhang, Yanye Lu

TL;DR
Omni-View introduces a unified model that enhances 3D scene understanding by integrating generation and understanding tasks through multiview images, achieving state-of-the-art results in scene comprehension and synthesis.
Contribution
It presents a novel multimodal framework combining understanding, texture, and geometry modules for improved 3D scene modeling from multiview images.
Findings
Achieved a score of 55.4 on the VSI-Bench benchmark.
Outperformed existing 3D understanding models.
Delivered strong performance in view synthesis and scene generation.
Abstract
This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module, which is responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D…
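The component layout described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the class name `OmniViewSketch`, the `forward` method, the idea that both generation modules read the understanding model's features, and the toy stand-in functions are all hypothetical and chosen only to show one way three such components could be wired together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@dataclass
class OmniViewSketch:
    """Hypothetical wiring of the three components named in the abstract."""
    # Assumption: multiview features + a question -> scene-level features.
    understanding: Callable[[Sequence[Vector], str], Vector]
    # Assumption: scene features -> appearance of a novel view (placeholder).
    texture: Callable[[Vector], Vector]
    # Assumption: scene features -> geometry estimate (placeholder).
    geometry: Callable[[Vector], Vector]

    def forward(self, views: Sequence[Vector], question: str) -> Dict[str, Vector]:
        # The understanding model runs first; both generation modules
        # consume its output, so their training signal could reach it.
        feats = self.understanding(views, question)
        return {
            "features": feats,
            "novel_view": self.texture(feats),
            "geometry": self.geometry(feats),
        }


def mean_pool(views: Sequence[Vector], question: str) -> Vector:
    # Toy stand-in: average the per-view vectors; the question is ignored.
    n = len(views)
    return [sum(col) / n for col in zip(*views)]


model = OmniViewSketch(
    understanding=mean_pool,
    texture=lambda f: [2.0 * x for x in f],   # toy stand-in
    geometry=lambda f: [sum(f)],              # toy stand-in
)
out = model.forward([[1.0, 2.0], [3.0, 4.0]], "How far is the chair from the table?")
```

In this toy setup the two generation branches share one feature vector, which is the simplest way to picture how generation objectives might influence the understanding model; the actual modules, losses, and two-stage schedule are described in the paper itself.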
Peer Reviews
Decision · ICLR 2026 Poster
- The paper presents a unified 3D understanding–generation framework that cleanly separates texture and geometry, a simple yet original design that operationalizes “generation facilitates understanding.”
- The two-stage recipe with dense-to-sparse curriculum and autoregressive NVS is well-motivated, technically sound, and shows careful loss design and gradient routing to benefit the understanding model.
- Writing is clear and structured, with concrete training details, datasets, metrics, and abl
- Limited novelty relative to prior unified frameworks (Bagel, VILA-U, BLIP3o, Harmon). The core idea of leveraging generation to aid understanding has precedents in 2D unified models and recent 3D works that inject reconstruction priors (e.g., Ross3D; VG-LLM/Spatial-MLLM via VGGT features). The split into texture vs. geometry resembles established “appearance vs. structure” decouplings in 3D pipelines (e.g., ViewCrafter, Voyager). Clarify what is fundamentally new beyond integrating these piec
- This paper demonstrates that generative 3D tasks (novel view synthesis, geometry estimation) can actively enhance 3D scene understanding, rather than being separate objectives.
- This paper has a unified architecture for 3D reasoning whose separate texture and geometry modules allow complementary learning of appearance and spatial structure, leading to better localization, spatial reasoning, and depth-aware Q&A.
- This paper outperforms specialized models in 3D understanding benchmarks while
- It would be beneficial to include a diagram that more precisely illustrates the functionality of each module and the architecture, compared to the current version.
- Additionally, visualizing 3D scene understanding / spatial reasoning / NVS from a single view as a video could also be an effective way to present the capabilities of the system.
- Is there a reason why you refer to Texture Module and Geometry Module in the equations (e.g., (eq. 1), (eq. 2)) without using italics? Also, I believ
1. Clear intuition and solid empirical validation. The paper builds upon a clear and intuitive idea, that generation can facilitate understanding, and the overall logic is easy to follow. Quantitative results across multiple benchmarks convincingly demonstrate the benefits of the proposed design, especially in spatial reasoning and novel view synthesis.
2. Architectural innovation. By decomposing the generation process into texture and geometry modules, the authors present a meaningful and mod
1. The qualitative results in the appendix are sparse, and there are no depth estimation visualizations or broader test cases. This makes it difficult to verify the model’s generalization and effectiveness beyond the reported metrics. For instance, the quality and consistency of metric-scale prediction from the geometry module remain uncertain — the reported results could be influenced by selective visualization or data bias, since the paper lacks convincing examples that demonstrate accurate ge
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Face recognition and analysis
