OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

TL;DR
OmniX leverages 2D generative models to produce immersive, realistic, and graphics-ready 3D scenes from panoramic perception, enabling advanced virtual environment creation with a unified framework.
Contribution
The paper introduces OmniX, a novel framework that repurposes 2D generative priors for panoramic perception and 3D scene generation, bridging appearance and intrinsic property understanding.
Findings
Effective panoramic perception and scene generation demonstrated.
High-quality multimodal panorama dataset constructed.
Enables immersive and physically realistic virtual worlds.
Abstract
There are two prevalent ways to construct 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic…
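As a rough illustration of what "lifting" a panorama into 3D involves, the sketch below unprojects an equirectangular depth map into a point cloud. It is not taken from the paper: the function names, the axis convention (y up, z forward), and the reading of depth as distance along each viewing ray are all assumptions made for this example.

```python
import math


def equirect_pixel_to_direction(u, v, width, height):
    """Unit ray direction for the centre of pixel (u, v).

    Assumed convention (hypothetical, not from the paper): longitude runs
    from -pi to pi left to right, latitude from +pi/2 to -pi/2 top to
    bottom, with y pointing up and z pointing forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def lift_panoramic_depth(depth):
    """Convert a row-major depth map (list of rows) into 3D points.

    Each depth value is treated as the distance along the viewing ray,
    so every point is simply the ray direction scaled by its depth.
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            dx, dy, dz = equirect_pixel_to_direction(u, v, width, height)
            d = depth[v][u]
            points.append((d * dx, d * dy, d * dz))
    return points


if __name__ == "__main__":
    # A tiny 2x4 panorama in which every ray hits a surface 2 units away.
    toy_depth = [[2.0] * 4 for _ in range(2)]
    for point in lift_panoramic_depth(toy_depth):
        print(point)
```

In a real pipeline the depth map would come from a panoramic depth predictor and the points would be meshed and textured; this sketch covers only the unprojection step.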
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper proposes a unified framework, OmniX, which repurposes pre-trained 2D flow matching models for panoramic image generation, perception, and completion tasks, demonstrating strong versatility and extensibility. 2. OmniX constructs a high-quality, multimodal panoramic dataset, PanoX, covering both indoor and outdoor scenes and providing dense geometry and material annotations, addressing a data gap in the current field. 3. This paper further proposes and compares various cross-modal a
1. Despite excellent performance in material estimation, the method still falls short of specialized approaches (e.g., MoGe) in depth estimation. The reconstructed 3D surfaces exhibit unevenness, which adversely affects subsequent rendering quality. 2. The authors note that the metallic prediction model has weak generalization capability. This is partly due to the scarcity of metallic material samples in the training data, also reflecting the limitations of 2D image priors for PBR material estim
1. Experiments on the PanoX, Structured3D, and HDR360-UHD datasets demonstrate that OmniX outperforms state-of-the-art (SOTA) methods in panoramic intrinsic decomposition. For instance, it achieves an albedo PSNR of 17.755 compared to 10.906 from DiffusionRenderer, while also achieving competitive performance in geometric estimation. 2. The paper proposes a novel unified paradigm, OmniX, which effectively integrates panoramic perception, generation, and completion into a single 2D flow-matc
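For readers unfamiliar with the albedo PSNR figures quoted above, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). The function name, the flat-list input, and the value range [0, 1] are assumptions for illustration; the paper's exact evaluation protocol is not given in this excerpt.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    `max_value` is the largest possible pixel value (1.0 is assumed here
    for images normalised to [0, 1]).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
    print(psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1]))
```

Because PSNR is logarithmic, the reported gap of 17.755 − 10.906 ≈ 6.85 dB corresponds to roughly 4.8 times lower mean squared error, assuming both numbers use the same value range. If PSNR was averaged per image, that ratio is only approximate.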
1. The paper's geometric validation is superficial. It relies solely on 2D metrics (e.g., AbsRel, angular error) for depth and normal maps, which fails to prove the model has learned true 3D spatial information.
2. The work lacks crucial 3D-level validation, such as:
   - Multi-view geometric consistency tests (e.g., verifying an object's size and shape remain consistent from different viewpoints).
   - Analysis of learned geometric features to show the model understands distinct 3D shapes.
3. Addi
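The two metrics named in this review can be sketched as follows, using their commonly used definitions: AbsRel as the mean of |pred − gt| / gt over valid depth pixels, and angular error as the mean angle between predicted and ground-truth normals. Function names and the validity rule (ground-truth depth > 0, non-zero normals) are assumptions for this example, not the paper's evaluation code.

```python
import math


def abs_rel(pred_depth, gt_depth):
    """Mean absolute relative depth error over pixels with gt > 0."""
    errors = [abs(p - g) / g for p, g in zip(pred_depth, gt_depth) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depth values")
    return sum(errors) / len(errors)


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between paired 3D normal vectors."""
    angles = []
    for p, g in zip(pred_normals, gt_normals):
        norm_p = math.sqrt(sum(c * c for c in p))
        norm_g = math.sqrt(sum(c * c for c in g))
        if norm_p == 0 or norm_g == 0:
            continue  # skip pixels without a defined normal
        cos_angle = sum(a * b for a, b in zip(p, g)) / (norm_p * norm_g)
        # Clamp to guard against rounding slightly outside [-1, 1].
        cos_angle = max(-1.0, min(1.0, cos_angle))
        angles.append(math.degrees(math.acos(cos_angle)))
    if not angles:
        raise ValueError("no valid normal pairs")
    return sum(angles) / len(angles)


if __name__ == "__main__":
    print(abs_rel([1.1, 2.0], [1.0, 2.0]))
    print(mean_angular_error_deg([(1, 0, 0)], [(0, 1, 0)]))
```

Both functions compare maps pixel by pixel in image space. That is the reviewer's concern: a low score does not by itself show that the geometry stays consistent when viewed from other positions.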
1. This work makes a valuable contribution to panoramic perception by providing a synthetic dataset that supports unified panoramic perception training, as well as a model for multimodal generation in the panoramic domain. 2. The writing is clear, well-organized, and makes the paper easy to follow throughout.
Lack of Novelty: The proposed methodology lacks sufficient novelty, as similar techniques have already been widely explored in the field of perspective image generation [1]. Essentially, the method appears to adapt existing approaches to the panoramic domain without introducing substantial methodological innovation.
Limited Comparison and Evaluation: The proposed method is evaluated only on the authors’ self-constructed PANOX dataset. Although the dataset is divided into training, validation, a
1. I like their motivation to infer multiple physically grounded modalities for enhanced scene generation and understanding. Extending this exploration to panoramic imagery is particularly relevant, as it can greatly benefit embodied perception and related downstream tasks. 2. The demonstrated downstream applications, such as relighting and physics-based simulation, are strong showcases of the practical value of the generated modalities. It is exciting that these capabilities are enabled direct
1. Insufficient analysis on performance gains: While the proposed model outperforms baseline methods, the paper does not clearly articulate why the design leads to better results. The architecture appears to be a relatively straightforward multimodal DiT extension, and it remains unclear which component contributes most to the improvements. For instance, is the gain primarily from leveraging flow-matching priors, conditioning design, or other architectural nuances? A deeper analysis or ablation
