NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction
Weirong Chen, Chuanxia Zheng, Ganlin Zhang, Andrea Vedaldi, Daniel Cremers

TL;DR
NOVA3R introduces a global, view-agnostic scene representation for 3D reconstruction from unposed images, overcoming pixel-alignment limitations and improving completeness and accuracy in scene and object reconstruction.
Contribution
It proposes a novel non-pixel-aligned approach with a scene-token mechanism and diffusion-based decoder, enabling better reconstruction of visible and invisible scene points.
Findings
Outperforms state-of-the-art methods in reconstruction accuracy and completeness
Recovers both visible and invisible scene points
Produces physically plausible geometries
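This page does not define how accuracy and completeness are measured. A common convention for point clouds, assumed here, is the mean nearest-neighbour distance in each direction: accuracy from prediction to ground truth, completeness from ground truth to prediction. The sketch below uses a hypothetical four-point scene to show why a reconstruction that covers only visible points can be perfectly accurate yet incomplete.

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(math.dist(point, other) for other in cloud)


def accuracy(predicted, ground_truth):
    """Mean distance from each predicted point to the ground truth (lower is better)."""
    return sum(nearest_distance(p, ground_truth) for p in predicted) / len(predicted)


def completeness(predicted, ground_truth):
    """Mean distance from each ground-truth point to the prediction (lower is better)."""
    return sum(nearest_distance(g, predicted) for g in ground_truth) / len(ground_truth)


# Hypothetical toy scene: the last ground-truth point is occluded in the input view.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
visible_only = ground_truth[:3]   # reconstruction limited to visible surfaces
amodal = list(ground_truth)       # reconstruction that also covers the occluded point

print("visible-only:", accuracy(visible_only, ground_truth), completeness(visible_only, ground_truth))
print("amodal:      ", accuracy(amodal, ground_truth), completeness(amodal, ground_truth))
```

In this toy case the visible-only reconstruction scores a perfect accuracy of 0 but a non-zero completeness error, because the occluded point has no nearby prediction.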
Abstract
We present NOVA3R, an effective approach for non-pixel-aligned 3D reconstruction from a set of unposed images in a feed-forward manner. Unlike pixel-aligned methods that tie geometry to per-ray predictions, our formulation learns a global, view-agnostic scene representation that decouples reconstruction from pixel alignment. This addresses two key limitations in pixel-aligned 3D: (1) it recovers both visible and invisible points with a complete scene representation, and (2) it produces physically plausible geometry with fewer duplicated structures in overlapping regions. To achieve this, we introduce a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. Extensive experiments on both scene-level and object-level datasets demonstrate that NOVA3R outperforms state-of-the-art…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is well-written and well-presented. The provided figures are clear and informative. The accompanying video clearly shows the method in action.
- The idea of predicting non-pixel-aligned geometry is novel and interesting. While the literature on 3D generation is rich, framing the task as an intersection of DUSt3R-style prediction and 3D-native prediction is clever and refreshing.
- The method achieves state-of-the-art performance on both the SCRREAM dataset and the NRGBD/7-Scenes datasets.
- The runtime performance of the flow-matching decoder is not analyzed. It would be nice to show how the proposed method compares to a regular point cloud decoder without the diffusion process.
- It is unclear how image-based properties, such as camera poses, intrinsics, or depth maps, can be derived. While the entire scene is placed in the coordinate system of the first view, it is not clear how accurate it is for other views. No pose or depth accuracy evaluation is provided.
- It is not cle
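The runtime concern above follows from how flow-matching decoders typically generate samples: they integrate a learned velocity field from noise to data, so each integration step costs one network evaluation, whereas a regular feed-forward decoder needs a single pass. The sketch below is a minimal illustration under stated assumptions, not the paper's sampler. The Euler integrator, the step count, and the closed-form toy velocity field standing in for the learned network are all hypothetical.

```python
def make_counting_velocity(target):
    """Toy velocity field that transports any start point to `target` at t = 1.

    It stands in for a learned network (an assumption for illustration) and
    counts how many times it is evaluated.
    """
    calls = {"n": 0}

    def velocity(x, t):
        calls["n"] += 1
        # Velocity of the straight-line path that reaches `target` at t = 1.
        return [(target_i - x_i) / (1.0 - t) for x_i, target_i in zip(x, target)]

    return velocity, calls


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x


target_point = [1.0, 2.0, 3.0]
velocity, calls = make_counting_velocity(target_point)
sample = euler_sample(velocity, x0=[0.0, 0.0, 0.0], num_steps=8)

print("sample:", sample)
print("velocity evaluations:", calls["n"])
```

Sampling cost grows linearly with `num_steps`, which is the kind of trade-off a runtime comparison against a single-pass decoder would need to report.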
As a non-pixel-aligned 3D reconstruction framework, NOVA3R decouples reconstruction from pixel-ray binding via a global Scene Token mechanism. It completes occluded regions, addressing the flaw of traditional pixel-aligned methods (e.g., DUSt3R, VGGT).
1. Innovation Needs Further Quantification. The core innovation of the paper lies in the synergy between the non-pixel-aligned paradigm and Scene Token. It is recommended to quantify the token’s contribution to global structure modeling.
2. Qualitative Results Require Objective Quantitative Support. The core goal of qualitative experiments (Figures 6, 7) is to verify "occlusion completion" and "physical plausibility," but current evaluations rely solely on visual judgment, leading to potential
1. By modeling the entire 3D reconstruction task as a two-stage generative process, NOVA3R breaks the traditional pixel-aligned paradigm that estimates geometric attributes by tying geometry to per-ray predictions. This innovative design decouples reconstruction from pixel alignment, enabling the model to learn a global, view-agnostic scene representation and thus reconstruct point clouds of both visible and invisible (occluded) regions, addressing the incompleteness limitation of pixel-aligned
1. Compared with end-to-end architectures, NOVA3R requires additional training effort due to its two-stage design. The model first trains a 3D latent autoencoder (Stage 1) with a flow-matching loss to compress complete point clouds into latent tokens and decode them, then optimizes the image encoder and learnable scene tokens (Stage 2) to map unposed images to the latent space. This two-stage training not only increases the overall training pipeline complexity but also demands more computationa
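The flow-matching loss mentioned for Stage 1 is not spelled out on this page. A minimal sketch of the standard linear-path formulation, assumed here, is: draw noise and a time t, interpolate between noise and the clean sample, and regress the model's predicted velocity onto the difference between the two. The function names are hypothetical, and the conditioning on latent tokens that the review describes is omitted for brevity.

```python
import random


def flow_matching_loss(velocity_model, clean_sample, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean_sample]
    t = rng.random()  # uniform in [0, 1)
    # Point on the straight path from noise (t = 0) to the clean sample (t = 1).
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean_sample)]
    # The velocity of that path is constant: clean sample minus noise.
    target_velocity = [c - n for n, c in zip(noise, clean_sample)]
    predicted = velocity_model(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target_velocity)) / len(clean_sample)


clean_point = [1.0, 2.0, 3.0]  # hypothetical 3D point standing in for training data


def zero_model(x_t, t):
    """Untrained stand-in that always predicts zero velocity."""
    return [0.0 for _ in x_t]


def oracle_model(x_t, t):
    """Exact velocity when the dataset contains only `clean_point`."""
    return [(c - x) / (1.0 - t) for x, c in zip(x_t, clean_point)]


loss_zero = flow_matching_loss(zero_model, clean_point, random.Random(0))
loss_oracle = flow_matching_loss(oracle_model, clean_point, random.Random(0))
print("zero model loss:  ", loss_zero)
print("oracle model loss:", loss_oracle)
```

The oracle drives the loss to numerically zero, while the zero predictor does not. In the two-stage pipeline described above, an objective of this kind would be used only in Stage 1, with Stage 2 training the image encoder and scene tokens separately.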
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
