Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
Bo Zhou, Qiuxia Lai, Zeren Sun, Xiangbo Shu, Yazhou Yao, Wenguan Wang

TL;DR
This paper introduces UniSplat, a feed-forward framework for learning robust 3D representations from unposed multi-view images. It targets three weaknesses of prior self-supervised methods: weak geometry induction, limited appearance detail, and inconsistency between geometry and semantics.
Contribution
UniSplat combines dual masking, coarse-to-fine Gaussian splatting, and pose-conditioned recalibration to improve 3D representation learning from unposed images.
Findings
Strengthens geometry induction when learning from unposed multi-view images.
Reduces appearance and semantic inconsistencies through progressive refinement.
Enforces geometric-semantic consistency for robust 3D representations.
Abstract
Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual…
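The dual-masking idea described in the abstract can be illustrated with a small sketch. The abstract excerpt does not say how geometry-rich regions are identified or what mask ratios are used, so everything below is an assumption made for illustration: the per-token `scores` (a hypothetical geometry-richness value such as edge strength), the ratios, and the function name `dual_masks` are not taken from the paper. The sketch masks a uniformly random subset of tokens for the encoder and masks the highest-scoring tokens for the decoder.

```python
import random


def dual_masks(scores, enc_ratio=0.5, dec_ratio=0.25, seed=0):
    """Return (encoder_mask, decoder_mask) as lists of booleans; True = masked.

    scores: hypothetical per-token geometry-richness values (an assumption,
    not the paper's definition).
    """
    rng = random.Random(seed)
    n = len(scores)

    # Encoder mask: a uniformly random subset of tokens.
    n_enc = int(n * enc_ratio)
    enc_idx = set(rng.sample(range(n), n_enc))

    # Decoder mask: the highest-scoring tokens, assumed to be geometry-rich.
    n_dec = int(n * dec_ratio)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    dec_idx = set(ranked[:n_dec])

    enc_mask = [i in enc_idx for i in range(n)]
    dec_mask = [i in dec_idx for i in range(n)]
    return enc_mask, dec_mask


# Eight tokens with made-up richness scores.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.4]
enc_mask, dec_mask = dual_masks(scores)
```

With these inputs the decoder mask covers the two highest-scoring tokens (indices 1 and 3), while the encoder mask covers four tokens chosen at random. A real model would likely sample the decoder mask stochastically rather than take a hard top-k; the hard selection here only keeps the example deterministic.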
