TL;DR
This paper introduces 3D-Implicit Depth Emergence, a novel approach enabling 3D perception to naturally emerge in multimodal models through geometric self-supervision, improving efficiency and performance in indoor scene understanding.
Contribution
The method reframes 3D perception as an emergent property of geometric self-supervision, which eliminates the need for explicit 3D encoding or external models and reduces inference latency.
Findings
Outperforms state-of-the-art methods on multiple 3D scene benchmarks.
Achieves a 55% reduction in inference latency.
Enables dependency-free 3D understanding in visual-language models.
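To put the latency figure in perspective, a fractional latency reduction can be converted into an equivalent speedup factor. The sketch below is only arithmetic on the reported 55% number. It assumes the reduction is measured relative to a baseline's per-query latency, which the summary does not specify, and the function name is hypothetical rather than taken from the paper.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction into a speedup factor.

    Assumption: `reduction` is the fraction of the baseline latency that
    was removed (0.55 for a 55% reduction), so the new latency equals
    baseline * (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# The reported 55% reduction corresponds to roughly a 2.2x speedup.
print(round(speedup_from_latency_reduction(0.55), 2))  # 2.22
```

Under that assumption, a 55% reduction means the model answers in a little under half the baseline time, which is slightly better than a 2x speedup.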
Abstract
Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize…
