BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Zhizhong Su

TL;DR
BIP3D is a novel image-centric 3D perception model that leverages pre-trained vision models and explicit 3D encoding to improve embodied agents' understanding of their environment, outperforming point cloud methods.
Contribution
The paper introduces BIP3D, which combines image features from pre-trained 2D vision foundation models with explicit 3D position encoding and a spatial enhancer module, improving perception performance beyond traditional point cloud-based methods.
Findings
Outperforms the current state of the art on the EmbodiedScan benchmark
Improves 3D detection by 5.69% over the previous state of the art
Improves 3D visual grounding by 15.25% over the previous state of the art
Abstract
In embodied intelligence systems, a key component is the 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point clouds, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which combines expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current…
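The abstract does not say how the explicit 3D position encoding is computed, so the following is only a generic sketch of the idea, not BIP3D's spatial enhancer. It assumes a pinhole camera with known intrinsics and a per-pixel metric depth, back-projects a pixel to a camera-frame 3D point, and adds a sinusoidal encoding of that point to the pixel's image feature. All function names and parameters (`backproject`, `sinusoidal_encoding`, `add_position_encoding`, `num_freqs`) are hypothetical and chosen for illustration.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth
    to a 3D point in the camera frame (assumed camera model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def sinusoidal_encoding(point, num_freqs=4):
    """Encode each coordinate with sin/cos at frequencies 1, 2, 4, ...
    Output length is 3 * 2 * num_freqs for a 3D point."""
    enc = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            enc.append(math.sin(freq * coord))
            enc.append(math.cos(freq * coord))
    return enc


def add_position_encoding(feature, point, num_freqs=4):
    """Add the 3D position encoding element-wise to an image feature
    vector of matching length (illustrative fusion, not the paper's)."""
    enc = sinusoidal_encoding(point, num_freqs)
    if len(feature) != len(enc):
        raise ValueError("feature length must equal 6 * num_freqs")
    return [f + e for f, e in zip(feature, enc)]


# Usage: a pixel at the principal point, 2 m away, with a zero feature.
point = backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
fused = add_position_encoding([0.0] * 24, point)
```

Adding the encoding to the feature is one common way to make image tokens position-aware; a learned projection of the 3D coordinates would be an equally plausible choice under the same assumptions.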
Taxonomy
Topics: Robotics and Automated Systems · Human Pose and Action Recognition · 3D Surveying and Cultural Heritage
