Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI
Chengyuan Xu, Radha Kumaran, Noah Stier, Kangyou Yu, Tobias Höllerer

TL;DR
This paper introduces a multimodal 3D representation combining semantic, linguistic, and geometric data, enabling in-situ machine learning for AR applications like spatial search and inventory management.
Contribution
It presents a novel multimodal 3D reconstruction pipeline and in-situ learning framework that integrate vision-language features for enhanced AR environment understanding.
Findings
Effective fusion of CLIP features into 3D models.
Successful demonstration of spatial search and inventory tracking in AR.
Open-source implementation and demo data provided.
Abstract
Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps in neural vision-language understanding to enhance environment perception for autonomous tasks. In this work, we introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation, enabling user-guided machine learning involving physical objects. We first present a fast multimodal 3D reconstruction pipeline that brings linguistic understanding to AR by fusing CLIP vision-language features into the environment and object models.…
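The idea of fusing vision-language features into object models and then searching them with a text query can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes each object has already been segmented and observed from several views, that a per-view embedding is available for each observation, and that a text query has been embedded into the same space. Random vectors stand in for real CLIP embeddings, and the names fuse_view_features, spatial_search, and the object positions are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse_view_features(view_features, weights=None):
    """Fuse per-view embeddings of one object into a single unit vector.

    Assumption: a weighted mean of normalized per-view features is used as
    the fused descriptor. The paper may fuse features differently.
    """
    feats = l2_normalize(np.asarray(view_features, dtype=np.float64))
    if weights is None:
        weights = np.ones(len(feats))
    weights = np.asarray(weights, dtype=np.float64)
    fused = (weights[:, None] * feats).sum(axis=0) / weights.sum()
    return l2_normalize(fused)


def spatial_search(query_embedding, object_features, object_positions, top_k=1):
    """Rank objects by cosine similarity to the query embedding.

    Returns a list of (name, score, position) tuples, best match first.
    """
    q = l2_normalize(np.asarray(query_embedding, dtype=np.float64))
    names = list(object_features)
    feats = np.stack([object_features[n] for n in names])
    scores = feats @ q  # cosine similarity, since all vectors are unit length
    order = np.argsort(-scores)[:top_k]
    return [(names[i], float(scores[i]), object_positions[names[i]]) for i in order]


# Synthetic stand-ins for real embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim = 512
concepts = {n: l2_normalize(rng.normal(size=dim)) for n in ("mug", "keyboard", "plant")}

# Each object is seen from 5 noisy views; fuse them into one descriptor.
object_features = {
    name: fuse_view_features(c + 0.05 * rng.normal(size=(5, dim)))
    for name, c in concepts.items()
}

# Hypothetical 3D object centers in meters.
object_positions = {
    "mug": (0.4, 0.9, 1.2),
    "keyboard": (0.1, 0.8, 1.0),
    "plant": (2.0, 0.0, 0.5),
}

# A query embedding near the "mug" concept, standing in for an embedded text prompt.
query = concepts["mug"] + 0.05 * rng.normal(size=dim)
results = spatial_search(query, object_features, object_positions, top_k=1)
print(results[0][0], results[0][2])
```

In a real system the synthetic vectors would be replaced by image and text embeddings from a vision-language model, and the returned position would drive an AR highlight at the matched object's location.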
Taxonomy
Topics: Robotics and Sensor-Based Localization · Video Surveillance and Tracking Methods
Methods: Contrastive Language-Image Pre-training · Attentive Walk-Aggregating Graph Neural Network
