OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding
Zixian Liu, Zhaoxi Chen, Liang Pan, Ziwei Liu

TL;DR
OnlineSI enables large language models to continuously understand and reason about 3D environments using spatial memory and multimodal data, advancing real-world embodied system applications.
Contribution
The paper introduces OnlineSI, a framework that maintains a finite spatial memory and integrates 3D point cloud data with semantic information to improve spatial understanding.
Findings
Effective spatial understanding demonstrated on two datasets.
Fuzzy F1-Score mitigates ambiguity in evaluation.
Framework supports real-world embodied system deployment.
Abstract
In recent years, researchers have become increasingly interested in how to enable Multimodal Large Language Models (MLLMs) to possess spatial understanding and reasoning capabilities. However, most existing methods overlook the importance of working continuously in an ever-changing world, and cannot readily be deployed on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that can continuously improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations, ensuring the size of the spatial memory does not increase as the input accumulates. We further integrate 3D point cloud information with semantic information, helping the MLLM to better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy F1-Score…
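To make the idea of a finite spatial memory concrete, here is a minimal Python sketch of one way a memory can stay bounded while a stream of observations keeps arriving. It is an illustration only, not the OnlineSI implementation: the class name BoundedSpatialMemory, the voxel-grid keying, the voxel_size parameter, and the evict-least-recently-observed policy are all assumptions made for this example, since the abstract does not say how the memory is organized or updated.

```python
from collections import OrderedDict


class BoundedSpatialMemory:
    """Hypothetical fixed-capacity memory keyed by voxel (not the paper's design)."""

    def __init__(self, capacity, voxel_size=0.5):
        self.capacity = capacity      # maximum number of stored cells
        self.voxel_size = voxel_size  # edge length of one voxel (assumed units: metres)
        self._cells = OrderedDict()   # voxel index -> (label, frame_id), oldest first

    def _voxel(self, point):
        # Map a 3D point to the integer index of the voxel that contains it.
        return tuple(int(c // self.voxel_size) for c in point)

    def observe(self, point, label, frame_id):
        key = self._voxel(point)
        # Re-observing a voxel overwrites its entry and marks it as most recent.
        self._cells.pop(key, None)
        self._cells[key] = (label, frame_id)
        # Evict the least recently observed voxels once capacity is exceeded.
        while len(self._cells) > self.capacity:
            self._cells.popitem(last=False)

    def labels(self):
        # Labels ordered from least to most recently observed.
        return [label for label, _ in self._cells.values()]

    def __len__(self):
        return len(self._cells)


memory = BoundedSpatialMemory(capacity=2)
memory.observe((0.1, 0.1, 0.1), "chair", frame_id=0)
memory.observe((2.0, 0.0, 0.0), "table", frame_id=1)
memory.observe((0.2, 0.2, 0.2), "sofa", frame_id=2)   # same voxel as "chair"
memory.observe((5.0, 5.0, 5.0), "lamp", frame_id=3)   # forces one eviction
```

Under these assumptions, the memory never holds more than capacity entries regardless of how many frames are processed, which is the property the abstract describes. A real system would likely merge or compress observations rather than simply dropping the oldest ones.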
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis
