OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding

Zixian Liu; Zhaoxi Chen; Liang Pan; Ziwei Liu

arXiv:2601.16538·cs.CV·March 9, 2026

Open Access

TL;DR

OnlineSI enables large language models to continuously understand and reason about 3D environments using spatial memory and multimodal data, advancing real-world embodied system applications.

Contribution

The paper introduces OnlineSI, a framework that maintains a finite spatial memory and integrates 3D point cloud information with semantic information to improve spatial understanding.

Findings

01. Effective spatial understanding is demonstrated on two datasets.

02. The proposed Fuzzy F1-Score mitigates ambiguity in evaluation.

03. The framework supports deployment on real-world embodied systems.

Abstract

In recent years, researchers have become increasingly interested in enabling Multimodal Large Language Models (MLLMs) to possess spatial understanding and reasoning capabilities. However, most existing methods overlook the need to work continuously in an ever-changing world, and cannot readily be deployed on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that continuously improves its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory that retains past observations while ensuring that the size of the memory does not grow as the input accumulates. We further integrate 3D point cloud information with semantic information, helping the MLLM better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy $F_{1}$-Score…
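The idea of a memory whose size stays bounded as frames accumulate can be illustrated with a small sketch. This is not the OnlineSI implementation, whose memory design is not described on this page. It is a minimal illustration that assumes observations arrive as labeled 3D points, that points are bucketed into voxels, and that the least recently updated voxel is evicted once a fixed capacity is reached. The class name `BoundedSpatialMemory` and the parameters `capacity` and `voxel_size` are hypothetical.

```python
import math
from collections import OrderedDict
from typing import Iterable, Tuple

Point = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


class BoundedSpatialMemory:
    """Hypothetical fixed-capacity store of voxelized, labeled observations.

    Illustrative only: the eviction rule (least recently updated voxel)
    is an assumption, not the method used in the paper.
    """

    def __init__(self, capacity: int, voxel_size: float = 0.1) -> None:
        if capacity <= 0 or voxel_size <= 0:
            raise ValueError("capacity and voxel_size must be positive")
        self.capacity = capacity
        self.voxel_size = voxel_size
        # Maps a voxel index to the latest semantic label seen in that voxel.
        self._cells: "OrderedDict[VoxelKey, str]" = OrderedDict()

    def _key(self, point: Point) -> VoxelKey:
        # Quantize a 3D point to the integer index of its voxel.
        x, y, z = (math.floor(c / self.voxel_size) for c in point)
        return (x, y, z)

    def update(self, observations: Iterable[Tuple[Point, str]]) -> None:
        """Merge one frame of (point, label) observations into memory."""
        for point, label in observations:
            key = self._key(point)
            self._cells[key] = label
            self._cells.move_to_end(key)  # mark voxel as most recently updated
            if len(self._cells) > self.capacity:
                self._cells.popitem(last=False)  # evict the stalest voxel

    def __contains__(self, key: VoxelKey) -> bool:
        return key in self._cells

    def __len__(self) -> int:
        return len(self._cells)
```

Calling `update` once per incoming frame keeps `len(memory)` at or below `capacity` no matter how long the stream runs, which is the property the abstract describes. Repeated observations of the same region overwrite the existing voxel instead of adding new entries.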

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis