Online Segment Any 3D Thing as Instance Tracking
Hanshi Wang, Zijian Cai, Jin Gao, Yiwei Zhang, Weiming Hu, Ke Wang, Zhipeng Zhang

TL;DR
This paper presents AutoSeg3D, a novel online 3D segmentation method that incorporates instance tracking and temporal information propagation to improve spatial understanding and robustness in embodied agents.
Contribution
The paper recasts online 3D segmentation as an instance tracking problem, using object queries for temporal information propagation and spatial consistency learning.
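To make the "segmentation as instance tracking" framing concrete, here is a minimal sketch of how per-frame object queries could be associated with persistent instance tracks. It is an illustration under stated assumptions, not the AutoSeg3D mechanism: the abstract available here is truncated, so the greedy cosine-similarity matching, the momentum update, and every name (`QueryTracker`, `sim_threshold`, `momentum`) are hypothetical choices made only for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class QueryTracker:
    """Hypothetical tracker: keeps one embedding per instance across frames."""

    def __init__(self, sim_threshold=0.5, momentum=0.8):
        self.sim_threshold = sim_threshold  # assumed minimum similarity for a match
        self.momentum = momentum            # assumed weight kept from the old embedding
        self.tracks = {}                    # instance id -> embedding
        self._next_id = 0

    def update(self, frame_queries):
        """Assign an instance id to each query embedding of the current frame."""
        # Score every (query, existing track) pair that clears the threshold.
        pairs = []
        for qi, q in enumerate(frame_queries):
            for tid, emb in self.tracks.items():
                s = cosine(q, emb)
                if s >= self.sim_threshold:
                    pairs.append((s, qi, tid))

        # Greedy one-to-one assignment, best similarity first.
        pairs.sort(reverse=True)
        assigned, used = {}, set()
        for s, qi, tid in pairs:
            if qi in assigned or tid in used:
                continue
            assigned[qi] = tid
            used.add(tid)

        ids = []
        for qi, q in enumerate(frame_queries):
            tid = assigned.get(qi)
            if tid is None:
                # Unmatched query: start a new instance track.
                tid = self._next_id
                self._next_id += 1
                self.tracks[tid] = list(q)
            else:
                # Matched query: blend the new observation into the track
                # so information carries over to later frames.
                m = self.momentum
                old = self.tracks[tid]
                self.tracks[tid] = [m * o + (1 - m) * n for o, n in zip(old, q)]
            ids.append(tid)
        return ids


tracker = QueryTracker()
ids_frame1 = tracker.update([[1.0, 0.0], [0.0, 1.0]])
ids_frame2 = tracker.update([[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]])
print(ids_frame1, ids_frame2)
```

In the second frame the first two queries are matched to the existing tracks even though their order has swapped, and the third query, which resembles neither track, opens a new instance. A learned, query-based pipeline would replace both the hand-written similarity and the fixed momentum.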
Findings
Surpasses ESAM by 2.8 AP on ScanNet200
Achieves consistent improvements on ScanNet, SceneNN, and 3RScan datasets
Enhances 3D segmentation with efficient temporal information exchange
Abstract
Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Generative Adversarial Networks and Image Synthesis
