UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation
Jiaying Lin, Dan Xu

TL;DR
UniFunc3D is a unified, training-free framework that uses a multimodal large language model as an active observer, applying spatial-temporal reasoning to improve 3D functionality segmentation in complex scenes.
Contribution
It presents an active spatial-temporal grounding approach that integrates semantic, temporal, and spatial reasoning in a single forward pass without task-specific training.
Findings
Achieves a 59.9% mIoU improvement on SceneFun3D
Surpasses existing training-free and training-based methods
Demonstrates an effective coarse-to-fine reasoning strategy
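For readers unfamiliar with the metric behind the first finding, mIoU (mean intersection over union) can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: it assumes each 3D mask is represented as a set of point indices, and the names mask_iou and mean_iou are hypothetical. The exact SceneFun3D protocol is not described on this page.

```python
from typing import Iterable, Set, Tuple


def mask_iou(pred: Set[int], gt: Set[int]) -> float:
    """IoU of two masks, each given as a set of point indices (an assumption)."""
    union = pred | gt
    if not union:
        # Both masks are empty: treat as a perfect match by convention.
        return 1.0
    return len(pred & gt) / len(union)


def mean_iou(pairs: Iterable[Tuple[Set[int], Set[int]]]) -> float:
    """Average IoU over (predicted, ground-truth) mask pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("mean_iou needs at least one mask pair")
    return sum(mask_iou(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    example = [({1, 2, 3}, {2, 3, 4}), ({1}, {1})]
    print(mean_iou(example))
```

In the example, the first pair shares two of four distinct points (IoU 0.5) and the second pair matches exactly (IoU 1.0).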
Abstract
Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the…
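The temporal side of the coarse-to-fine strategy described in the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the score callable is a hypothetical stand-in for whatever relevance judgment the multimodal model makes about a frame, and coarse_stride and top_k are invented parameters. The spatial side (focusing on high-detail parts within a frame) is not covered.

```python
from typing import Callable, List


def coarse_to_fine_select(
    num_frames: int,
    score: Callable[[int], float],  # hypothetical per-frame relevance score
    coarse_stride: int = 8,         # assumed sampling stride for the coarse pass
    top_k: int = 1,                 # assumed number of coarse candidates to refine
) -> List[int]:
    """Pick frame indices with a sparse pass followed by a dense local pass."""
    if num_frames <= 0:
        return []

    # Coarse pass: score a sparse, evenly strided subset of the video.
    coarse = list(range(0, num_frames, coarse_stride))
    best_coarse = sorted(coarse, key=score, reverse=True)[:top_k]

    # Fine pass: rescore every frame in a window around each coarse pick.
    selected: List[int] = []
    for c in best_coarse:
        lo = max(0, c - coarse_stride + 1)
        hi = min(num_frames, c + coarse_stride)
        best = max(range(lo, hi), key=score)
        if best not in selected:
            selected.append(best)
    return selected


if __name__ == "__main__":
    # Toy scorer: pretend frame 21 shows the interactive element most clearly.
    print(coarse_to_fine_select(40, lambda i: -abs(i - 21)))
```

The design choice illustrated here is that the expensive scorer runs on only a fraction of the frames before being applied densely around the most promising ones.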
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
