Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao

TL;DR
Chat-Scene++ is a 3D scene understanding framework that represents scenes as context-rich object sequences, improving object grounding and reasoning in complex environments and achieving state-of-the-art results on five 3D vision-language benchmarks.
Contribution
The paper presents an object-centric, sequence-based representation for 3D scenes that enhances multi-modal reasoning without additional task-specific training.
Findings
Achieves state-of-the-art results on five 3D vision-language benchmarks.
Effectively captures inter-object relationships and global semantics.
Supports grounded chain-of-thought reasoning in 3D environments.
Abstract
Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level…
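The idea of pairing each object representation with an identifier token can be illustrated with a minimal sketch. This is not the authors' implementation: the `<OBJ000>` token format, the `SceneObject` class, and the helper names are all hypothetical, and the short feature lists only stand in for the context-rich object features the abstract describes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    index: int            # position of the object in the scene's object list
    feature: List[float]  # placeholder for a context-rich object feature


def identifier_token(index: int) -> str:
    """Hypothetical identifier format; the paper's token names may differ."""
    return f"<OBJ{index:03d}>"


def build_object_sequence(
    objects: List[SceneObject],
) -> List[Tuple[str, List[float]]]:
    """Pair each object's identifier token with its feature, in scene order."""
    return [(identifier_token(o.index), o.feature) for o in objects]


def referenced_objects(text: str, objects: List[SceneObject]) -> List[int]:
    """Return indices of objects whose identifier tokens appear in the text."""
    return [o.index for o in objects if identifier_token(o.index) in text]


# Toy scene with three objects and two-dimensional placeholder features.
scene = [
    SceneObject(0, [0.1, 0.2]),
    SceneObject(1, [0.3, 0.4]),
    SceneObject(2, [0.5, 0.6]),
]
sequence = build_object_sequence(scene)
```

Because every object carries its own identifier, a model's text output can refer to objects directly, and a call such as `referenced_objects("The chair is <OBJ001>, next to <OBJ002>.", scene)` maps those mentions back to object indices for grounding.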
