Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

Haifeng Huang; Yilun Chen; Zehan Wang; Jiangmiao Pang; Zhou Zhao

arXiv:2603.27507·cs.CV·April 28, 2026

TL;DR

Chat-Scene++ is a 3D scene understanding framework that represents scenes as context-rich object sequences to improve object grounding and reasoning in complex environments, achieving state-of-the-art results on five 3D vision-language benchmarks.

Contribution

It presents a new object-centric, sequence-based representation for 3D scenes that enhances multi-modal reasoning without additional task-specific training.

Findings

01. Achieves state-of-the-art on five 3D vision-language benchmarks.
02. Effectively captures inter-object relationships and global semantics.
03. Supports grounded chain-of-thought reasoning in 3D environments.

Abstract

Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level…
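
The pairing of object representations with identifier tokens described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `<OBJ000>` token spelling, the `SceneObject` class, the `<feat>` placeholder, and the helper functions are all hypothetical, and the short float lists only stand in for the context-rich features the abstract mentions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class SceneObject:
    """One detected object. `feature` is a hypothetical stand-in for a
    context-rich embedding; the real feature extractor is not modeled here."""
    index: int
    feature: Sequence[float]


def identifier_token(index: int) -> str:
    # Hypothetical token spelling; the source does not specify the format.
    return f"<OBJ{index:03d}>"


def build_object_sequence(
    objects: List[SceneObject],
) -> List[Union[str, Sequence[float]]]:
    """Interleave each identifier token with the feature of its object."""
    sequence: List[Union[str, Sequence[float]]] = []
    for obj in objects:
        sequence.append(identifier_token(obj.index))
        sequence.append(obj.feature)
    return sequence


def build_prompt(objects: List[SceneObject], instruction: str) -> str:
    """Text-only view of the sequence: each feature becomes a <feat> slot
    (an assumed placeholder) that a model would replace with an embedding."""
    slots = " ".join(f"{identifier_token(o.index)} <feat>" for o in objects)
    return f"{slots}\n{instruction}"


if __name__ == "__main__":
    scene = [SceneObject(0, [0.1, 0.2]), SceneObject(1, [0.3, 0.4])]
    print(build_prompt(scene, "Which object is closest to the door?"))
```

Under these assumptions, an answer can refer to an object simply by emitting its identifier token, which is what makes the representation object-centric: grounding reduces to naming a token already present in the input sequence.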
