Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

Jintang Xue; Ganning Zhao; Jie-En Yao; Hong-En Chen; Yue Hu; Meida Chen; Suya You; C.-C. Jay Kuo

arXiv:2507.14555 · cs.CV · December 9, 2025

TL;DR

Descrip3D introduces a framework that improves 3D scene understanding by explicitly encoding object relationships with natural language, enhancing reasoning across tasks without extra supervision.

Contribution

It is the first to integrate object-level textual descriptions into 3D scene models for improved relational understanding and task performance.

Findings

01. Outperforms baseline models on five benchmark datasets.

02. Enhances reasoning in grounding, captioning, and question answering.

03. Effectively encodes object relationships using natural language.

Abstract

Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding,…
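The dual-level integration described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SceneObject` fields, the element-wise mean used for fusion, the `<OBJxxx>` identifier format, and the prompt layout are all assumptions made for illustration, and the toy vectors stand in for real encoder outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    """One detected object; every field is a hypothetical placeholder."""
    object_id: int
    emb_2d: List[float]    # stand-in for a 2D visual embedding
    emb_3d: List[float]    # stand-in for a 3D geometric embedding
    emb_text: List[float]  # stand-in for an embedding of the description
    description: str       # object-level text: attributes plus relations


def fuse_embeddings(obj: SceneObject) -> List[float]:
    """Embedding-level fusion, assumed here to be an element-wise mean."""
    if not (len(obj.emb_2d) == len(obj.emb_3d) == len(obj.emb_text)):
        raise ValueError("embeddings must share one dimension")
    return [(a + b + c) / 3.0
            for a, b, c in zip(obj.emb_2d, obj.emb_3d, obj.emb_text)]


def build_prompt(objects: List[SceneObject], question: str) -> str:
    """Prompt-level injection: list each description ahead of the question."""
    lines = [f"<OBJ{o.object_id:03d}>: {o.description}" for o in objects]
    return "Scene objects:\n" + "\n".join(lines) + "\nQuestion: " + question


# Toy scene with two objects and 2-dimensional embeddings.
chair = SceneObject(0, [1.0, 0.0], [0.0, 1.0], [2.0, 2.0],
                    "a wooden chair next to the desk")
desk = SceneObject(1, [0.0, 0.0], [3.0, 3.0], [0.0, 3.0],
                   "a white desk under the window")

fused = [fuse_embeddings(o) for o in (chair, desk)]
prompt = build_prompt([chair, desk], "What is next to the desk?")
print(fused)
print(prompt)
```

In this sketch the same description reaches the language model twice: once folded into the per-object vector, and once as readable text in the prompt. A real system would likely replace the mean with learned projections, which the abstract does not specify.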

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning