Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description

Mahmoud Ahmed; Junjie Fei; Jian Ding; Eslam Mohamed Bakr; Mohamed Elhoseiny

arXiv:2405.18937 · cs.CV · August 5, 2025

Open Access

TL;DR

This paper introduces Kestrel, a 3D multimodal large language model that produces fine-grained, part-aware grounded descriptions of 3D objects, together with 3DCoMPaT-GrIn, a new dataset that pairs point cloud descriptions with part-level segmentation masks. The work is motivated by robotic navigation and interaction.

Contribution

The paper presents Kestrel, a novel part-aware 3D multimodal LLM, and introduces the 3DCoMPaT-GrIn dataset for fine-grained part-level segmentation grounding tasks.

Findings

01. Kestrel effectively bridges language understanding and 3D segmentation grounding.

02. The 3DCoMPaT-GrIn dataset enables detailed part-aware 3D scene understanding.

03. Experiments show improved spatial reasoning at the part level for robotic tasks.

Abstract

In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. This dataset encompasses extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we…
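The abstract describes samples that pair a point cloud description with part-level segmentation masks. As a rough illustration only, one such sample could be represented as below. This is a minimal sketch under assumptions: the names `GroundedPhrase`, `GroundedSample`, and `make_sample`, the use of character offsets, and the toy chair example are all hypothetical and are not taken from the 3DCoMPaT-GrIn release or the Kestrel code.

```python
from dataclasses import dataclass, field


@dataclass
class GroundedPhrase:
    """Hypothetical link between a phrase in the description and a part mask."""
    start: int             # character offset where the phrase begins
    end: int               # character offset one past the end of the phrase
    point_ids: frozenset   # indices of the points that belong to this part


@dataclass
class GroundedSample:
    """Hypothetical sample: one object, one description, several grounded phrases."""
    num_points: int
    description: str
    phrases: list = field(default_factory=list)

    def phrase_text(self, phrase):
        # Recover the phrase string from its character span.
        return self.description[phrase.start:phrase.end]

    def validate(self):
        # Every span must lie inside the description and every
        # point index must lie inside the point cloud.
        for p in self.phrases:
            if not (0 <= p.start < p.end <= len(self.description)):
                raise ValueError("phrase span outside description")
            if any(i < 0 or i >= self.num_points for i in p.point_ids):
                raise ValueError("point index outside point cloud")
        return True


def make_sample(num_points, description, parts):
    """Build a sample from a mapping of phrase text to point indices."""
    phrases = []
    for text, ids in parts.items():
        start = description.find(text)
        if start < 0:
            raise ValueError(f"phrase not found in description: {text!r}")
        phrases.append(GroundedPhrase(start, start + len(text), frozenset(ids)))
    sample = GroundedSample(num_points, description, phrases)
    sample.validate()
    return sample


# Toy example (invented data): an 8-point "chair" with two described parts.
sample = make_sample(
    8,
    "A chair with a wooden seat and four metal legs.",
    {"wooden seat": range(0, 3), "four metal legs": range(3, 8)},
)
```

Storing each mask as a set of point indices keeps the sketch free of dependencies; a real pipeline would more likely use a boolean array aligned with the point cloud.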

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques