Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
Mahmoud Ahmed, Junjie Fei, Jian Ding, Eslam Mohamed Bakr, Mohamed Elhoseiny

TL;DR
This paper introduces Kestrel, a 3D multimodal large language model designed for fine-grained, part-aware grounded descriptions of 3D objects, supported by a new dataset and advanced reasoning mechanisms for robotic applications.
Contribution
The paper presents Kestrel, a novel part-aware 3D multimodal LLM, and introduces the 3DCoMPaT-GrIn dataset for fine-grained part-level segmentation grounding tasks.
Findings
Kestrel effectively bridges language understanding and 3D segmentation grounding.
The 3DCoMPaT-GrIn dataset enables detailed part-aware 3D scene understanding.
Experiments show improved spatial reasoning at the part level for robotic tasks.
Abstract
In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. This dataset encompasses extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
