Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan

TL;DR
Intent3D introduces a novel task of 3D object detection based on human intention in RGB-D scans, along with a new dataset and a specialized model to understand and reason about human goals in 3D environments.
Contribution
The paper presents the first dataset and baseline models for 3D intention grounding, enabling AI to detect objects based solely on human intentions without explicit references.
Findings
The Intent3D dataset contains 44,990 intention texts and 209 classes.
Baseline models demonstrate the feasibility of intention-based 3D object detection.
IntentNet outperforms existing methods by focusing on intention understanding and reasoning.
Abstract
In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back". Closely related, 3D visual grounding focuses on understanding human reference. To achieve detection based on human intention, it relies on humans to observe the scene, reason out the target that aligns with their intention ("pillow" in this case), and finally provide a reference to the AI system, such as "A pillow on the couch". Instead, 3D intention grounding challenges AI agents to automatically observe, reason and detect the desired target solely based on human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209…
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper presents a clear motivation for 3D intention grounding and includes clear illustrations of the dataset collection procedure. 2. Each component of IntentNet is soundly designed, and each is thoroughly ablated. 3. Extensive experiments and discussions demonstrate the effectiveness of the proposed framework compared to different types of baselines.
Major concern: I am concerned about a possibly unfair baseline comparison in the experiment section. Most baselines are designed to handle noun-type queries rather than human-intention queries. What if we pass the question to a fine-tuned LLM and let it infer which nouns/objects the question targets, choosing from the objects in the scene detected by existing 3D object detectors? The performance of these baselines might be much higher after it is given the object
1: A new task in 3D object detection employing RGB-D, based on human intention, facilitates smoother and more natural communication between humans and intelligent agents. 2: The authors propose a high-quality vision-language dataset that focuses on human intention for 3D object detection, which will facilitate progress in 3D scene understanding.
1: There have been a few methods that combine 3D scene understanding with LLMs beyond Chat3D v2, such as LL3DA, Grounded 3D-LLM, and ReGround3D, among others. The paper does not highlight its advantages over them. 2: The object selection method is too crude, as it removes some objects commonly used by humans when filtering Non-trivial Objects. Figures 3 (d) and (e) indicate that the dataset lacks sufficient diversity in the types of objects included. 3: The limited variety of object category inclu
1. The overall quality of the paper is high, with clear writing and easy-to-understand presentation. 2. The contribution of the dataset is significant, as it is the first to construct a 3D detection task focused on intention understanding.
1. More comparisons with recent works should be provided in Tables 1 and 2. Additionally, there is a minor mistake: the detector names “GroupFree” and “Group-Free” in the first two rows of Tables 1 and 2 do not match. 2. The article gives only a subtractive ablation experiment. I would like to see an additive ablation experiment, such as how well the verb alone works. 3. The article does not report the performance of the proposed IntentNet on traditional 3D grounding.
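The two-stage baseline raised in the first review's major concern (run an existing 3D detector, then let a language model choose which detected class matches the intention) can be sketched roughly as follows. This is a minimal illustration, not the paper's method or any reviewer's implementation: the `Detection` type, `toy_reasoner`, `ground_intention`, the hint table, and the score threshold are all hypothetical names and values, and the keyword lookup merely stands in for a fine-tuned LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float, float, float]  # (cx, cy, cz, dx, dy, dz)


@dataclass
class Detection:
    """One output of an off-the-shelf 3D detector (hypothetical format)."""
    label: str
    box: Box
    score: float


# Hypothetical hint table; a real system would query a language model instead.
AFFORDANCE_HINTS = {
    "support my back": ["pillow"],
    "sit down": ["chair", "couch"],
}


def toy_reasoner(intention: str, candidate_labels: List[str]) -> List[str]:
    """Stand-in for the LLM step: pick candidate classes that fit the intention."""
    text = intention.lower()
    picked: List[str] = []
    for phrase, labels in AFFORDANCE_HINTS.items():
        if phrase in text:
            picked.extend(l for l in labels if l in candidate_labels)
    return picked


def ground_intention(
    intention: str,
    detections: List[Detection],
    reasoner: Callable[[str, List[str]], List[str]] = toy_reasoner,
    min_score: float = 0.5,  # assumed confidence cut-off
) -> List[Detection]:
    """Stage 1: keep confident detections. Stage 2: filter by reasoned classes."""
    confident = [d for d in detections if d.score >= min_score]
    candidates = sorted({d.label for d in confident})
    chosen = set(reasoner(intention, candidates))
    return [d for d in confident if d.label in chosen]
```

Calling `ground_intention("I want something to support my back", detections)` would return only the confident detections whose class the reasoner selects. Because the reasoner sees class names rather than the scene itself, a pipeline of this shape can only be as good as the detector's label set.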
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Neural Network Applications · Visual Attention and Saliency Detection
