AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
Hao Guo, Wei Fan, Baichun Wei, Jianfei Zhu, Jin Tian, Chunzhi Yi, Feng Jiang

TL;DR
This paper presents Attention-Dynamic DINO, a framework that improves embodied reference understanding by integrating visual and textual cues and by modeling pointing gestures in a distance-aware way. It surpasses human performance in object localization at the 0.75 IoU threshold.
Contribution
The paper introduces a distance-aware, attention-dynamic approach that enhances gesture-based referent prediction, extending the virtual touch line mechanism and achieving state-of-the-art results.
Findings
Achieves 76.4% accuracy at the 0.25 IoU threshold.
Surpasses human performance at the 0.75 IoU threshold.
Outperforms previous distance-unaware methods across contexts.
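The accuracy figures above are reported at IoU thresholds. As a minimal sketch of how such a metric is commonly computed, assuming boxes in `(x1, y1, x2, y2)` format and a prediction counted as correct when its IoU with the ground-truth box is at least the threshold (the function names `iou` and `accuracy_at_iou` and the `>=` convention are illustrative assumptions, not taken from the paper):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (assumed convention)."""
    hits = sum(
        iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Toy data: one perfect prediction, one prediction shifted by half a box.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
loose = accuracy_at_iou(preds, gts, 0.25)
strict = accuracy_at_iou(preds, gts, 0.75)
```

On this toy data the shifted box has an IoU of 1/3, so it counts as correct at the 0.25 threshold but not at 0.75, which is why the stricter threshold is the harder benchmark.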
Abstract
Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line to represent referring gestures based on interactive distances. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the gesture represented…
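The abstract describes aligning candidate objects with a line that starts at a predicted attention source. As a rough geometric sketch only, assuming 2D image coordinates, a line that runs from the attention source through the fingertip, and a cosine score as the alignment measure (the function `touch_line_alignment`, its arguments, and the cosine choice are all hypothetical; the abstract does not give the paper's actual formulation):

```python
import math


def touch_line_alignment(source, fingertip, target):
    """Cosine of the angle between the source->fingertip direction and
    the source->target direction, for 2D points given as (x, y).

    1.0 means the target lies exactly along the pointing line;
    values near 0 or below mean it lies off to the side or behind.
    This is an illustrative score, not the paper's loss or metric.
    """
    gx, gy = fingertip[0] - source[0], fingertip[1] - source[1]
    tx, ty = target[0] - source[0], target[1] - source[1]
    norm = math.hypot(gx, gy) * math.hypot(tx, ty)
    if norm == 0:
        # Degenerate case: a zero-length line or a target at the source.
        return 0.0
    return (gx * tx + gy * ty) / norm


# Hypothetical keypoints: the same fingertip with two candidate objects.
source = (0.0, 0.0)
fingertip = (1.0, 1.0)
on_line = touch_line_alignment(source, fingertip, (3.0, 3.0))
off_line = touch_line_alignment(source, fingertip, (3.0, -3.0))
```

Under this sketch, changing the attention source changes the direction of the line and therefore which candidate scores highest, which is one way to see why predicting the source independently could matter.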
Taxonomy
Topics: Human Pose and Action Recognition · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Vision Transformer · self-DIstillation with NO labels
