AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

Hao Guo; Wei Fan; Baichun Wei; Jianfei Zhu; Jin Tian; Chunzhi Yi; Feng Jiang

arXiv:2411.08451·cs.CV·November 14, 2024

TL;DR

This paper presents Attention-Dynamic DINO, a framework for embodied reference understanding that integrates visual and textual cues and models pointing gestures in a distance-aware way. It reports surpassing human performance in object localization at the 0.75 IoU threshold.

Contribution

The paper introduces a distance-aware, attention-dynamic approach that enhances gesture-based referent prediction, extending the virtual touch line mechanism and achieving state-of-the-art results.

Findings

01. Achieves 76.4% accuracy at the 0.25 IoU threshold.

02. Surpasses human performance at the 0.75 IoU threshold.

03. Outperforms previous distance-unaware methods across interaction contexts.
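The findings above are stated as accuracy at an IoU threshold. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code), the snippet below assumes boxes in `(x1, y1, x2, y2)` corner format and counts a prediction as correct when its IoU with the ground-truth box is at least the threshold. The function names `iou` and `accuracy_at_iou` and the `>=` comparison are illustrative assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], targets: Sequence[Box],
                    threshold: float) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold.

    Assumption: a hit is IoU >= threshold; some benchmarks use a strict '>'.
    """
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical boxes, for illustration only.
    preds = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    targets = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    for thr in (0.25, 0.5, 0.75):
        print(thr, accuracy_at_iou(preds, targets, thr))
```

A higher threshold demands a tighter box, so the 0.75 figure is a stricter test of localization than the 0.25 figure.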

Abstract

Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces the Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line to represent referring gestures based on interactive distances. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the gesture represented…

Taxonomy

Topics: Human Pose and Action Recognition · Topic Modeling · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Softmax · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Vision Transformer · self-DIstillation with NO labels