CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

TL;DR
This paper introduces CAPE, an ensemble framework that combines two pointing-direction models with CLIP features to improve embodied reference understanding by making fuller use of pointing cues and scene context.
Contribution
It proposes a dual-model approach with a CLIP-aware ensemble and Gaussian ray heatmaps to better interpret pointing gestures for object localization.
Findings
Achieves 75.0 mAP on the YouRefIt benchmark.
Sets new state-of-the-art CLIP and C_D scores.
Demonstrates robustness on unseen datasets.
Abstract
We address Embodied Reference Understanding, the task of predicting the object a person in the scene refers to through a pointing gesture and language. This requires multimodal reasoning over text, visual pointing cues, and scene context, yet existing methods often fail to fully exploit visual disambiguation signals. We also observe that while the referent often aligns with the head-to-fingertip direction, in many cases it aligns more closely with the wrist-to-fingertip direction, making a single-line assumption overly limiting. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To fuse their…
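The Gaussian ray heatmap mentioned in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function name `gaussian_ray_heatmap`, the `sigma` parameter, the pixel coordinates, and the choice to start the ray at the head or wrist keypoint are all hypothetical. Each pixel's value is a Gaussian of its distance to the ray that starts at an origin keypoint and passes through the fingertip.

```python
import math

def gaussian_ray_heatmap(origin, fingertip, width, height, sigma=5.0):
    """Return a height x width grid whose values peak along the ray that
    starts at `origin` and passes through `fingertip` (hypothetical helper)."""
    ox, oy = origin
    dx, dy = fingertip[0] - ox, fingertip[1] - oy
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("origin and fingertip must be distinct points")
    ux, uy = dx / norm, dy / norm  # unit direction of the pointing ray

    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            px, py = x - ox, y - oy
            # Project the pixel onto the ray; clamp at 0 so the ray
            # does not extend behind the origin keypoint.
            t = max(px * ux + py * uy, 0.0)
            dist_sq = (px - t * ux) ** 2 + (py - t * uy) ** 2
            row.append(math.exp(-dist_sq / (2 * sigma ** 2)))
        heatmap.append(row)
    return heatmap

# Two rays sharing one fingertip, mirroring the two directions in the abstract.
# All coordinates below are made up for illustration.
fingertip = (4, 4)
head_map = gaussian_ray_heatmap((0, 0), fingertip, 16, 16, sigma=1.5)
wrist_map = gaussian_ray_heatmap((2, 4), fingertip, 16, 16, sigma=1.5)
```

In this sketch the two heatmaps agree near the fingertip and diverge farther away, which is one way to see why a single-line assumption can miss the referent. How the paper feeds these maps to each model and fuses the two predictions is not covered here.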
