CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

Fevziye Irem Eyiokur; Dogucan Yaman; Hazım Kemal Ekenel; Alexander Waibel

arXiv:2507.21888·cs.CV·December 12, 2025


TL;DR

This paper introduces CAPE, an ensemble framework that combines two pointing-direction models with CLIP features to improve embodied reference understanding by making better use of pointing cues and scene context.

Contribution

It proposes a dual-model approach with a CLIP-aware ensemble and Gaussian ray heatmaps to better interpret pointing gestures for object localization.

Findings

01. Achieves 75.0 mAP on the YouRefIt benchmark.

02. Sets new state-of-the-art CLIP and C_D scores.

03. Demonstrates robustness on unseen datasets.

Abstract

We address Embodied Reference Understanding, the task of predicting the object that a person in the scene refers to through a pointing gesture and language. This requires multimodal reasoning over text, visual pointing cues, and scene context, yet existing methods often fail to fully exploit visual disambiguation signals. We also observe that while the referent often aligns with the head-to-fingertip direction, in many cases it aligns more closely with the wrist-to-fingertip direction, making a single-line assumption overly limiting. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We introduce a Gaussian ray heatmap representation of these lines and use the heatmaps as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To fuse their…
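
The Gaussian ray heatmap idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaussian_ray_heatmap`, the `sigma` value, the example keypoint coordinates, and the choice to start the ray at the head or wrist and extend it without limit past the fingertip are all assumptions made for illustration. Each pixel receives a value that decays as a Gaussian of its distance to the pointing ray.

```python
import numpy as np


def gaussian_ray_heatmap(origin, fingertip, height, width, sigma=10.0):
    """Heatmap in (0, 1] that peaks along the ray origin -> fingertip.

    origin, fingertip: (x, y) pixel coordinates (hypothetical keypoints).
    sigma: assumed Gaussian spread in pixels, not a value from the paper.
    """
    o = np.asarray(origin, dtype=np.float64)
    f = np.asarray(fingertip, dtype=np.float64)
    direction = f - o
    length = np.linalg.norm(direction)
    if length == 0:
        raise ValueError("origin and fingertip must be different points")
    direction /= length

    # Pixel grid as (H, W, 2) array of (x, y) coordinates.
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).astype(np.float64)

    # Project each pixel onto the ray; clamp at 0 so pixels "behind"
    # the origin measure their distance to the origin itself.
    t = np.clip((pixels - o) @ direction, 0.0, None)
    closest = o + t[..., None] * direction

    dist_sq = ((pixels - closest) ** 2).sum(axis=-1)
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


# Hypothetical keypoints: one heatmap per pointing line, as in the dual-model setup.
head, wrist, fingertip = (20, 10), (40, 40), (50, 50)
head_to_fingertip = gaussian_ray_heatmap(head, fingertip, 100, 100)
wrist_to_fingertip = gaussian_ray_heatmap(wrist, fingertip, 100, 100)
```

Arrays are indexed as `heatmap[y, x]`. In this sketch the two heatmaps agree at the fingertip but diverge further along the rays, which is the situation where a single-line assumption would be limiting.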
