CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion

Mouad Abrini; Mohamed Chetouani

arXiv:2602.08999·cs.RO·February 10, 2026

TL;DR

CLUE introduces a novel approach that transforms the internal attention mechanisms of vision-language models into explicit signals for ambiguity detection, enabling robots to better interpret human intentions and decide when to ask for clarification during interaction.

Contribution

The paper presents a method to convert cross-modal attention into an explicit ambiguity detection signal, improving interactive visual grounding in human-robot interaction.

Findings

1. Outperforms state-of-the-art methods with InViG supervision
2. Ambiguity detector surpasses prior baselines
3. Uses parameter-efficient fine-tuning with LoRA

Abstract

With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA-fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG, and on a mixed ambiguity set for the detector. With InViG-only supervision,…
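To make the "attention as an ambiguity signal" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the text-to-image attention is available as a nested list `attn[head][text_token][patch]`, averages it into a spatial map, and then uses a hand-written region count as a stand-in for the paper's learned lightweight CNN. All function names, the input layout, and the 0.5 threshold are hypothetical choices made for illustration; the listing does not say which layers, heads, or tokens CLUE actually uses.

```python
from collections import deque


def aggregate_attention(attn, token_ids, grid_h, grid_w):
    """Average text-to-image attention over heads and the chosen text tokens.

    attn[h][t][p] is assumed to be the weight from text token t to image
    patch p in head h. Returns a grid_h x grid_w map that sums to 1.
    """
    n_patches = grid_h * grid_w
    flat = [0.0] * n_patches
    for head in attn:
        for t in token_ids:
            row = head[t]
            if len(row) != n_patches:
                raise ValueError("patch count does not match grid size")
            for p, weight in enumerate(row):
                flat[p] += weight
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention mass is zero")
    flat = [v / total for v in flat]
    return [flat[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def count_hot_regions(grid, rel_threshold=0.5):
    """Count 4-connected regions whose cells reach rel_threshold * max."""
    h, w = len(grid), len(grid[0])
    cutoff = rel_threshold * max(max(row) for row in grid)
    seen = [[False] * w for _ in range(h)]
    regions = 0
    for r in range(h):
        for c in range(w):
            if seen[r][c] or grid[r][c] < cutoff:
                continue
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # flood-fill the current region
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and grid[ny][nx] >= cutoff):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


def should_ask(attn, token_ids, grid_h, grid_w):
    """Heuristic stand-in for the learned detector: ask if the attention
    map has two or more separate high-attention regions."""
    grid = aggregate_attention(attn, token_ids, grid_h, grid_w)
    return count_hot_regions(grid) >= 2
```

For example, on a 3x3 patch grid with a single head and a single text token, a map concentrated on the centre patch gives one hot region (no question needed), while a map split between two opposite corners gives two regions and would trigger a clarification question. A trained CNN over the same maps can pick up cues that this region count cannot, such as the size and shape of each region.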

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems